LLM Evaluation Guide - From Benchmarks to Building Your Own Evaluation System
A comprehensive guide covering LLM evaluation concepts, major benchmarks (MMLU, HumanEval, MT-Bench), automated evaluation (LLM-as-Judge), RAG evaluation (RAGAS), building custom evaluation systems, and A/B testing strategies.
Data Dynamics · April 16, 2026 · 6 min read
Objectively measuring and comparing LLM performance is essential for model selection, fine-tuning validation, and production quality management. This post covers LLM evaluation step by step, from basic concepts to building a custom evaluation system.
1. Why LLM Evaluation Matters
When Evaluation Is Needed
| Situation | Purpose | Method |
|---|---|---|
| Model selection | Compare GPT-4 vs Claude vs LLaMA | Benchmarks, domain task evaluation |
| Fine-Tuning | Compare pre/post training performance | Task-specific accuracy, format compliance |
| RAG pipeline | Measure retrieval + generation quality | RAGAS, Recall@K |
| Prompt optimization | Compare prompt variants | A/B testing, LLM-as-Judge |
| Production monitoring | Continuous service quality management | User feedback, automated evaluation |
Challenges
Non-determinism: The same input can produce different outputs when temperature > 0
Subjectivity: The criteria for a "good answer" are often ambiguous
Multi-dimensional: Quality spans many axes, including accuracy, fluency, usefulness, and safety
Domain dependence: Scores on generic benchmarks can differ widely from performance on your own tasks
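One common way to handle non-determinism is to score several samples per prompt and report the spread alongside the mean. The snippet below is a minimal sketch of that idea; `summarize_runs` is a hypothetical helper and the scores are made-up numbers, not measurements.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> dict:
    """Summarize repeated scores for one prompt (hypothetical helper)."""
    if not scores:
        raise ValueError("need at least one score")
    # stdev needs at least two samples; report 0.0 for a single run
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {
        "n": len(scores),
        "mean": mean(scores),
        "stdev": spread,
        "min": min(scores),
        "max": max(scores),
    }


# Five scores for the same prompt sampled at temperature > 0 (illustrative values)
summary = summarize_runs([4.0, 3.0, 4.0, 5.0, 4.0])
print(summary)
```

A large standard deviation relative to the gap between two models suggests that more samples per prompt are needed before drawing a conclusion.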
2. Major Benchmarks
General Benchmarks
| Benchmark | Evaluates | Format | Questions | Features |
|---|---|---|---|---|
| MMLU | General knowledge | 4-choice | 15,908 | 57 subjects, most widely used |
| MMLU-Pro | General (hard) | 10-choice | 12,032 | Upgraded MMLU, requires CoT |
| HellaSwag | Commonsense | 4-choice | 10,042 | Sentence completion |
| ARC-Challenge | Science reasoning | 4-choice | 1,172 | Grade school science |
| TruthfulQA | Factuality | Generative | 817 | Hallucination evaluation |
| GSM8K | Math | Generative | 1,319 | Grade school math |
| MATH | Math (hard) | Generative | 5,000 | Competition-level math |
Code Benchmarks
| Benchmark | Evaluates | Languages | Problems |
|---|---|---|---|
| HumanEval | Function generation | Python | 164 |
| MBPP | Basic programming | Python | 500 |
| SWE-bench | Real issue resolution | Python (12 repositories) | 2,294 |
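Code benchmarks such as HumanEval are typically reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The sketch below implements the unbiased estimator from the HumanEval paper, 1 - C(n-c, k) / C(n, k), where n samples were generated per problem and c of them passed. The function name and the example numbers are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget evaluated.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples for a problem, 3 of which pass
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
print(round(p1, 4), round(p5, 4))  # 0.3 0.9167
```

The benchmark score is the mean of this value over all problems.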
Conversation Benchmarks
| Benchmark | Evaluates | Method |
|---|---|---|
| MT-Bench | Multi-turn conversation | GPT-4 scores 1-10 |
| AlpacaEval | Instruction following | GPT-4 win rate |
| Chatbot Arena | User preference | Elo rating (blind) |
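To make the rating-based approach concrete, here is a sketch of the textbook Elo update applied to a single blind comparison. It illustrates the general mechanism only and is not a reproduction of how any particular leaderboard computes its scores; the K-factor of 32 and the starting ratings are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated ratings; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (B wins)."""
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses
    return rating_a + delta, rating_b - delta


# Two models start level; a user prefers model A in one blind comparison
new_a, new_b = update_elo(1000, 1000, score_a=1.0)
print(new_a, new_b)  # 1016.0 984.0
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected win, which is why rankings converge after enough votes.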
3. Automated Evaluation: LLM-as-Judge
Concept
Use a powerful LLM (GPT-4, Claude) as an evaluator to automatically score other models' response quality.
```python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def llm_judge(question: str, response: str, criteria: list) -> dict:
    """Use LLM as evaluator"""
    criteria_text = "\n".join(f"- {c}" for c in criteria)
    eval_prompt = f"""Evaluate the following response.

## Question
{question}

## Response
{response}

## Criteria (1-5 points each)
{criteria_text}

## Scoring Rules
- 1: Very poor, 2: Poor, 3: Average, 4: Good, 5: Excellent

Return JSON: {{"scores": {{"criterion": score}}, "average": avg, "reasoning": "rationale", "improvements": "suggestions"}}"""
    result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        temperature=0.0,  # keep judging as repeatable as possible
        messages=[{"role": "user", "content": eval_prompt}],
    )
    # Assumes the judge returns bare JSON; json.loads raises otherwise
    return json.loads(result.content[0].text)
```
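Calling `json.loads` directly on the judge's reply fails whenever the model wraps the JSON in extra prose. A more tolerant parser could look like the sketch below; `parse_judge_json` is a hypothetical helper, and it relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json


def parse_judge_json(text: str) -> dict:
    """Return the first JSON object embedded in a judge reply (hypothetical helper)."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores the rest
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in judge output")


reply = 'Sure! {"winner": "A", "reason": "more specific"} Hope that helps.'
print(parse_judge_json(reply))  # {'winner': 'A', 'reason': 'more specific'}
```

Raising on a missing object, rather than returning a default score, keeps parsing failures visible in the evaluation logs.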
Pairwise Comparison
```python
import json


def pairwise_compare(question: str, response_a: str, response_b: str) -> dict:
    """Select the better of two responses"""
    prompt = f"""Compare two responses and select the better one.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Return: {{"winner": "A" or "B" or "tie", "reason": "rationale"}}"""
    # Uses the same `client` (anthropic.Anthropic()) as llm_judge
    result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    # Returns the parsed JSON object, e.g. {"winner": "A", "reason": "..."}
    return json.loads(result.content[0].text)
```
LLM-as-Judge Limitations
| Limitation | Description | Mitigation |
|---|---|---|
| Position bias | Tends to prefer first response | Evaluate twice with swapped order |
| Length bias | Tends to score longer responses higher | Add "conciseness" criterion |
| Self-bias | Prefers responses from same model | Use different model as judge |
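The swapped-order mitigation for position bias can be sketched as a thin wrapper around any pairwise judge. The code below assumes a `compare` callable that returns `"A"`, `"B"`, or `"tie"` for the first and second response it is shown; that interface and the stub judges are assumptions made for illustration.

```python
from typing import Callable

# (question, first_response, second_response) -> "A", "B", or "tie"
Judge = Callable[[str, str, str], str]


def debiased_compare(question: str, response_a: str, response_b: str,
                     compare: Judge) -> str:
    """Judge twice with swapped order; keep the verdict only if both runs agree."""
    first = compare(question, response_a, response_b)   # A shown first
    second = compare(question, response_b, response_a)  # B shown first
    # Translate the second verdict back to the original labels
    flipped = {"A": "B", "B": "A", "tie": "tie"}.get(second, "tie")
    return first if first == flipped else "tie"


# Stub judges standing in for an LLM call
def always_first(question: str, first: str, second: str) -> str:
    return "A"  # maximally position-biased


def prefers_longer(question: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


print(debiased_compare("q", "short", "a much longer answer", always_first))    # tie
print(debiased_compare("q", "short", "a much longer answer", prefers_longer))  # B
```

Treating disagreement as a tie is conservative: a purely position-driven preference cancels out instead of counting as a win.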
4. RAG Evaluation: RAGAS
RAGAS Framework
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

eval_data = {
    "question": ["How to adjust Spark shuffle partitions?"],
    "answer": ["Change the spark.sql.shuffle.partitions setting..."],
    "contexts": [["Spark shuffle partitions are set via spark.sql.shuffle.partitions..."]],
    "ground_truth": ["Adjust shuffle partition count via spark.sql.shuffle.partitions."],
}

dataset = Dataset.from_dict(eval_data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
```
Metric Interpretation
| Metric | Meaning | If Low | Improvement |
|---|---|---|---|
| Faithfulness | Response grounded in context? | Hallucination | Emphasize "use context only" in prompt |
| Answer Relevancy | Response addresses question? | Off-topic answers | Improve prompt, retrieval quality |
| Context Precision | Retrieved results relevant? | Noisy documents | Add re-ranking, metadata filters |
| Context Recall | Found all needed info? | Missing information | Increase K, hybrid search |
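When relevance labels are available for the retriever, the effect of increasing K can be checked directly with Recall@K, independently of the LLM-based metrics. This is a small sketch of that retrieval-level metric, not of how RAGAS computes context recall; the document IDs are invented for the example.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k retrieved results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


retrieved = ["d3", "d7", "d1", "d9", "d4"]  # ranked retriever output (hypothetical IDs)
relevant = {"d1", "d4"}                     # documents labeled as needed for the answer

print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(recall_at_k(retrieved, relevant, k=5))  # 1.0
```

Here raising K from 3 to 5 recovers the second relevant document, at the cost of passing more (possibly noisy) context to the generator.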
5. Building Custom Evaluation Systems
Evaluation Dataset Design
```python
eval_dataset = [
    {
        "id": "eval-001",
        "category": "troubleshooting",
        "difficulty": "medium",
        "question": "Spark executor OOM error. What are the causes and solutions?",
        "reference_answer": "Increase executor.memory, increase partitions, or...",
        "required_elements": ["executor.memory", "partitions", "data skew"],
        "evaluation_criteria": {
            "accuracy": "Technically accurate root cause analysis",
            "completeness": "Includes at least 3 major causes",
            "actionability": "Includes specific config values or code",
            "format": "Structured format (numbered list, etc.)",
        },
    }
]
```
Automated Evaluation Pipeline
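```python
import json

import anthropic


class EvaluationPipeline:
    def __init__(self, judge_model="claude-sonnet-4-6"):
        self.judge = anthropic.Anthropic()
        self.judge_model = judge_model
        self.results = []

    def evaluate_batch(self, model_fn, eval_dataset):
        for item in eval_dataset:
            response = model_fn(item["question"])
            auto_scores = self.auto_evaluate(item, response)
            judge_scores = self.judge_evaluate(item, response)
            self.results.append({
                "id": item["id"],
                "category": item["category"],
                "auto_scores": auto_scores,
                "judge_scores": judge_scores,
            })
        return self.aggregate_results()

    def auto_evaluate(self, item, response):
        # Rule-based check: fraction of required elements mentioned in the response
        required = item.get("required_elements", [])
        found = sum(1 for e in required if e.lower() in response.lower())
        return {"element_coverage": found / max(len(required), 1)}

    def judge_evaluate(self, item, response):
        # Assumed implementation: score against item["evaluation_criteria"]
        # and expect bare JSON with an "average" field in the reply
        criteria = "\n".join(
            f"- {name}: {desc}" for name, desc in item["evaluation_criteria"].items()
        )
        prompt = (
            "Evaluate the response on each criterion (1-5 points each).\n\n"
            f"Question: {item['question']}\n\nResponse: {response}\n\n"
            f"Criteria:\n{criteria}\n\n"
            'Return JSON: {"scores": {"criterion": score}, "average": avg}'
        )
        result = self.judge.messages.create(
            model=self.judge_model,
            max_tokens=1024,
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
        )
        return json.loads(result.content[0].text)

    def aggregate_results(self):
        if not self.results:  # avoid dividing by zero on an empty run
            return {"overall_average": None, "total_evaluated": 0}
        return {
            "overall_average": sum(r["judge_scores"]["average"] for r in self.results) / len(self.results),
            "total_evaluated": len(self.results),
        }
```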
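Because each result records its `category`, a per-category breakdown is a natural extension of the overall average: it shows where a model is weak rather than only how it does on aggregate. The sketch below assumes result dictionaries shaped like the ones the pipeline stores; `aggregate_by_category` is a hypothetical helper and the sample results are made up.

```python
from collections import defaultdict


def aggregate_by_category(results: list[dict]) -> dict:
    """Average judge score and element coverage per category (hypothetical helper)."""
    buckets = defaultdict(list)
    for result in results:
        buckets[result["category"]].append(result)

    summary = {}
    for category, rows in buckets.items():
        summary[category] = {
            "count": len(rows),
            "judge_average": sum(r["judge_scores"]["average"] for r in rows) / len(rows),
            "element_coverage": sum(r["auto_scores"]["element_coverage"] for r in rows) / len(rows),
        }
    return summary


# Illustrative results in the same shape as EvaluationPipeline.results
sample_results = [
    {"id": "eval-001", "category": "troubleshooting",
     "auto_scores": {"element_coverage": 1.0}, "judge_scores": {"average": 4.0}},
    {"id": "eval-002", "category": "troubleshooting",
     "auto_scores": {"element_coverage": 0.5}, "judge_scores": {"average": 3.0}},
    {"id": "eval-003", "category": "tuning",
     "auto_scores": {"element_coverage": 1.0}, "judge_scores": {"average": 5.0}},
]

by_category = aggregate_by_category(sample_results)
print(by_category["troubleshooting"])
# {'count': 2, 'judge_average': 3.5, 'element_coverage': 0.75}
```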
6. A/B Testing and Production Evaluation
Online A/B Testing
```python
import hashlib


class ABTestManager:
    def __init__(self):
        self.experiments = {}

    def create_experiment(self, name, variants, traffic_split):
        # traffic_split maps variant name -> fraction of traffic (should sum to 1.0)
        self.experiments[name] = {
            "variants": variants,
            "traffic_split": traffic_split,
            "results": {k: [] for k in variants},
        }

    def get_variant(self, experiment_name, user_id):
        # Built-in hash() is salted per process for strings, so use a stable
        # digest to keep each user in the same variant across restarts
        digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
        hash_val = int(digest, 16) % 100
        experiment = self.experiments[experiment_name]
        cumulative = 0
        for variant, ratio in experiment["traffic_split"].items():
            cumulative += ratio * 100
            if hash_val < cumulative:
                return variant
        # Fallback if the split sums to less than 1.0 due to rounding
        return list(experiment["variants"])[-1]

    def record_feedback(self, experiment_name, variant, score):
        self.experiments[experiment_name]["results"][variant].append(score)
```
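Once feedback has been recorded for each variant, the difference still needs a significance check before one variant is declared the winner. The sketch below assumes binary feedback (thumbs up or down) and applies a standard two-proportion z-test using only the standard library; the function name and the counts are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in positive-feedback rates."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both variants need at least one observation")
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # identical all-positive or all-negative feedback
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Variant A: 400 of 500 thumbs up; variant B: 430 of 500 (made-up counts)
z, p_value = two_proportion_z_test(400, 500, 430, 500)
print(f"z={z:.2f}, p={p_value:.4f}")
```

With graded scores instead of binary feedback, a t-test or a bootstrap comparison of means would be the analogous choice.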
Production Monitoring Metrics
| Metric | Description | Collection | Target |
|---|---|---|---|
| User satisfaction | Thumbs up/down ratio | Feedback buttons | > 80% positive |
| Task completion | Did user get desired result | Follow-up analysis | > 70% |
| Response latency | TTFT, total response time | Server metrics | P95 < 3s |
| Hallucination rate | Factually incorrect ratio | Auto verification + human | < 5% |
| Cost | Token cost, daily cost | API billing | Within budget |
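Two of these targets, P95 latency and the positive-feedback ratio, are easy to check automatically from raw logs. The following sketch uses the nearest-rank definition of a percentile and the thresholds from the table; `percentile` and `check_targets` are hypothetical helpers and the sample data is invented.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_targets(latencies_s: list[float], thumbs_up: int, thumbs_total: int) -> dict:
    """Compare observed metrics with the P95 < 3s and > 80% positive targets."""
    p95 = percentile(latencies_s, 95)
    positive_ratio = thumbs_up / thumbs_total if thumbs_total else 0.0
    return {
        "latency_p95_s": p95,
        "latency_ok": p95 < 3.0,
        "positive_ratio": positive_ratio,
        "satisfaction_ok": positive_ratio > 0.80,
    }


# Ten total response times in seconds and 42 thumbs up out of 50 (illustrative)
latencies = [0.9, 1.2, 1.0, 2.8, 1.5, 0.7, 1.1, 3.4, 1.3, 1.0]
report = check_targets(latencies, thumbs_up=42, thumbs_total=50)
print(report)
```

In this sample the satisfaction target is met while the latency target is not, since a single 3.4 s response dominates the P95 of such a small window.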
Note: LLM evaluation is a process of "continuous improvement," not "one-time perfection." Collect failure cases from production data and add them to your evaluation dataset as a feedback loop.
References
Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding" (MMLU). ICLR
Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code" (HumanEval). arXiv
Zheng, L. et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS
Es, S. et al. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL