
LLM Evaluation Guide - From Benchmarks to Building Your Own Evaluation System

A comprehensive guide covering LLM evaluation concepts, major benchmarks (MMLU, HumanEval, MT-Bench), automated evaluation (LLM-as-Judge), RAG evaluation (RAGAS), building custom evaluation systems, and A/B testing strategies.

Data Dynamics | April 16, 2026 | 6 min read

Objectively measuring and comparing LLM performance is essential for model selection, fine-tuning validation, and production quality management. This post covers LLM evaluation from basic concepts through building a custom evaluation system.

1. Why LLM Evaluation Matters

When Evaluation Is Needed

Situation	Purpose	Method
Model selection	Compare GPT-4 vs Claude vs LLaMA	Benchmarks, domain task evaluation
Fine-Tuning	Compare pre/post training performance	Task-specific accuracy, format compliance
RAG pipeline	Measure retrieval + generation quality	RAGAS, Recall@K
Prompt optimization	Compare prompt variants	A/B testing, LLM-as-Judge
Production monitoring	Continuous service quality management	User feedback, automated evaluation

Challenges

Non-determinism: The same input can produce different outputs when temperature > 0
Subjectivity: Ambiguous criteria for "good answers"
Multi-dimensional: Accuracy, fluency, usefulness, safety — many axes
Domain dependence: Gap between generic benchmarks and real performance
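
One common way to cope with non-determinism is to score the same input several times and report the spread instead of a single number. The sketch below assumes a hypothetical `score_fn` callable that runs the model once and returns a numeric score; the name and the default of five runs are illustrative choices, not part of any library.

```python
import statistics
from typing import Callable

def repeated_score(score_fn: Callable[[], float], runs: int = 5) -> dict:
    """Call a possibly non-deterministic scoring function several times and summarize."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    scores = [score_fn() for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two observations
        "stdev": statistics.stdev(scores) if runs > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```

A large standard deviation relative to the gap between two models suggests that the comparison needs more runs before it can be trusted.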

2. Major Benchmarks

General Benchmarks

Benchmark	Evaluates	Format	Questions	Features
MMLU	General knowledge	4-choice	15,908	57 subjects, most widely used
MMLU-Pro	General (hard)	10-choice	12,032	Upgraded MMLU, requires CoT
HellaSwag	Commonsense	4-choice	10,042	Sentence completion
ARC-Challenge	Science reasoning	4-choice	1,172	Grade school science
TruthfulQA	Factuality	Generative	817	Hallucination evaluation
GSM8K	Math	Generative	1,319	Grade school math
MATH	Math (hard)	Generative	5,000	Competition-level math
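
For the multiple-choice benchmarks in this table, the headline score is plain accuracy, and it helps to read it against the random-guess baseline (25% for 4-choice, 10% for 10-choice). A minimal sketch, assuming predictions and gold answers are already reduced to option letters; the function names are hypothetical.

```python
def multiple_choice_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predicted option letters that match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)

def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_choices
```

Extracting the option letter from free-form model output is usually the fragile part in practice and is left out of this sketch.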

Code Benchmarks

Benchmark	Evaluates	Languages	Problems
HumanEval	Function generation	Python	164
MBPP	Basic programming	Python	500
SWE-bench	Real issue resolution	Python	2,294
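
HumanEval results are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. Chen et al. (2021) estimate it from n generated samples of which c pass, as 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator for a single problem; the function name is our own.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: sampling budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k must contain a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems.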

Conversation Benchmarks

Benchmark	Evaluates	Method
MT-Bench	Multi-turn conversation	GPT-4 scores 1-10
AlpacaEval	Instruction following	GPT-4 win rate
Chatbot Arena	User preference	Elo rating (blind)
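
To make the rating column concrete, here is a sketch of the standard Elo update applied to one blind head-to-head vote. It illustrates the textbook formula only; the K-factor of 32 is an assumed value, and this is not a reproduction of how any particular leaderboard computes its published ratings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```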

3. Automated Evaluation: LLM-as-Judge

Concept

Use a powerful LLM (GPT-4, Claude) as an evaluator to automatically score other models' response quality.

import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def llm_judge(question: str, response: str, criteria: list[str]) -> dict:
    """Score a response on each criterion using an LLM as the evaluator."""
    criteria_lines = "\n".join(f"- {c}" for c in criteria)
    eval_prompt = f"""Evaluate the following response.

## Question
{question}

## Response
{response}

## Criteria (1-5 points each)
{criteria_lines}

## Scoring Rules
- 1: Very poor, 2: Poor, 3: Average, 4: Good, 5: Excellent

Return only JSON: {{"scores": {{"criterion": score}}, "average": avg, "reasoning": "rationale", "improvements": "suggestions"}}
"""
    result = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024, temperature=0.0,
        messages=[{"role": "user", "content": eval_prompt}]
    )
    # json.loads raises JSONDecodeError if the judge adds text around the JSON
    return json.loads(result.content[0].text)

Pairwise Comparison

def pairwise_compare(question: str, response_a: str, response_b: str) -> dict:
    """Select the better of two responses; returns {"winner": ..., "reason": ...}"""
    prompt = f"""Compare two responses and select the better one.
 
Question: {question}
Response A: {response_a}
Response B: {response_b}
 
Return: {{"winner": "A" or "B" or "tie", "reason": "rationale"}}"""
    result = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(result.content[0].text)

LLM-as-Judge Limitations

Limitation	Description	Mitigation
Position bias	Tends to prefer first response	Evaluate twice with swapped order
Length bias	Tends to score longer responses higher	Add "conciseness" criterion
Self-bias	Prefers responses from same model	Use different model as judge
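
The swapped-order mitigation for position bias can be sketched as a thin wrapper around any pairwise judge. The helper below is hypothetical: it takes a `compare_fn` with the same signature as `pairwise_compare` above, and treating disagreement between the two orderings as a tie is one possible policy, chosen here as an assumption.

```python
from typing import Callable

def debiased_compare(
    compare_fn: Callable[[str, str, str], dict],
    question: str,
    response_a: str,
    response_b: str,
) -> str:
    """Judge twice with the response order swapped; keep only consistent verdicts."""
    first = compare_fn(question, response_a, response_b)["winner"]
    second = compare_fn(question, response_b, response_a)["winner"]
    # In the second call the labels are reversed, so map the verdict back
    flipped = {"A": "B", "B": "A", "tie": "tie"}.get(second)
    if first in ("A", "B", "tie") and first == flipped:
        return first
    # Verdict changed with position (or was malformed): call it a tie
    return "tie"
```

A judge that always picks the first position ends up with "tie" under this wrapper, which is the intended effect.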

4. RAG Evaluation: RAGAS

RAGAS Framework

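# Written against the ragas 0.1-style API; newer releases may use different
# column and metric names. evaluate() calls an LLM and an embedding model
# internally, so API credentials for them must be configured first.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset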
 
eval_data = {
    "question": ["How to adjust Spark shuffle partitions?"],
    "answer": ["Change the spark.sql.shuffle.partitions setting..."],
    "contexts": [["Spark shuffle partitions are set via spark.sql.shuffle.partitions..."]],
    "ground_truth": ["Adjust shuffle partition count via spark.sql.shuffle.partitions."]
}
 
dataset = Dataset.from_dict(eval_data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)

Metric Interpretation

Metric	Meaning	If Low	Improvement
Faithfulness	Response grounded in context?	Hallucination	Emphasize "use context only" in prompt
Answer Relevancy	Response addresses question?	Off-topic answers	Improve prompt, retrieval quality
Context Precision	Retrieved results relevant?	Noisy documents	Add re-ranking, metadata filters
Context Recall	Found all needed info?	Missing information	Increase K, hybrid search
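
When labeled relevant documents are available, the retrieval side can also be checked without an LLM. The sketch below computes Recall@K (mentioned in section 1) and Precision@K from document IDs. These are label-based counterparts to the context recall and context precision ideas, not the RAGAS implementations, and the function names are illustrative.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the top-k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)
```

Raising K tends to lift recall while lowering precision, which matches the "Increase K" and "Add re-ranking" remedies in the table.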

5. Building Custom Evaluation Systems

Evaluation Dataset Design

eval_dataset = [
    {
        "id": "eval-001",
        "category": "troubleshooting",
        "difficulty": "medium",
        "question": "Spark executor OOM error. What are the causes and solutions?",
        "reference_answer": "Increase executor.memory, increase partitions, or...",
        "required_elements": ["executor.memory", "partitions", "data skew"],
        "evaluation_criteria": {
            "accuracy": "Technically accurate root cause analysis",
            "completeness": "Includes at least 3 major causes",
            "actionability": "Includes specific config values or code",
            "format": "Structured format (numbered list, etc.)"
        }
    }
]
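
Because every later step depends on these fields being present, it can be worth validating the dataset before running an evaluation. The validator below is a hypothetical helper; the set of required keys is an assumption taken from the example item above.

```python
REQUIRED_KEYS = {
    "id", "category", "question", "reference_answer",
    "required_elements", "evaluation_criteria",
}

def validate_eval_dataset(dataset: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the dataset looks usable."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(dataset):
        label = item.get("id", f"<item {index}>")
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        item_id = item.get("id")
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(item_id)
        # An empty list would make element coverage meaningless for this item
        if "required_elements" in item and not item["required_elements"]:
            problems.append(f"{label}: required_elements is empty")
    return problems
```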

Automated Evaluation Pipeline

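import json

import anthropic

class EvaluationPipeline:
    def __init__(self, judge_model="claude-sonnet-4-6"):
        self.judge = anthropic.Anthropic()
        self.judge_model = judge_model
        self.results = []

    def evaluate_batch(self, model_fn, eval_dataset):
        """Run model_fn on every item and collect rule-based and judge scores."""
        for item in eval_dataset:
            response = model_fn(item["question"])
            auto_scores = self.auto_evaluate(item, response)
            judge_scores = self.judge_evaluate(item, response)
            self.results.append({
                "id": item["id"], "category": item["category"],
                "auto_scores": auto_scores, "judge_scores": judge_scores
            })
        return self.aggregate_results()

    def auto_evaluate(self, item, response):
        """Rule-based check: fraction of required elements mentioned in the response."""
        required = item.get("required_elements", [])
        found = sum(1 for e in required if e.lower() in response.lower())
        return {"element_coverage": found / max(len(required), 1)}

    def judge_evaluate(self, item, response):
        """Minimal LLM-as-Judge sketch that scores against the item's evaluation_criteria."""
        criteria = "\n".join(
            f"- {name}: {desc}" for name, desc in item["evaluation_criteria"].items()
        )
        prompt = (
            "Evaluate the response on each criterion (1-5 points).\n\n"
            f"## Question\n{item['question']}\n\n## Response\n{response}\n\n"
            f"## Criteria\n{criteria}\n\n"
            'Return only JSON: {"scores": {"criterion": score}, "average": avg}'
        )
        result = self.judge.messages.create(
            model=self.judge_model, max_tokens=1024, temperature=0.0,
            messages=[{"role": "user", "content": prompt}]
        )
        return json.loads(result.content[0].text)

    def aggregate_results(self):
        if not self.results:  # avoid dividing by zero on an empty run
            return {"overall_average": None, "total_evaluated": 0}
        return {
            "overall_average": sum(r["judge_scores"]["average"] for r in self.results) / len(self.results),
            "total_evaluated": len(self.results)
        }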
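
Each result record stores a `category`, so a per-category breakdown is a natural extension of the overall average. The standalone helper below is a sketch of that idea; it assumes records shaped like the ones appended in `evaluate_batch`, and the function name is our own.

```python
from collections import defaultdict

def aggregate_by_category(results: list[dict]) -> dict:
    """Average judge score and element coverage per category."""
    buckets = defaultdict(list)
    for record in results:
        buckets[record["category"]].append(record)

    summary = {}
    for category, rows in buckets.items():
        summary[category] = {
            "count": len(rows),
            "judge_average": sum(r["judge_scores"]["average"] for r in rows) / len(rows),
            "element_coverage": sum(r["auto_scores"]["element_coverage"] for r in rows) / len(rows),
        }
    return summary
```

A category whose judge average is well below the overall average is a good place to start collecting more evaluation items.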

6. A/B Testing and Production Evaluation

Online A/B Testing

import hashlib

class ABTestManager:
    def __init__(self):
        self.experiments = {}

    def create_experiment(self, name, variants, traffic_split):
        # traffic_split maps variant name -> ratio; ratios should sum to 1.0
        self.experiments[name] = {
            "variants": variants, "traffic_split": traffic_split,
            "results": {k: [] for k in variants}
        }

    def get_variant(self, experiment_name, user_id):
        # Built-in hash() is salted per process for strings, so use a stable
        # digest to keep each user in the same variant across restarts.
        digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
        hash_val = int(digest, 16) % 100
        cumulative = 0
        for variant, ratio in self.experiments[experiment_name]["traffic_split"].items():
            cumulative += ratio * 100
            if hash_val < cumulative:
                return variant
        # Fallback in case the ratios sum to slightly less than 1.0
        return list(self.experiments[experiment_name]["variants"])[-1]

    def record_feedback(self, experiment_name, variant, score):
        self.experiments[experiment_name]["results"][variant].append(score)
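
Once feedback has been recorded, the variants still need to be compared in a way that accounts for sample size. As one option, the sketch below runs a two-sided two-proportion z-test, assuming the recorded scores are binary (1 for thumbs up, 0 for thumbs down). The function name is hypothetical, and other tests may suit non-binary scores better.

```python
import math

def two_proportion_z_test(
    successes_a: int, total_a: int, successes_b: int, total_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both variants need at least one observation")
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # All observations identical: no evidence of a difference
        return 0.0, 1.0
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the normal distribution: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With the lists stored by `record_feedback`, the inputs would be `sum(scores)` and `len(scores)` for each variant.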

Production Monitoring Metrics

Metric	Description	Collection	Target
User satisfaction	Thumbs up/down ratio	Feedback buttons	> 80% positive
Task completion	Did user get desired result	Follow-up analysis	> 70%
Response latency	TTFT, total response time	Server metrics	P95 < 3s
Hallucination rate	Factually incorrect ratio	Auto verification + human	< 5%
Cost	Token cost, daily cost	API billing	Within budget
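
Two of these targets are easy to check automatically from raw logs. The sketch below computes a nearest-rank P95 latency and the positive feedback ratio, then compares them with the thresholds from the table. The function names and input shapes are assumptions for illustration.

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_targets(latencies_s: list[float], thumbs_up: int, thumbs_down: int) -> dict:
    """Compare observed metrics with the targets: P95 < 3s and > 80% positive."""
    total_votes = thumbs_up + thumbs_down
    positive_ratio = thumbs_up / total_votes if total_votes else 0.0
    p95 = percentile(latencies_s, 95)
    return {
        "p95_latency_s": p95,
        "latency_ok": p95 < 3.0,
        "positive_ratio": positive_ratio,
        "satisfaction_ok": positive_ratio > 0.80,
    }
```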

Note: LLM evaluation is a process of "continuous improvement," not "one-time perfection." Collect failure cases from production data and add them to your evaluation dataset as a feedback loop.

References

Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding (MMLU)." ICLR
Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code (HumanEval)." arXiv
Zheng, L. et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS
Es, S. et al. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL
LMSYS Chatbot Arena — https://chat.lmsys.org/

— Data Dynamics Engineering Team