
LLM Evaluation Guide - From Benchmarks to Building Your Own Evaluation System

A comprehensive guide covering LLM evaluation concepts, major benchmarks (MMLU, HumanEval, MT-Bench), automated evaluation (LLM-as-Judge), RAG evaluation (RAGAS), building custom evaluation systems, and A/B testing strategies.

Data Dynamics | April 16, 2026 | 6 min read

Measuring and comparing LLM performance objectively is essential for model selection, fine-tuning validation, and production quality management. This post covers LLM evaluation step by step, from basic concepts to building a custom evaluation system.


1. Why LLM Evaluation Matters

When Evaluation Is Needed

| Situation | Purpose | Method |
|---|---|---|
| Model selection | Compare GPT-4 vs Claude vs LLaMA | Benchmarks, domain task evaluation |
| Fine-tuning | Compare pre/post training performance | Task-specific accuracy, format compliance |
| RAG pipeline | Measure retrieval + generation quality | RAGAS, Recall@K |
| Prompt optimization | Compare prompt variants | A/B testing, LLM-as-Judge |
| Production monitoring | Continuous service quality management | User feedback, automated evaluation |

Challenges

  • Non-determinism: with temperature > 0, the same input can yield different outputs, so a single run is a noisy estimate
  • Subjectivity: the criteria for a "good answer" are often ambiguous
  • Multi-dimensional: quality spans many axes, including accuracy, fluency, usefulness, and safety
  • Domain dependence: scores on generic benchmarks often diverge from performance on real tasks
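
One common way to deal with the non-determinism point above is to score the same item several times and report the spread rather than a single number. The sketch below is a minimal illustration using only the standard library; `repeated_score` and `score_once` are hypothetical names, and `score_once` stands in for whatever "generate, then score" call your setup uses.

```python
import statistics
from typing import Callable


def repeated_score(score_once: Callable[[], float], runs: int = 5) -> dict:
    """Call a scoring function several times and summarize the spread."""
    if runs < 2:
        raise ValueError("need at least 2 runs to estimate spread")
    scores = [score_once() for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }
```

A large standard deviation relative to the gap between two models suggests that the comparison needs more runs before drawing a conclusion.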

2. Major Benchmarks

General Benchmarks

| Benchmark | Evaluates | Format | Questions | Features |
|---|---|---|---|---|
| MMLU | General knowledge | 4-choice | 15,908 | 57 subjects, most widely used |
| MMLU-Pro | General (hard) | 10-choice | 12,032 | Upgraded MMLU, requires CoT |
| HellaSwag | Commonsense | 4-choice | 10,042 | Sentence completion |
| ARC-Challenge | Science reasoning | 4-choice | 1,172 | Grade school science |
| TruthfulQA | Factuality | Generative | 817 | Hallucination evaluation |
| GSM8K | Math | Generative | 1,319 | Grade school math |
| MATH | Math (hard) | Generative | 5,000 | Competition-level math |
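
For the multiple-choice benchmarks in this table, scoring usually comes down to extracting a choice letter from the model output and comparing it with the gold label. The following is a minimal sketch, not the official harness of any benchmark: `extract_choice` and `accuracy` are hypothetical helpers, and the regular expression assumes the model was prompted to finish with a phrase such as "The answer is B".

```python
import re
from typing import Optional


def extract_choice(output: str, letters: str = "ABCD") -> Optional[str]:
    """Find a choice letter after the word 'answer', e.g. 'The answer is (B)'."""
    pattern = rf"answer\s*(?:is|:)?\s*\(?([{letters}])\b"
    match = re.search(pattern, output, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None


def accuracy(outputs: list, gold: list) -> float:
    """Fraction of outputs whose extracted letter equals the gold label.

    Outputs with no parseable answer count as wrong.
    """
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)
```

For a 10-choice format such as MMLU-Pro, the same helper could be called with `letters="ABCDEFGHIJ"`. Answer extraction is a frequent source of score differences between harnesses, so it helps to log the unparseable outputs separately.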

Code Benchmarks

| Benchmark | Evaluates | Languages | Problems |
|---|---|---|---|
| HumanEval | Function generation | Python | 164 |
| MBPP | Basic programming | Python | 500 |
| SWE-bench | Real issue resolution | Python | 2,294 |
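
Code benchmarks such as HumanEval are typically reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The sketch below implements the unbiased estimator described in the HumanEval paper, 1 - C(n-c, k) / C(n, k), where n samples were generated per problem and c of them passed. The function name and argument checks are my own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems. For example, `pass_at_k(n=10, c=3, k=1)` is 0.3, which matches the plain pass rate when k is 1.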

Conversation Benchmarks

| Benchmark | Evaluates | Method |
|---|---|---|
| MT-Bench | Multi-turn conversation | GPT-4 scores 1-10 |
| AlpacaEval | Instruction following | GPT-4 win rate |
| Chatbot Arena | User preference | Elo rating (blind) |
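
To make the Elo rating in the last row concrete, here is the textbook Elo update for a single blind comparison. This is a sketch of the classic formula only; the K-factor of 32 is an assumed default, and a production leaderboard may fit ratings with a different statistical procedure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * (exp_a - score_a)  # zero-sum: B loses what A gains
    return new_a, new_b
```

With two models at 1000 each, a win moves the winner to 1016 and the loser to 984. Beating a much higher-rated model moves the ratings further than beating an equal one.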

3. Automated Evaluation: LLM-as-Judge

Concept

Use a powerful LLM (GPT-4, Claude) as an evaluator to automatically score other models' response quality.

import json

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()
 
def llm_judge(question: str, response: str, criteria: list) -> dict:
    """Use LLM as evaluator"""
    eval_prompt = f"""Evaluate the following response.
 
## Question
{question}
 
## Response
{response}
 
## Criteria (1-5 points each)
{chr(10).join(f"- {c}" for c in criteria)}
 
## Scoring Rules
- 1: Very poor, 2: Poor, 3: Average, 4: Good, 5: Excellent
 
Return JSON: {{"scores": {{"criterion": score}}, "average": avg, "reasoning": "rationale", "improvements": "suggestions"}}
"""
    result = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024, temperature=0.0,
        messages=[{"role": "user", "content": eval_prompt}]
    )
    return json.loads(result.content[0].text)
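
Calling `json.loads` directly on the judge output works only when the model returns bare JSON. In practice a judge sometimes wraps the object in extra prose or a code fence, so a more tolerant parser can help. The helper below is a hypothetical sketch (`parse_judge_json` is not part of any SDK) that falls back to the outermost `{...}` span.

```python
import json


def parse_judge_json(text: str) -> dict:
    """Parse judge output as JSON, tolerating surrounding prose or fences."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the span between the first "{" and the last "}".
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in judge output")
        parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("judge output is valid JSON but not an object")
    return parsed
```

In `llm_judge`, the final line could then become `return parse_judge_json(result.content[0].text)`. Outputs that still fail to parse are worth logging, since they often point to a prompt that needs tightening.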

Pairwise Comparison

def pairwise_compare(question: str, response_a: str, response_b: str) -> dict:
    """Select the better of two responses; returns {"winner": ..., "reason": ...}"""
    prompt = f"""Compare two responses and select the better one.
 
Question: {question}
Response A: {response_a}
Response B: {response_b}
 
Return: {{"winner": "A" or "B" or "tie", "reason": "rationale"}}"""
    result = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=512, temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(result.content[0].text)

LLM-as-Judge Limitations

| Limitation | Description | Mitigation |
|---|---|---|
| Position bias | Tends to prefer first response | Evaluate twice with swapped order |
| Length bias | Tends to score longer responses higher | Add "conciseness" criterion |
| Self-bias | Prefers responses from same model | Use different model as judge |
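
The position-bias mitigation in the table (evaluate twice with swapped order) can be sketched as follows. `debiased_compare` is a hypothetical helper; it assumes `compare` is any callable that returns "A", "B", or "tie" for the response shown first versus second, for example a thin wrapper that returns `pairwise_compare(...)["winner"]`.

```python
from typing import Callable


def debiased_compare(
    compare: Callable[[str, str, str], str],
    question: str,
    response_a: str,
    response_b: str,
) -> str:
    """Run the judge in both orders and keep only consistent verdicts."""
    first = compare(question, response_a, response_b)
    second = compare(question, response_b, response_a)  # order swapped
    # Map the swapped verdict back to the original labels.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}.get(second, "tie")
    # Disagreement between the two orders is treated as a tie.
    return first if first == unswapped else "tie"
```

This doubles the judge cost per comparison. The share of pairs where the two orders disagree is itself a useful signal of how position-sensitive the judge is.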

4. RAG Evaluation: RAGAS

RAGAS Framework

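# Column names below follow the ragas 0.1-style schema; newer releases may
# expect different names, so check the version you have installed.
# evaluate() calls an LLM and an embedding model, so credentials for the
# configured provider must be available.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset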
 
eval_data = {
    "question": ["How to adjust Spark shuffle partitions?"],
    "answer": ["Change the spark.sql.shuffle.partitions setting..."],
    "contexts": [["Spark shuffle partitions are set via spark.sql.shuffle.partitions..."]],
    "ground_truth": ["Adjust shuffle partition count via spark.sql.shuffle.partitions."]
}
 
dataset = Dataset.from_dict(eval_data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)

Metric Interpretation

| Metric | Meaning | If Low | Improvement |
|---|---|---|---|
| Faithfulness | Response grounded in context? | Hallucination | Emphasize "use context only" in prompt |
| Answer Relevancy | Response addresses question? | Off-topic answers | Improve prompt, retrieval quality |
| Context Precision | Retrieved results relevant? | Noisy documents | Add re-ranking, metadata filters |
| Context Recall | Found all needed info? | Missing information | Increase K, hybrid search |
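
The "Increase K" advice for low context recall is easy to check with a plain Recall@K calculation when you have labeled relevant documents. The sketch below is an ID-based metric, not the LLM-based `context_recall` that RAGAS computes; the function name and document IDs are illustrative.

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical ranking for one query with two relevant documents.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranking, relevant, k=2))  # 0.5
print(recall_at_k(ranking, relevant, k=4))  # 1.0
```

Raising K from 2 to 4 recovers the second relevant document here, at the cost of passing two irrelevant ones to the generator, which is the trade-off that context precision captures.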

5. Building Custom Evaluation Systems

Evaluation Dataset Design

eval_dataset = [
    {
        "id": "eval-001",
        "category": "troubleshooting",
        "difficulty": "medium",
        "question": "Spark executor OOM error. What are the causes and solutions?",
        "reference_answer": "Increase executor.memory, increase partitions, or...",
        "required_elements": ["executor.memory", "partitions", "data skew"],
        "evaluation_criteria": {
            "accuracy": "Technically accurate root cause analysis",
            "completeness": "Includes at least 3 major causes",
            "actionability": "Includes specific config values or code",
            "format": "Structured format (numbered list, etc.)"
        }
    }
]
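
Before running a dataset like this through a pipeline, it can be worth validating its structure so that a missing field fails fast instead of mid-run. A minimal sketch follows, assuming the field names shown above are all mandatory; `validate_eval_dataset` and `REQUIRED_KEYS` are hypothetical names.

```python
REQUIRED_KEYS = {
    "id", "category", "question",
    "reference_answer", "required_elements", "evaluation_criteria",
}


def validate_eval_dataset(items: list) -> list:
    """Return a list of human-readable problems; empty means the data is OK."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"item {index}: missing keys {sorted(missing)}")
        item_id = item.get("id")
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"item {index}: duplicate id {item_id!r}")
            seen_ids.add(item_id)
    return problems
```

Running this check in CI whenever the dataset changes keeps production failure cases that are added later consistent with the original schema.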

Automated Evaluation Pipeline

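import json

import anthropic

class EvaluationPipeline:
    def __init__(self, judge_model="claude-sonnet-4-6"):
        self.judge = anthropic.Anthropic()
        self.judge_model = judge_model
        self.results = []

    def evaluate_batch(self, model_fn, eval_dataset):
        # model_fn takes a question string and returns the model's response text
        for item in eval_dataset:
            response = model_fn(item["question"])
            auto_scores = self.auto_evaluate(item, response)
            judge_scores = self.judge_evaluate(item, response)
            self.results.append({
                "id": item["id"], "category": item["category"],
                "auto_scores": auto_scores, "judge_scores": judge_scores
            })
        return self.aggregate_results()

    def auto_evaluate(self, item, response):
        # Rule-based check: fraction of required keywords found in the response
        required = item.get("required_elements", [])
        found = sum(1 for e in required if e.lower() in response.lower())
        return {"element_coverage": found / max(len(required), 1)}

    def judge_evaluate(self, item, response):
        # Minimal judge call; assumes the model returns bare JSON with an "average" field
        criteria = "\n".join(f"- {k}: {v}" for k, v in item["evaluation_criteria"].items())
        prompt = (
            "Evaluate the response on each criterion (1-5 points each).\n\n"
            f"## Question\n{item['question']}\n\n## Response\n{response}\n\n"
            f"## Criteria\n{criteria}\n\n"
            'Return JSON: {"scores": {"criterion": score}, "average": avg}'
        )
        result = self.judge.messages.create(
            model=self.judge_model, max_tokens=1024, temperature=0.0,
            messages=[{"role": "user", "content": prompt}]
        )
        return json.loads(result.content[0].text)

    def aggregate_results(self):
        # Avoid dividing by zero when nothing has been evaluated yet
        if not self.results:
            return {"overall_average": None, "total_evaluated": 0}
        return {
            "overall_average": sum(r["judge_scores"]["average"] for r in self.results) / len(self.results),
            "total_evaluated": len(self.results)
        }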
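
A single overall average can hide a weak category. Because each result stored by `evaluate_batch` already carries a `category` field, a per-category breakdown is a small addition. The function below is a hypothetical sketch that assumes the result layout shown in the pipeline, with an `average` key inside `judge_scores`.

```python
from collections import defaultdict


def average_by_category(results: list) -> dict:
    """Mean judge score per category, from EvaluationPipeline-style results."""
    buckets = defaultdict(list)
    for result in results:
        buckets[result["category"]].append(result["judge_scores"]["average"])
    return {category: sum(scores) / len(scores) for category, scores in buckets.items()}
```

Comparing these per-category means before and after a prompt or model change shows whether an overall gain came at the expense of one task type.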

6. A/B Testing and Production Evaluation

Online A/B Testing

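import hashlib

class ABTestManager:
    def __init__(self):
        self.experiments = {}

    def create_experiment(self, name, variants, traffic_split):
        # variants: dict of variant name -> config
        # traffic_split: dict of variant name -> fraction of traffic (should sum to 1.0)
        self.experiments[name] = {
            "variants": variants, "traffic_split": traffic_split,
            "results": {k: [] for k in variants}
        }

    def get_variant(self, experiment_name, user_id):
        # Python's built-in hash() is salted per process for strings, so use a
        # stable digest to keep each user in the same variant across restarts.
        digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
        hash_val = int(digest, 16) % 100
        cumulative = 0
        for variant, ratio in self.experiments[experiment_name]["traffic_split"].items():
            cumulative += ratio * 100
            if hash_val < cumulative:
                return variant
        # Fallback if the split sums to less than 1.0
        return list(self.experiments[experiment_name]["variants"].keys())[-1]

    def record_feedback(self, experiment_name, variant, score):
        self.experiments[experiment_name]["results"][variant].append(score)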
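
Once feedback has been recorded for both variants, the remaining question is whether the observed difference is larger than chance. A minimal sketch of a two-proportion z-test follows, assuming the recorded scores are binary (1 for thumbs up, 0 for thumbs down). The function name is hypothetical and only the standard library is used; for small samples an exact test may be more appropriate.

```python
import math


def two_proportion_z_test(success_a: int, total_a: int, success_b: int, total_b: int):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both variants need at least one observation")
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0  # all successes or all failures: no evidence of a difference
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

With the `ABTestManager` results, the inputs would be `sum(scores)` and `len(scores)` for each variant. Deciding the sample size and significance threshold before the experiment starts avoids stopping early on a lucky streak.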

Production Monitoring Metrics

| Metric | Description | Collection | Target |
|---|---|---|---|
| User satisfaction | Thumbs up/down ratio | Feedback buttons | > 80% positive |
| Task completion | Did user get desired result | Follow-up analysis | > 70% |
| Response latency | TTFT, total response time | Server metrics | P95 < 3s |
| Hallucination rate | Factually incorrect ratio | Auto verification + human | < 5% |
| Cost | Token cost, daily cost | API billing | Within budget |
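
Two of the targets in this table (P95 latency under 3 seconds, more than 80% positive feedback) can be checked with a few lines of code over a window of logged requests. The sketch below uses the nearest-rank percentile definition, which is one of several conventions; function names and the input format are assumptions.

```python
def percentile_nearest_rank(values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def check_targets(latencies_s: list, thumbs_up: int, thumbs_total: int) -> dict:
    """Compare one monitoring window against the table's latency and satisfaction targets."""
    p95 = percentile_nearest_rank(latencies_s, 95)
    positive_rate = thumbs_up / thumbs_total if thumbs_total else 0.0
    return {
        "p95_latency_s": p95,
        "latency_ok": p95 < 3.0,
        "positive_rate": positive_rate,
        "satisfaction_ok": positive_rate > 0.80,
    }
```

Metrics systems differ in how they interpolate percentiles, so numbers from this sketch may not match a dashboard exactly on small windows.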

Note: LLM evaluation is a process of "continuous improvement," not "one-time perfection." Collect failure cases from production data and add them to your evaluation dataset as a feedback loop.


References

  • Hendrycks, D. et al. (2021). "Measuring Massive Multitask Language Understanding (MMLU)." ICLR
  • Chen, M. et al. (2021). "Evaluating Large Language Models Trained on Code (HumanEval)." arXiv
  • Zheng, L. et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS
  • Es, S. et al. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL
  • LMSYS Chatbot Arena — https://chat.lmsys.org/

— Data Dynamics Engineering Team