Tags: ai-agent, evaluation, llm-as-judge, testing, ai

Agent Evaluation — Measuring Quality When There's More Than One Right Answer

Operations trilogy, Part 2. Agent-specific evaluation sits one level above LLM evaluation: task success rate, trajectory evaluation, regression testing, and the traps of LLM-as-judge, laid out practically in a data engineering context where many different SQL queries yield the same answer.

Data Dynamics · June 21, 2026 · 8 min read

"Is this agent working well?" is surprisingly hard to answer. For a classification model, the answer is a single accuracy number. For a data engineering agent, many different SQL queries produce the right answer to the same question: one gets there with a join, another with a subquery. Comparing the output string against a reference keeps flagging perfectly good answers as wrong.

This is why agent evaluation is one layer trickier than ordinary LLM evaluation. This article is Part 2 of the Operations trilogy on putting a data engineering agent into production. It is the step that takes the trajectory data accumulated in Part 1 (AgentOps observability) and uses it as fuel for measuring quality.

What you'll learn

How agent evaluation differs from LLM evaluation (outcome vs trajectory)

How to define "success" when there's more than one right answer

Trajectory evaluation — when the answer is right but the process is wrong

How to catch the side effects of a prompt change with regression testing

The traps of LLM-as-judge and how to tame it

Operations trilogy — Part 1 AgentOps observability · Part 2 Agent Evaluation (this article) · Part 3 Agent Security

The fundamentals of LLM evaluation (metrics, dataset construction, automatic/human evaluation) are laid out in the LLM Evaluation Guide. This article adds the axes unique to agents on top of that.

1. Outcome Alone Isn't Enough — Two Axes of Evaluation

LLM evaluation usually looks at the output: how close it is to the reference. Agents add a second axis: how the agent got there, that is, the trajectory.

Why do you need both axes? Because it is common for the answer to be right while the process is terrible. An agent that produced the right answer but ran three full table scans and self-corrected five times passes outcome evaluation, yet it is a cost bomb in production. Conversely, getting the answer right by luck is dangerous too (for example, a wrong join that happens to yield the same count). You have to look at the process to see real skill.

Evaluation axis	Question	Measured by
Outcome	Is the final answer right?	reference match, result-set equivalence
Trajectory	Is the process reasonable?	step count, tool choice, retries, cost
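
As a minimal sketch of keeping the two axes apart, one record per run can carry an outcome field and a few trajectory fields, and pass only when both axes pass. The field names and the thresholds below are illustrative assumptions, not values from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvaluation:
    # Outcome axis: did the final answer match the reference result?
    result_matches: bool
    # Trajectory axis (hypothetical fields, filled from trace data)
    steps: int
    retries: int
    full_table_scans: int

    def trajectory_ok(self, max_steps=6, max_retries=2, max_full_scans=0):
        # Thresholds are placeholders; tune them to your own workload.
        return (
            self.steps <= max_steps
            and self.retries <= max_retries
            and self.full_table_scans <= max_full_scans
        )

    def verdict(self):
        if not self.result_matches:
            return "fail: wrong answer"
        if not self.trajectory_ok():
            return "fail: right answer, wasteful process"
        return "pass"


# The run described above: right answer, three full scans, five self-corrections.
costly = RunEvaluation(result_matches=True, steps=9, retries=5, full_table_scans=3)
clean = RunEvaluation(result_matches=True, steps=3, retries=0, full_table_scans=0)
```

Here `costly.verdict()` reports a failure even though the answer is right, which is the case outcome-only evaluation would miss.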

2. Defining "Success" — When There's More Than One Right Answer

If string comparison won't do, what judges success? In a data engineering context there are, fortunately, more robust criteria.

The usable success criteria, by level:

Execution accuracy — compare the result sets the queries return, not the SQL strings. The key is to ignore row order and column order in the comparison (unless the question itself asks for a sort). It's the de facto standard for Text-to-SQL evaluation.
Task success — judge by final state, like "was the table created and did it pass quality rules." Fits transform/pipeline work.
Constraint satisfaction — even if the result is right, mark it a failure if it exceeds operational constraints like row limit, scan size, or runtime.
Human preference — for cases automatic judgment can't settle, humans label them into a golden dataset.
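
Execution accuracy can be sketched with the standard-library `sqlite3` module. The schema, the queries, and the helper names `normalized_rows` and `execution_match` are made up for illustration. The normalization sorts the values inside each row, which ignores column order but also column names; a stricter version would align columns explicitly.

```python
import sqlite3
from collections import Counter


def normalized_rows(conn, sql):
    """Run a query and return its rows as a multiset, ignoring row and column order."""
    rows = conn.execute(sql).fetchall()
    # Sorting each row's values by repr makes column order irrelevant.
    # Counter keeps duplicate rows, which a plain set would drop.
    return Counter(tuple(sorted(map(repr, row))) for row in rows)


def execution_match(conn, predicted_sql, reference_sql):
    """True when both queries return the same multiset of rows."""
    try:
        predicted = normalized_rows(conn, predicted_sql)
    except sqlite3.Error:
        return False  # a query that fails to run cannot count as a success
    return predicted == normalized_rows(conn, reference_sql)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 2, 3.0);
""")

reference = """
    SELECT c.name, SUM(o.amount) FROM customers c
    JOIN orders o ON o.customer_id = c.id GROUP BY c.name
"""
# Same answer through a correlated subquery, with the columns swapped.
candidate = """
    SELECT (SELECT SUM(amount) FROM orders o WHERE o.customer_id = c.id), c.name
    FROM customers c
"""
wrong = "SELECT name, 0.0 FROM customers"

join_vs_subquery = execution_match(conn, candidate, reference)
wrong_vs_reference = execution_match(conn, wrong, reference)
```

The join and the subquery differ as strings but match on execution, while the query with wrong totals does not.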

One core principle: evaluate the effect, not the output. Grading on what the agent achieved rather than what it wrote makes evaluation robust.

3. Trajectory Evaluation — Grading the Process

Trajectory evaluation takes the trace/span data accumulated in Part 1 directly as input. This is where observability becomes the fuel for evaluation.

Trajectory metrics to watch in particular:

Step efficiency — actual step count vs the ideal path. A step count that keeps growing usually signals an ambiguous prompt or tool description.
Tool-selection accuracy — did it call the right tool with the right args? Wrong tool calls are a primary source of cost/risk.
Resilience — did it recover after an error, or loop on the same mistake (the "loop hell" from Part 1)?
Groundedness — is the final answer based on actual tool return values, or plausibly fabricated? The core of hallucination detection.
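
The first three metrics can be computed directly from a recorded trace. The sketch below assumes a simplified `Step` record (tool name, serialized arguments, success flag); real trace or span schemas will differ, and tool-selection accuracy here checks only the tool name, not its arguments. Groundedness is left out because it needs the tool outputs and the final answer, not just the call log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    args: str   # serialized arguments, kept as a string for simplicity
    ok: bool    # False when the tool call returned an error


def step_efficiency(steps, ideal_steps):
    """Ideal step count divided by actual; 1.0 means the ideal path was taken."""
    return ideal_steps / len(steps) if steps else 0.0


def tool_accuracy(steps, expected_tools):
    """Share of calls that used a tool from the expected set."""
    if not steps:
        return 0.0
    return sum(s.tool in expected_tools for s in steps) / len(steps)


def longest_repeated_failure(steps):
    """Longest run of consecutive identical failing calls (a loop signal)."""
    longest = run = 0
    previous = None
    for s in steps:
        key = (s.tool, s.args)
        if s.ok:
            run, previous = 0, None
        else:
            run = run + 1 if key == previous else 1
            previous = key
        longest = max(longest, run)
    return longest


# A hypothetical trace: the agent retries the same misspelled query three times.
trace = [
    Step("list_tables", "{}", True),
    Step("run_sql", "SELECT * FROM ordrs", False),
    Step("run_sql", "SELECT * FROM ordrs", False),
    Step("run_sql", "SELECT * FROM ordrs", False),
    Step("run_sql", "SELECT * FROM orders", True),
]
```

For this trace, `step_efficiency(trace, ideal_steps=2)` is 0.4 and `longest_repeated_failure(trace)` is 3, which flags the repeated mistake even though the run eventually recovers.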

4. Regression Testing — Stopping One Prompt Fix From Breaking Everything

The scary part of agent development: fix one line of a prompt and you don't know what breaks. A change that fixes case A quietly breaks cases B and C. So agents, like ordinary software, need a regression test suite.

Here Parts 1 and 2 interlock: the failed trajectories caught by observability become regression test cases. Pin a question the agent once got wrong in production as a fixed case, and the moment the same bug comes back, CI catches it. The discipline of testing transform logic in verifiable units is in the same spirit as the PySpark Testing Guide, and the mindset of treating prompts as something to verify also touches the Spec-Driven Development series.

A practical tip: bind evaluation into CI to run automatically, and block the merge when the score drops below a threshold. Rely on human willpower and it will always get skipped.
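
A gate of this kind might look like the sketch below. The case ids, the results, and the 0.75 threshold are invented for illustration. Blocking on any individual regression, in addition to the threshold, is an assumption of this sketch rather than something the article prescribes.

```python
# Hypothetical suite results: case id -> passed? In practice these would come
# from re-running the agent on every pinned regression case.
BASELINE = {"q001": True, "q002": True, "q003": False, "q004": True}
CANDIDATE = {"q001": True, "q002": False, "q003": True, "q004": True}


def pass_rate(results):
    return sum(results.values()) / len(results)


def regressions(baseline, candidate):
    """Cases that passed before the change and fail (or are missing) after it."""
    return sorted(c for c, ok in baseline.items() if ok and not candidate.get(c, False))


def gate(baseline, candidate, min_pass_rate=0.75):
    """Return (exit_code, message); a non-zero exit code should block the merge."""
    broken = regressions(baseline, candidate)
    rate = pass_rate(candidate)
    if broken:
        return 1, f"blocked: regressions in {broken}"
    if rate < min_pass_rate:
        return 1, f"blocked: pass rate {rate:.2f} below {min_pass_rate:.2f}"
    return 0, f"ok: pass rate {rate:.2f}"


code, message = gate(BASELINE, CANDIDATE)
print(message)
```

Note that both runs have the same 0.75 pass rate: the change fixed `q003` and broke `q002`, so an aggregate score alone would have let it through. In a CI job the script would finish with `sys.exit(code)` so the pipeline fails the check.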

5. LLM-as-Judge — Powerful, but You Have to Tame It

For areas automatic grading can't settle (appropriateness of a description, answer quality), LLM-as-judge — handing the grading to another LLM — is common. It's fast, cheap, and scalable, but it's full of traps, so don't trust it as-is.

Known traps and responses:

Position bias — when comparing two answers, the judge favors one slot (often the answer presented first) regardless of content. → Ask twice with the order swapped and check that the two verdicts agree.
Verbosity bias — rates long, flashy answers better. → Specify "conciseness" in the rubric and control for length.
Self-preference — the judge favors outputs from its own model family. → Use different models for generation and grading.
Rubric drift — grading by "vibe" without clear criteria. → Give a concrete rubric with a definition stamped on each score, plus few-shot examples. Good rubric design overlaps with the techniques in Prompt Engineering in Practice.
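
The order-swap check for position bias can be sketched as follows. `judge` stands for a hypothetical callable that wraps whatever model does the grading and returns "first" or "second"; the two stub judges exist only so the example runs without calling a model.

```python
def consistent_preference(judge, question, answer_a, answer_b):
    """Ask a pairwise judge twice with the order swapped.

    Returns "a" or "b" when both verdicts pick the same answer,
    or None when the verdict flips with the presentation order.
    """
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else None


# Stand-in judges (assumptions for the demo, not real model calls).
def first_slot_judge(question, first, second):
    return "first"  # pure position bias: always picks the first slot


def shorter_answer_judge(question, first, second):
    return "first" if len(first) <= len(second) else "second"


question = "Which table holds refunds?"
short, long_ = "payments.refunds", "It is probably somewhere in the payments schema, maybe."

biased = consistent_preference(first_slot_judge, question, short, long_)
stable = consistent_preference(shorter_answer_judge, question, short, long_)
```

Here `biased` is `None` because the verdict flips with the order, while `stable` is `"a"`. One reasonable policy is to treat `None` as a tie or route it to human review.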

The most important principle: evaluate the judge too (meta-evaluation). Periodically cross-check the LLM judge's scores against a sample of human labels to measure how well the judge agrees with humans. If the judge isn't trustworthy, everything built on top of it is a sandcastle. And where you can, prefer deterministic verification (like execution accuracy) over LLM judgment — if a result-set comparison is possible, there's no reason to ask an LLM.
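
One way to put a number on judge-versus-human agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The article does not prescribe a specific statistic, so treat this as one common choice; the labels below are invented.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Agreement between two raters, corrected for chance agreement."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sample: the judge is more lenient than the human labelers.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]

kappa = cohens_kappa(judge, human)
```

In this sample the judge agrees with humans on 6 of 8 items (75%), but kappa is only 0.5, because a judge that says "pass" most of the time agrees fairly often by chance alone.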

6. The Evaluation Pipeline — Tying It All Together

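Tied into a single flow, the pieces so far fit together like this: failed trajectories from production are pinned as fixed cases in the golden dataset; every change to a prompt, model, or tool reruns the suite; outcomes are graded deterministically wherever a result-set comparison is possible and by a human-checked LLM judge elsewhere; trajectories are scored from the trace data; and a CI gate blocks the merge when the score drops below the threshold.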

Once this pipeline is in place, every time you change a prompt, model, or tool, you know by the numbers whether it got better or worse. The conversation shifts from "it feels like it got better" to "execution accuracy +4pp with no regression."

Closing — You Can't Improve What You Can't Measure

If Part 1 was "you can't fix what you can't see," the lesson of Part 2 is "you can't improve what you can't measure." The heart of agent evaluation is dropping the urge to compare outputs literally and instead evaluating the effect (result-set equivalence) and grading the process (trajectory), then nailing it down with regression tests bound into CI. LLM-as-judge is powerful, but don't trust it before you've tamed it.

Now, with observability (Part 1) and evaluation (Part 2), we've made the agent visible and measurable. The last bridge is safety. An agent that holds tools and touches real systems becomes, if poorly designed, an attack surface in its own right. Part 3, Agent Security, covers the threats and defenses beyond prompt injection.

In one sentence: Evaluate an agent by effect, not output; watch both the outcome and trajectory axes; and only when nailed down with regression tests bound to a CI gate does quality become engineering.