Evaluation First

The cost of skipping this step

In 2023, two New York lawyers submitted a legal brief containing citations to six cases that didn't exist. ChatGPT had invented them, complete with plausible-sounding case names, docket numbers, and holdings. The lawyers had used the output without verification. A federal judge fined the two lawyers and their firm $5,000.

In February 2024, a Canadian tribunal ordered Air Canada to pay damages after its customer service chatbot told a passenger he could claim a bereavement fare retroactively, which the airline's policy did not allow. The airline's defense, that the chatbot was a "separate legal entity" responsible for its own statements, was rejected. Air Canada was liable for what its AI said.

Neither of those teams thought they were being reckless. They tested their systems. The systems seemed to work. They shipped.

The gap between "it seemed to work in testing" and "it works reliably at scale" is evaluation. Not testing in the software-engineering sense — test coverage, unit tests, assertions. Evaluation: systematic, statistical measurement of output quality across a representative range of inputs.

This is the hardest problem in AI engineering. It's also where most teams invest the least.

Why evaluation is harder than testing

Unit testing works because software is deterministic. You write a function, assert the output for given inputs, and the assertion either passes or fails. Logic that worked yesterday works today.

None of those properties hold for LLM-based systems.

The output is open-ended. Ask the model to summarize a support ticket and there's no single correct answer. There are good answers and bad ones, but no string comparison will tell you which is which. You can't assert output == expected_output when expected_output could be any of a hundred valid responses.

The model is a black box. You can't trace a bad output to a line of code. You can observe inputs and outputs. You can't observe why.

Benchmarks saturate fast. The research community has played this game many times. GLUE (General Language Understanding Evaluation) was released in 2018 as a benchmark for language models. By 2019, models had matched human performance — so SuperGLUE was introduced. MMLU, introduced in 2020 to measure broad reasoning, was showing saturation by 2024, prompting MMLU-Pro. Academic benchmarks are a moving target, and a model that tops the leaderboard on MMLU might be mediocre at your actual task.

The consequence: you cannot evaluate AI systems the way you evaluate traditional software. You need different methods, different data, and a different definition of what passing looks like.

Three evaluation methods

There are three fundamentally different ways to evaluate LLM output. Each has a different cost, a different level of validity, and different appropriate uses. Most real applications need all three.

1. Exact evaluation

Exact evaluation compares model output against a reference — either programmatically or against known-correct answers.

Functional correctness is the strongest form. For code generation, you run the generated code against test cases. The code either works or it doesn't. This is why coding benchmarks like HumanEval have remained valuable longer than language benchmarks — the evaluation is ground truth.

The metric here is pass@k: given k generated solutions to a problem, what's the probability at least one passes? OpenAI's Codex achieved pass@1 of 28.8% and pass@100 of 72.3% on HumanEval when it launched in 2021. The difference matters: if you generate one solution and ship it, you get different reliability than if you generate five and keep one that passes the tests.
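A minimal sketch of how pass@k can be estimated, following the unbiased estimator described in Chen et al.: draw n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k), then average over problems. The function names and the sample counts below are illustrative assumptions, not taken from any benchmark run.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_passed)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: three problems, 10 samples each.
results = [(10, 3), (10, 0), (10, 10)]
print("pass@1:", mean_pass_at_k(results, k=1))
print("pass@5:", mean_pass_at_k(results, k=5))
```

The gap between the two printed numbers is the same effect as the gap between Codex's pass@1 and pass@100: more attempts per problem raise the chance that at least one works.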

Exact match works for short, constrained outputs: named entity extraction, classification labels, factual Q&A where the answer is a specific string. It breaks for anything generative. "Paris" either matches "Paris" or it doesn't. A 200-word summary can't be evaluated that way.

Lexical similarity — BLEU and ROUGE — extends exact matching to longer outputs by measuring n-gram overlap between the model output and a reference text. ROUGE-1 measures unigram overlap; ROUGE-2 measures bigram overlap. These are widely used in summarization benchmarks.

The problem: n-gram overlap measures surface similarity, not semantic correctness. Consider two sentences:

"My cats scare the mice."
"The mice are frightened by my cats."

Same meaning. Yet ROUGE, scoring the second sentence against the first, penalizes its extra words and reordered phrasing. Meanwhile, "My cats eat the mice" scores higher against the same reference despite meaning something entirely different, because it shares more surface n-grams.

Lexical metrics are useful when output consistency matters (translation, templated generation). They're unreliable for tasks where paraphrase is acceptable.
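To make the cats-and-mice comparison concrete, here is a simplified n-gram overlap score in the spirit of ROUGE-N (F1 over clipped n-gram matches). It is a sketch, not the official ROUGE implementation: there is no stemming or stopword handling, and the tokenizer is a plain regex chosen for illustration.

```python
import re
from collections import Counter


def ngrams(text: str, n: int) -> Counter:
    """Lowercase, tokenize on word characters, and count n-grams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """Simplified ROUGE-N: F1 of clipped n-gram overlap."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())  # Counter & keeps the min count per n-gram
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "My cats scare the mice."
paraphrase = "The mice are frightened by my cats."
different = "My cats eat the mice."

for n in (1, 2):
    print(f"ROUGE-{n}  paraphrase={rouge_n_f1(reference, paraphrase, n):.2f}  "
          f"different meaning={rouge_n_f1(reference, different, n):.2f}")
```

On both unigrams and bigrams, the sentence with the different meaning outscores the faithful paraphrase, which is exactly the failure mode described above.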

Semantic similarity uses embeddings to overcome this limitation. You embed both the reference and the model output as vectors, then compute cosine similarity. Semantically equivalent sentences cluster together in embedding space even if they use different words.

This works well for semantic equivalence checks. It's weaker for factual accuracy — two sentences can be semantically similar while one is true and one is false.
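The cosine similarity step itself is small. The sketch below uses hand-written three-dimensional vectors as stand-ins for embeddings; in practice the vectors would come from an embedding model, and both the toy values and the variable names here are assumptions made for illustration.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for a zero vector; no direction to compare
    return dot / (norm_u * norm_v)


# Toy "embeddings" (hypothetical values, not produced by any real model).
reference_vec = [0.9, 0.1, 0.3]
paraphrase_vec = [0.8, 0.2, 0.35]
unrelated_vec = [0.1, 0.9, 0.0]

print("paraphrase:", round(cosine_similarity(reference_vec, paraphrase_vec), 3))
print("unrelated: ", round(cosine_similarity(reference_vec, unrelated_vec), 3))
```

Any pass threshold applied to these scores is a tuning choice that depends on the embedding model and the task, so it is worth calibrating against a handful of labeled pairs.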

2. AI as a judge

In 2023, 6 out of 70 AI decision-makers surveyed by a16z evaluated their models through word of mouth — they shipped to a few users and asked if it seemed to work.

The teams doing better were using models to evaluate models.

The core insight: if you need to judge open-ended output quality, and hiring human raters is expensive and slow, a capable LLM can do the same job for a fraction of the cost in real time. And it performs surprisingly well. In MT-Bench evaluations, GPT-4 as a judge agreed with human raters 85% of the time, while human raters agreed with each other 81% of the time. The AI judge matched human judgment at least as well as a second human did.

By late 2023, 58% of LangChain platform evaluations were using AI judges.

How to structure AI-as-judge prompts

There are three patterns:

Standalone evaluation gives the judge a response and asks it to rate quality on defined criteria. No reference needed.

System: You are evaluating the quality of a customer service response.
Rate the following response on two criteria:
1. Accuracy: Does it answer the customer's question correctly? (1–5)
2. Tone: Is it professional and empathetic? (1–5)

Customer question: {{question}}
Agent response: {{response}}

Provide a score for each criterion and a one-sentence justification.
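One way the standalone prompt above might be wired into code is sketched here. The model call is passed in as a plain function (call_model is a hypothetical stand-in for whatever client you use), and the "Accuracy: N" / "Tone: N" reply format is an assumption added to the prompt so the scores can be parsed.

```python
import re
from typing import Callable, Dict

# The final line of the template is an added assumption, not part of the prompt above.
JUDGE_TEMPLATE = """You are evaluating the quality of a customer service response.
Rate the following response on two criteria:
1. Accuracy: Does it answer the customer's question correctly? (1-5)
2. Tone: Is it professional and empathetic? (1-5)

Customer question: {{question}}
Agent response: {{response}}

Provide a score for each criterion and a one-sentence justification.
Start each score on its own line as "Accuracy: N" or "Tone: N"."""


def build_judge_prompt(question: str, response: str) -> str:
    """Fill the {{question}} and {{response}} placeholders."""
    return (JUDGE_TEMPLATE
            .replace("{{question}}", question)
            .replace("{{response}}", response))


def parse_scores(judge_reply: str) -> Dict[str, int]:
    """Extract the 1-5 scores; fail loudly if the judge ignored the format."""
    scores = {}
    for name in ("Accuracy", "Tone"):
        match = re.search(rf"{name}\s*:\s*([1-5])\b", judge_reply, re.IGNORECASE)
        if match is None:
            raise ValueError(f"no {name} score found in judge reply")
        scores[name.lower()] = int(match.group(1))
    return scores


def judge(question: str, response: str,
          call_model: Callable[[str], str]) -> Dict[str, int]:
    """Run one standalone evaluation through the supplied model function."""
    return parse_scores(call_model(build_judge_prompt(question, response)))


# Stub standing in for a real model call, so the sketch runs offline.
def fake_model(prompt: str) -> str:
    return "Accuracy: 4. Answers the refund question.\nTone: 5. Warm and professional."


print(judge("Can I get a refund?", "Yes, within 30 days of purchase.", fake_model))
```

Raising on an unparseable reply is a deliberate choice: a judge that silently returns no score can make a bad batch look clean.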

Reference-based evaluation gives the judge a reference answer and asks it to compare. Useful when you have a known-good answer to measure against.

Comparative evaluation gives the judge two responses and asks which is better, without assigning numeric scores. This is the format that powers LMSYS Chatbot Arena, which ranks models from millions of human comparisons across model pairs. The AI judge version (AlpacaEval) reports a 0.98 Spearman correlation with human arena rankings.

Scoring: discrete beats continuous

A 1–5 scale produces better signal than asking for a score between 0 and 100. With 100 points of resolution, judges — human or AI — are inconsistent about where to draw lines. The difference between 73 and 74 means nothing. The difference between 3 ("partially correct, missing key context") and 4 ("correct and complete") is calibrated and repeatable.

For many tasks, an even simpler framing works well: classify the response as pass/fail or good/needs improvement. Binary classification is the most consistent rating format.

What changes the judge

An AI judge is a combination of the model you use and the evaluation prompt you write. Changing either changes what the judge measures. A judge prompt that asks "is this response helpful?" and one that asks "does this response contain only information present in the source document?" are evaluating different things, even on the same model.

The judge itself is not ground truth. It's a proxy for human judgment, which means it inherits the biases of the model it's running on. GPT-4 as a judge has been shown to favor longer responses even when the extra length adds nothing, plausibly because verbosity correlates with helpfulness in its training data. If your task favors conciseness, that's a systematic error you need to account for.

When to use AI as a judge

Use it when:

You need to evaluate at scale (thousands of examples)
Your outputs are open-ended and can't be evaluated by string comparison
Human rating would take too long to iterate at development speed
You need to track quality continuously in production

Don't use it as your only evaluation method. AI judges are probabilistic — run the same example twice and you may get different scores. For high-stakes decisions, validate your AI judge against a human-labeled sample before trusting it.
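Validating a judge against a human-labeled sample can start with two numbers: raw agreement and Cohen's kappa, which corrects agreement for what chance alone would produce. Kappa is a suggestion made here rather than something the text prescribes, and the labels below are invented for illustration.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of examples where the two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    observed = agreement_rate(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for the same four responses.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]

print("agreement:", agreement_rate(human_labels, judge_labels))
print("kappa:    ", cohens_kappa(human_labels, judge_labels))
```

Because the judge is non-deterministic, it can also help to run it on the same sample twice and compute the same two numbers between the runs.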

3. Comparative evaluation

Comparative evaluation doesn't ask "how good is this?" It asks "which of these is better?"

This is a different cognitive task — for humans and models. Ranking two responses requires less calibration than assigning a score. You're not trying to decide where on an abstract scale a response falls; you're deciding which of two concrete options you'd prefer.

This makes comparative evaluation particularly useful in two contexts:

Model selection. You have candidates A, B, and C. Run the same eval set through all three and compare outputs pairwise. You're not looking for an absolute quality score — you're looking for which model performs better on your specific inputs.

Prompt iteration. You changed your system prompt. Instead of asking "is the new prompt better?" in the abstract, run both through your eval set and count which version wins more comparisons. The answer is relative, but it's meaningful.
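Counting wins across an eval set can be sketched as follows. The prefer function is a hypothetical comparator (a human rater or an AI judge) that returns "first" or "second". Each pair is asked in both orders and only counted as a win when the two verdicts agree; that swap is an added precaution against position bias, not something the process above requires.

```python
from collections import Counter
from typing import Callable, Sequence

# prefer(input_text, first_response, second_response) -> "first" or "second"
Comparator = Callable[[str, str, str], str]


def count_wins(inputs: Sequence[str], outputs_a: Sequence[str],
               outputs_b: Sequence[str], prefer: Comparator) -> Counter:
    """Tally wins for variant A and B; inconsistent verdicts count as ties."""
    tally = Counter({"a": 0, "b": 0, "tie": 0})
    for text, out_a, out_b in zip(inputs, outputs_a, outputs_b):
        a_shown_first = prefer(text, out_a, out_b)   # A in first position
        b_shown_first = prefer(text, out_b, out_a)   # B in first position
        if a_shown_first == "first" and b_shown_first == "second":
            tally["a"] += 1
        elif a_shown_first == "second" and b_shown_first == "first":
            tally["b"] += 1
        else:
            tally["tie"] += 1
    return tally


# Toy comparator that prefers the longer response (stand-in for a real judge).
def longer_is_better(text: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


inputs = ["q1", "q2", "q3"]
old_prompt_outputs = ["long answer", "ok", "same"]
new_prompt_outputs = ["short", "better one", "four"]
print(count_wins(inputs, old_prompt_outputs, new_prompt_outputs, longer_is_better))
```

For model selection among three candidates, the same function can be run once per pair (A vs. B, A vs. C, B vs. C).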

Comparative evaluation is also the basis for most human preference data collection — the same format used to train reward models in RLHF. The instinct that makes it work for model evaluation is the same instinct that makes it work for model training: humans are better at "which is better" than "how good is this."

Building an evaluation pipeline

Knowing the three evaluation methods is necessary but not sufficient. The practical question is how to turn them into a system you actually run.

Here's the process in three steps.

Step 1: Define what you're evaluating and create an evaluation guideline.

An evaluation guideline specifies what good looks like for your task. It's a document you'd hand to a human rater that describes your criteria clearly enough that two different raters would score the same response identically.

Start with your task definition from Chapter 1: what's the input, what's the output, and what does correct mean? Then make it specific:

A good support ticket summary includes the reported issue, the user's emotional state, and the resolution status.
A good summary does not include information not present in the original ticket.
A good summary is no longer than 3 sentences.

These are testable criteria. You can write an AI judge prompt directly from them. You can give this document to a labeler and get consistent results.
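As a sketch of how the three criteria above might translate into checks: the length rule can be tested deterministically, while the other two go into a rubric prompt for a judge. The sentence splitter is deliberately naive (abbreviations such as "e.g." will confuse it), and the prompt wording is an assumption rather than a recommended template.

```python
import re

CRITERIA = [
    "Includes the reported issue, the user's emotional state, and the resolution status.",
    "Contains no information that is absent from the original ticket.",
    "Is no longer than 3 sentences.",
]


def count_sentences(text: str) -> int:
    """Naive count: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def passes_length_rule(summary: str, max_sentences: int = 3) -> bool:
    """Deterministic check for the third criterion."""
    return count_sentences(summary) <= max_sentences


def build_rubric_prompt(ticket: str, summary: str) -> str:
    """Turn the guideline into a pass/fail judge prompt (wording is illustrative)."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(CRITERIA, start=1))
    return (
        "You are checking a support ticket summary against these criteria:\n"
        f"{numbered}\n\n"
        f"Ticket:\n{ticket}\n\nSummary:\n{summary}\n\n"
        "For each criterion, answer PASS or FAIL with a one-sentence reason."
    )


summary = "User cannot log in. They are frustrated. The issue is unresolved."
print(passes_length_rule(summary))
print(build_rubric_prompt("I can't log in and I'm fed up!", summary))
```

Running the cheap deterministic check first means the judge is only asked about criteria that actually need judgment.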

If you can't write this document — if you're not sure what makes a good response — you don't have a well-defined task yet. Stop and sharpen the definition before you build evaluation infrastructure.

Step 2: Collect your evaluation data.

You need two things: representative inputs and, for some evaluation methods, reference outputs.

Representative inputs are the hardest part. Your development instincts produce examples that cover cases you anticipated. They won't cover what your users actually send. The inputs that break production systems are almost never the inputs you put in your test set.

Strategies for getting representative inputs:

If you have existing users, sample from real traffic. Even 50–100 real examples are worth more than 500 you invented.
If you're pre-launch, find people who match your target user and have them generate inputs through a session.
Use edge cases you've already hit in development. Failures are more useful than successes.

Reference outputs are needed for exact evaluation and reference-based AI judging. They're expensive to create — you're paying for human judgment. Be selective. Focus reference labels on your highest-stakes evaluation criteria.

For a minimum viable eval set: 50–200 examples is enough to start detecting problems. You're not trying to prove statistical significance at this stage. You're trying to catch obvious regressions and directional improvements.
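To see what 50–200 examples can and cannot tell you, it helps to put an interval around a measured pass rate. The sketch below uses the Wilson score interval at roughly 95% confidence; that choice of interval is an assumption made here, and the pass counts are made up.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is about 95% confidence)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed 90% pass rate at three hypothetical eval set sizes.
for passes, n in [(45, 50), (90, 100), (180, 200)]:
    low, high = wilson_interval(passes, n)
    print(f"{passes}/{n}: pass rate between {low:.3f} and {high:.3f}")
```

At these sizes the interval spans several percentage points, which is why a small eval set is suited to catching obvious regressions rather than resolving one- or two-point differences.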

Step 3: Choose methods and run evaluations.

Match method to task type:

Task type	Recommended method
Classification, extraction	Exact match or functional correctness
Short factual Q&A	Exact match, then semantic similarity
Summarization, generation	AI-as-judge with rubric
Multi-candidate selection	Comparative evaluation

Start with the cheapest method that works for your task. Exact match catches regressions instantly at almost no cost. Add AI judging when you need quality measurement on open-ended outputs. Add comparative evaluation when you're choosing between options.
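The cheapest of these methods fits in a few lines. This is a minimal exact-match scorer, assuming that case, surrounding whitespace, and leading or trailing punctuation should not count as differences; whether that normalization is appropriate depends on your task.

```python
import string
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and trim punctuation at the edges."""
    collapsed = " ".join(text.lower().split())
    return collapsed.strip(string.punctuation + " ")


def exact_match_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs for a short factual question.
predictions = ["Paris", " paris. ", "Lyon"]
references = ["Paris", "Paris", "Paris"]
print(exact_match_rate(predictions, references))
```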

Choosing a model with your own data

This is where most teams leave significant performance on the table.

Public benchmarks such as MMLU, HumanEval, MT-Bench, and LMSYS Chatbot Arena rankings are useful signals for general capability. They're weak signals for whether a model performs well on your specific task, with your specific inputs, under your specific constraints.

The canonical example: when Gemini Ultra was announced, Google reported it achieved 90.0% on MMLU, surpassing GPT-4's 86.4%. The fine print: Gemini's score used a "chain-of-thought at 32 samples" methodology, while GPT-4's score used 5-shot prompting. Under the same 5-shot setup, Google's own report put Gemini Ultra at 83.7%, below GPT-4. Different evaluation methods, different scores. The reported "superiority" on MMLU told you essentially nothing about which model would perform better on your task.

The right process for model selection:

Build your evaluation set first (from Step 2 above).
Run your eval set through 2–3 candidate models.
Score with whatever method fits your task.
Pick the best-performing model on your data at the cost and latency point that's acceptable for your application.
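The last step of that process, picking the best model that fits your budget, might look like the sketch below. The model names, scores, costs, and latencies are all invented placeholders; the field names are assumptions about what you would record per candidate.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CandidateResult:
    name: str
    pass_rate: float             # measured on your own eval set
    cost_per_1k_requests: float  # in your billing currency
    p95_latency_s: float


def pick_model(results: Sequence[CandidateResult], max_cost: float,
               max_latency_s: float) -> Optional[CandidateResult]:
    """Best pass rate among candidates within budget; cheaper wins ties."""
    eligible = [r for r in results
                if r.cost_per_1k_requests <= max_cost
                and r.p95_latency_s <= max_latency_s]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r.pass_rate, -r.cost_per_1k_requests))


# Hypothetical numbers for three candidates.
results = [
    CandidateResult("large-model", pass_rate=0.94, cost_per_1k_requests=12.0, p95_latency_s=4.1),
    CandidateResult("mid-model", pass_rate=0.92, cost_per_1k_requests=3.0, p95_latency_s=1.8),
    CandidateResult("small-model", pass_rate=0.85, cost_per_1k_requests=0.6, p95_latency_s=0.9),
]
choice = pick_model(results, max_cost=5.0, max_latency_s=2.0)
print(choice.name if choice else "no candidate fits the constraints")
```

Treating cost and latency as hard limits and pass rate as the objective is one reasonable framing; a weighted score is another if small quality gains are worth paying for.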

A few things this process will often show you:

The frontier model is overkill for your task. Models like GPT-4o and Claude Opus sit at the top of the capability range, but on well-defined, constrained tasks, GPT-4o mini or Claude Haiku often come within a few percentage points of their performance at a fraction of the cost. Run the comparison before defaulting to the most powerful option.

Model behavior varies more on edge cases than core cases. Every candidate might handle your most common input pattern equally well. The variance emerges on unusual inputs: long contexts, ambiguous instructions, inputs with formatting anomalies. Design your eval set to include these.

Different models have different failure modes. One model might be more likely to refuse a borderline request; another might be more likely to hallucinate a specific fact type. These failures don't show up in benchmark rankings but they'll show up in your production data. Systematic failure mode analysis — where does each candidate fail, and what kind of failure is it — often matters more than aggregate pass rate.

What evaluation in production looks like

An eval pipeline is not a one-time project. It's ongoing infrastructure.

Production evaluation adds one thing that development evaluation can't: real user inputs. Your production traffic shows you what your users actually ask — which is different from what you guessed they'd ask when you built your eval set.

Close the loop:

Sample a fraction of production traffic. Even 1% is enough for high-volume applications.
Run it through your AI judge pipeline.
Review low-scoring outputs manually. Patterns in failures inform prompt improvements.
Route confirmed failures into your development eval set.
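The loop above could be sketched like this. Sampling is done by hashing a request ID so the same request is always in or out of the sample; the record shape ("id", "output"), the judge_score callable, and the score threshold are all assumptions for illustration.

```python
import hashlib
from typing import Callable, Dict, List


def in_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def select_for_review(records: List[Dict[str, str]],
                      judge_score: Callable[[Dict[str, str]], int],
                      rate: float = 0.01, threshold: int = 3) -> List[Dict[str, str]]:
    """Judge the sampled records and return the low scorers for manual review."""
    sampled = [r for r in records if in_sample(r["id"], rate)]
    return [r for r in sampled if judge_score(r) < threshold]


# Toy data and a stub judge that scores empty outputs as 1 and others as 5.
records = [{"id": f"req-{i}", "output": "" if i % 4 == 0 else "answer"}
           for i in range(20)]
stub_judge = lambda record: 1 if not record["output"] else 5

to_review = select_for_review(records, stub_judge, rate=1.0)
print(len(to_review), "records flagged for manual review")
```

Records that a reviewer confirms as real failures would then be appended to the development eval set, which is the step that closes the loop.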

This is the mechanism by which your system improves over time. Development evaluation tells you whether a change made things better or worse. Production evaluation tells you what cases you hadn't anticipated.

Two metrics worth tracking continuously: overall pass rate (is quality stable or degrading?) and failure category distribution (are failures concentrated in a specific input type or task variant?). A sudden shift in either is your early warning system.
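Both metrics can be computed from the same list of judged results. In this sketch each result is a dict with a "passed" flag and, for failures, a "category" assigned during manual review; those field names and the category labels are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize(results: List[Dict]) -> Tuple[float, Counter]:
    """Return (overall pass rate, count of failures per category)."""
    if not results:
        return 0.0, Counter()
    failures = [r for r in results if not r["passed"]]
    pass_rate = (len(results) - len(failures)) / len(results)
    categories = Counter(r.get("category", "uncategorized") for r in failures)
    return pass_rate, categories


# Hypothetical judged sample from one day of traffic.
results = [
    {"passed": True},
    {"passed": True},
    {"passed": False, "category": "hallucinated policy"},
    {"passed": False, "category": "hallucinated policy"},
    {"passed": False, "category": "wrong tone"},
    {"passed": False},
]
pass_rate, categories = summarize(results)
print(f"pass rate: {pass_rate:.2f}")
print(categories.most_common())
```

Comparing today's output with the same summary from last week is what turns these two numbers into an early warning.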

For model version changes, treat them like a dependency upgrade: run your full eval set against the new version before migrating. Model providers release updated versions continuously. The behavior you depend on can change silently. A regression that would have taken hours to find in production takes minutes to find if you run evals first.
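A pre-migration check can be as simple as comparing per-example results between the current and the new model version. The sketch below assumes each run is stored as a mapping from example ID to pass/fail; the IDs and the zero-regression default are illustrative choices.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """IDs that passed on the current version but fail (or are missing) on the new one."""
    return sorted(example_id for example_id, passed in baseline.items()
                  if passed and not candidate.get(example_id, False))


def safe_to_migrate(baseline: Dict[str, bool], candidate: Dict[str, bool],
                    max_regressions: int = 0) -> bool:
    """Gate the upgrade on the number of newly failing examples."""
    return len(find_regressions(baseline, candidate)) <= max_regressions


# Hypothetical results for the same eval set on two model versions.
current_version = {"ex-1": True, "ex-2": True, "ex-3": False, "ex-4": True}
new_version = {"ex-1": True, "ex-2": False, "ex-3": True, "ex-4": True}

print("regressions:", find_regressions(current_version, new_version))
print("safe to migrate:", safe_to_migrate(current_version, new_version))
```

Looking at individual regressions rather than only the aggregate pass rate matters here: in the toy data the pass rate is unchanged, yet one previously working case now fails.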

The principle

Evaluation is not overhead. It's the core engineering discipline that makes everything else possible.

You can't improve what you don't measure. You can't detect regression without a baseline. You can't choose between two prompt variants without a way to compare them.

The developers who ship reliable AI products aren't the ones with the most creative prompts. They're the ones who built evaluation infrastructure early, run it continuously, and use it to make every decision that follows.

Chapter 4 is about the interface layer — prompts, context, and output constraints. Having an evaluation pipeline is what makes iteration there tractable instead of guesswork.

Primary sources: Chip Huyen, AI Engineering (O'Reilly, 2025), Chapter 4. Chen et al., "Evaluating Large Language Models Trained on Code" (HumanEval, OpenAI 2021). Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (LMSYS, 2023). Li et al., "AlpacaEval: An Automatic Evaluator of Instruction-following Models" (2023).