Introduction
Shipping a generative feature is easy until real traffic hits. Hallucinations show up in front of real users, regressions sneak in with model upgrades, and "we'll fix it in the prompt" stops working at scale. The teams that ship reliably build an evaluation stack — offline benchmarks, online monitoring, and a release gate that turns gut-feel into a checkable signal.
Why this matters
- Without evals, every prompt change is a gamble; with evals, it's a measurement.
- Hallucinations cost trust at a non-linear rate — a few bad outputs poison the impression of the whole feature.
- Model upgrades silently change behaviour. Evals are your only early warning.
- Compliance and risk teams need evidence; evals are that evidence.
Core concepts
The eval pyramid
- Unit-level: single prompt, deterministic check.
- Task-level: golden examples with grading.
- Online-level: production sampling and judge scoring.

Build all three; the bottom is fast and cheap, the top is slow and expensive.
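As a concrete example of the bottom layer, a unit-level check can be a plain function over the output string. A minimal sketch; the check names and policy rules are illustrative, not from any particular codebase:

```python
import re

def check_refund_reply(output: str) -> list[str]:
    """Return the names of failed checks for one model output."""
    failures = []
    if len(output) > 1200:
        failures.append("too_long")
    if re.search(r"\bguarantee(d)?\b", output, re.IGNORECASE):
        failures.append("forbidden_claim")  # illustrative policy: never promise outcomes
    if "refund" not in output.lower():
        failures.append("off_topic")
    return failures

# Deterministic and millisecond-fast, so it can run on every prompt change.
assert check_refund_reply("We have started your refund; allow 5 days.") == []
```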
LLM-as-judge — when and how
Cheap and scalable, but biased. Use rubrics, not free-form judgements. Use a judge model different from the worker model. Calibrate weekly against human labels; track judge–human agreement as a first-class metric.
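Agreement is straightforward to compute once you have paired labels from a calibration batch. A minimal sketch using Cohen's kappa, assuming pass/fail labels; the sample data and the 0.4 alert floor are illustrative:

```python
from collections import Counter

def cohen_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    jc, hc = Counter(judge), Counter(human)
    expected = sum(jc[k] * hc[k] for k in set(jc) | set(hc)) / n**2
    return (observed - expected) / (1 - expected)

# Weekly calibration run: alert if agreement drifts below a floor.
judge_labels = ["pass", "fail", "pass", "pass", "fail"]
human_labels = ["pass", "fail", "pass", "fail", "fail"]
kappa = cohen_kappa(judge_labels, human_labels)
print(f"judge-human kappa: {kappa:.2f}")
assert kappa > 0.4, "judge has drifted; recalibrate before trusting its scores"
```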
Grounding checks
For RAG and assistant features, the strongest hallucination defence is checking that every claim is supported by retrieved context, with citations the user (and an automated grader) can verify.
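A minimal sketch of such a check, using content-word overlap as a cheap stand-in for a real entailment model or judge call. Overlap misses paraphrases, so treat this as a first filter, not a verdict:

```python
import re

def unsupported_sentences(answer: str, contexts: list[str],
                          min_overlap: float = 0.5) -> list[str]:
    """Return answer sentences whose content words aren't covered by any context."""
    context_words = set()
    for c in contexts:
        context_words |= set(re.findall(r"[a-z']+", c.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = [w for w in re.findall(r"[a-z']+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        coverage = sum(w in context_words for w in words) / len(words)
        if coverage < min_overlap:
            flagged.append(sentence)
    return flagged

ctx = ["The 30-day refund policy applies to all physical goods."]
answer = "The refund policy applies to physical goods. Shipping is always free."
print(unsupported_sentences(answer, ctx))  # flags the ungrounded second sentence
```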
Release gates
No prompt change ships without passing the eval suite. No model upgrade ships without a regression run. No change that regresses any segment by more than X% merges without sign-off.
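The gate itself can be a short CI script. A sketch under assumed inputs: the segment names and scores are illustrative, and the 2% threshold stands in for whatever X% the team agrees on:

```python
import sys

MAX_REGRESSION = 0.02  # illustrative stand-in for the team's agreed X%

baseline = {"en": 0.91, "de": 0.88, "long_docs": 0.83}   # last shipped run
candidate = {"en": 0.92, "de": 0.84, "long_docs": 0.83}  # proposed change

regressions = {
    seg: round(baseline[seg] - candidate[seg], 3)
    for seg in baseline
    if baseline[seg] - candidate[seg] > MAX_REGRESSION
}
if regressions:
    print(f"GATE FAILED, needs sign-off: {regressions}")
    sys.exit(1)  # non-zero exit fails the CI job
print("gate passed")
```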
Practical patterns
Golden set with provenance
A curated set of 100–1000 examples with expected behaviours, source-tagged so you can argue about edge cases later.
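One possible shape for a golden example, so provenance and segment travel with every record; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenExample:
    input: str
    expected_behaviour: str  # rubric text, not an exact string match
    source: str              # provenance: ticket, incident, synthetic, ...
    segment: str             # used later for sliced reporting
    tags: tuple[str, ...] = ()

example = GoldenExample(
    input="Can I get a refund after 60 days?",
    expected_behaviour="States the 30-day policy; does not promise exceptions.",
    source="support-ticket-escalation",
    segment="en/policy",
)
```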
Production sampling
Continuously sample 1–5% of live traffic, run it through the eval suite, and alert on score drops or distribution shifts.
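Hash-based sampling keeps the decision deterministic per request, so the sampled set is reproducible across reruns. A sketch, assuming a string request id and a 2% rate within the 1–5% band:

```python
import hashlib

SAMPLE_RATE = 0.02  # illustrative: within the 1-5% band

def should_sample(request_id: str) -> bool:
    """Map the request id to a stable bucket in [0, 1) and compare to the rate."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

sampled = [rid for rid in (f"req-{i}" for i in range(10_000)) if should_sample(rid)]
print(f"{len(sampled)} of 10000 requests queued for eval")  # roughly 200 expected
```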
Citation verifier
Programmatically check that every cited source contains the claim it's cited for; flag unsupported sentences.
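A minimal verifier, assuming the answer carries [n]-style citation markers and a dict of source texts; word overlap again stands in for a proper entailment check:

```python
import re

def flag_bad_citations(answer: str, sources: dict[int, str]) -> list[str]:
    """Flag sentences that are uncited or unsupported by their cited sources."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        if not cited:
            flagged.append(f"uncited: {sentence}")
            continue
        text = re.sub(r"\[\d+\]", "", sentence)
        words = [w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3]
        support = " ".join(sources.get(n, "") for n in cited).lower()
        if words and sum(w in support for w in words) / len(words) < 0.5:
            flagged.append(f"unsupported: {sentence}")
    return flagged
```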
Segment-aware reporting
Slice eval results by user segment, language, document type. The average is rarely the story.
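Slicing takes only a few lines once each result carries a segment tag. A sketch with illustrative scores, printing each slice next to the often-misleading overall average:

```python
from collections import defaultdict
from statistics import mean

results = [  # illustrative (segment, score) pairs from one eval run
    ("en", 0.95), ("en", 0.92), ("de", 0.70), ("de", 0.64), ("long_docs", 0.55),
]
by_segment = defaultdict(list)
for segment, score in results:
    by_segment[segment].append(score)

print(f"overall: {mean(s for _, s in results):.2f}")
for segment, scores in sorted(by_segment.items()):
    print(f"{segment:>10}: {mean(scores):.2f}  (n={len(scores)})")
```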
Pitfalls to avoid
- Treating the demo set as the eval set; demo prompts are unrepresentative.
- Single-number eval scores; they hide segment regressions.
- Trusting LLM-as-judge without recalibration as judge models change.
- Conflating "correct" with "useful"; a factually correct answer can still fail the user's task, so measure both.
Key takeaways
1. Evals are infrastructure. Fund them like infra.
2. Build the pyramid: cheap-fast-many at the bottom, slow-expensive-few at the top.
3. Calibrate judges; don't deploy and forget.
4. No prompt change without an eval result.