Introduction
Shipping a generative feature is easy until real traffic hits. Hallucinations show up in front of real users, regressions sneak in with model upgrades, and "we'll fix it in the prompt" stops working at scale. The teams that ship reliably build an evaluation stack — offline benchmarks, online monitoring, and a release gate that turns gut-feel into a checkable signal.
Why this matters
- Without evals, every prompt change is a gamble; with evals, it's a measurement.
- Hallucinations cost trust at a non-linear rate — a few bad outputs poison the impression of the whole feature.
- Model upgrades silently change behaviour. Evals are your only early warning.
- Compliance and risk teams need evidence; evals are that evidence.
Core concepts
The eval pyramid
- Unit-level: single prompt, deterministic check.
- Task-level: golden examples with grading.
- Online-level: production sampling and judge scoring.

Build all three; the bottom is fast and cheap, the top is slow and expensive.
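As a concrete example of the bottom layer, a unit-level check can be a plain function over the output string. A minimal sketch; the check names and policy rules are illustrative, not from any particular codebase:

```python
import re

def check_refund_reply(output: str) -> list[str]:
    """Return the names of failed checks for one model output."""
    failures = []
    if len(output) > 1200:
        failures.append("too_long")
    if re.search(r"\bguarantee(d)?\b", output, re.IGNORECASE):
        failures.append("forbidden_claim")  # illustrative policy: never promise outcomes
    if "refund" not in output.lower():
        failures.append("off_topic")
    return failures

# Deterministic and millisecond-fast, so it can run on every prompt change.
assert check_refund_reply("We have started your refund; allow 5 days.") == []
```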
LLM-as-judge — when and how
Cheap and scalable, but biased. Use rubrics, not free-form judgements. Use a judge model different from the worker model. Calibrate weekly against human labels; track judge–human agreement as a first-class metric.
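Agreement is straightforward to compute once you have paired labels from a calibration batch. A minimal sketch using Cohen's kappa, assuming pass/fail labels; the sample data and the 0.4 alert floor are illustrative:

```python
from collections import Counter

def cohen_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    jc, hc = Counter(judge), Counter(human)
    expected = sum(jc[k] * hc[k] for k in set(jc) | set(hc)) / n**2
    return (observed - expected) / (1 - expected)

# Weekly calibration run: alert if agreement drifts below a floor.
judge_labels = ["pass", "fail", "pass", "pass", "fail"]
human_labels = ["pass", "fail", "pass", "fail", "fail"]
kappa = cohen_kappa(judge_labels, human_labels)
print(f"judge-human kappa: {kappa:.2f}")
assert kappa > 0.4, "judge has drifted; recalibrate before trusting its scores"
```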
Grounding checks
For RAG and assistant features, the strongest hallucination defence is checking that every claim is supported by retrieved context, with citations the user (and an automated grader) can verify.
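A minimal sketch of such a check, using content-word overlap as a cheap stand-in for a real entailment model or judge call. Overlap misses paraphrases, so treat this as a first filter, not a verdict:

```python
import re

def unsupported_sentences(answer: str, contexts: list[str],
                          min_overlap: float = 0.5) -> list[str]:
    """Return answer sentences whose content words aren't covered by any context."""
    context_words = set()
    for c in contexts:
        context_words |= set(re.findall(r"[a-z']+", c.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = [w for w in re.findall(r"[a-z']+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        coverage = sum(w in context_words for w in words) / len(words)
        if coverage < min_overlap:
            flagged.append(sentence)
    return flagged

ctx = ["The 30-day refund policy applies to all physical goods."]
answer = "The refund policy applies to physical goods. Shipping is always free."
print(unsupported_sentences(answer, ctx))  # flags the ungrounded second sentence
```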
Release gates
No prompt change ships without passing the eval suite. No model upgrade ships without a regression run. No change that regresses any segment by more than X% merges without sign-off.
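The gate itself can be a short CI script. A sketch under assumed inputs: the segment names and scores are illustrative, and the 2% threshold stands in for whatever X% the team agrees on:

```python
import sys

MAX_REGRESSION = 0.02  # illustrative stand-in for the team's agreed X%

baseline = {"en": 0.91, "de": 0.88, "long_docs": 0.83}   # last shipped run
candidate = {"en": 0.92, "de": 0.84, "long_docs": 0.83}  # proposed change

regressions = {
    seg: round(baseline[seg] - candidate[seg], 3)
    for seg in baseline
    if baseline[seg] - candidate[seg] > MAX_REGRESSION
}
if regressions:
    print(f"GATE FAILED, needs sign-off: {regressions}")
    sys.exit(1)  # non-zero exit fails the CI job
print("gate passed")
```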
Practical patterns
Golden set with provenance
A curated set of 100–1000 examples with expected behaviours, source-tagged so you can argue about edge cases later.
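One possible shape for a golden example, so provenance and segment travel with every record; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenExample:
    input: str
    expected_behaviour: str  # rubric text, not an exact string match
    source: str              # provenance: ticket, incident, synthetic, ...
    segment: str             # used later for sliced reporting
    tags: tuple[str, ...] = ()

example = GoldenExample(
    input="Can I get a refund after 60 days?",
    expected_behaviour="States the 30-day policy; does not promise exceptions.",
    source="support-ticket-escalation",
    segment="en/policy",
)
```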
Production sampling
Continuously sample 1–5% of live traffic, run it through the eval suite, and alert on score drops or distribution shifts.
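Hash-based sampling keeps the decision deterministic per request, so the sampled set is reproducible across reruns. A sketch, assuming a string request id and a 2% rate within the 1–5% band:

```python
import hashlib

SAMPLE_RATE = 0.02  # illustrative: within the 1-5% band

def should_sample(request_id: str) -> bool:
    """Map the request id to a stable bucket in [0, 1) and compare to the rate."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

sampled = [rid for rid in (f"req-{i}" for i in range(10_000)) if should_sample(rid)]
print(f"{len(sampled)} of 10000 requests queued for eval")  # roughly 200 expected
```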
Citation verifier
Programmatically check that every cited source contains the claim it's cited for; flag unsupported sentences.
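A minimal verifier, assuming the answer carries [n]-style citation markers and a dict of source texts; word overlap again stands in for a proper entailment check:

```python
import re

def flag_bad_citations(answer: str, sources: dict[int, str]) -> list[str]:
    """Flag sentences that are uncited or unsupported by their cited sources."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        if not cited:
            flagged.append(f"uncited: {sentence}")
            continue
        text = re.sub(r"\[\d+\]", "", sentence)
        words = [w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3]
        support = " ".join(sources.get(n, "") for n in cited).lower()
        if words and sum(w in support for w in words) / len(words) < 0.5:
            flagged.append(f"unsupported: {sentence}")
    return flagged
```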
Segment-aware reporting
Slice eval results by user segment, language, document type. The average is rarely the story.
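Slicing takes only a few lines once each result carries a segment tag. A sketch with illustrative scores, printing each slice next to the often-misleading overall average:

```python
from collections import defaultdict
from statistics import mean

results = [  # illustrative (segment, score) pairs from one eval run
    ("en", 0.95), ("en", 0.92), ("de", 0.70), ("de", 0.64), ("long_docs", 0.55),
]
by_segment = defaultdict(list)
for segment, score in results:
    by_segment[segment].append(score)

print(f"overall: {mean(s for _, s in results):.2f}")
for segment, scores in sorted(by_segment.items()):
    print(f"{segment:>10}: {mean(scores):.2f}  (n={len(scores)})")
```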
Pitfalls to avoid
- Treating the demo set as the eval set; demo prompts are unrepresentative.
- Single-number eval scores; they hide segment regressions.
- Trusting LLM-as-judge without recalibration as judge models change.
- Conflating "correct" with "useful"; a factually correct answer can still fail the user's task, so measure both.
Key takeaways
1. Evals are infrastructure. Fund them like infra.
2. Build the pyramid: cheap-fast-many at the bottom, slow-expensive-few at the top.
3. Calibrate judges; don't deploy and forget.
4. No prompt change without an eval result.