Introduction
Every time you tell an agent it broke the layout, output the wrong schema, or violated an invariant, you are the feedback loop. The teams achieving the best outcomes right now are focused on building better systems: automated feedback that lets agents check their own work. Closing the loop means moving humans out of the inner cycle so they can spend their attention on the cases that matter.
Why this matters
- Human-in-the-inner-loop doesn't scale; you can't out-type the model.
- Automated evaluators turn quality into a measurable signal the agent can react to.
- Closed loops compound: each iteration improves; open loops oscillate.
- The shape of your feedback determines the shape of your agent's behaviour.
Core concepts
Inner loop vs. outer loop
Inner loop: the agent's self-correction within a single task. Outer loop: human review and improvement of the agent's system prompt, tools, and policies. Push as much as possible into the inner loop.
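A minimal sketch of that split, with the inner loop as a capped retry cycle; `run_agent` and `check_output` are illustrative stubs, not any particular framework's API:

```python
# Sketch of the inner loop: the agent retries against automated feedback;
# the human stays in the outer loop (revising prompts, tools, policies).
# run_agent and check_output are illustrative stubs.

MAX_ATTEMPTS = 5

def run_agent(task: str, feedback: str | None) -> str:
    return "candidate output"  # stand-in for a model call

def check_output(output: str) -> tuple[bool, str | None]:
    return True, None          # stand-in for a grounded check (tests, schema, citations)

def inner_loop(task: str) -> str:
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        output = run_agent(task, feedback)
        ok, feedback = check_output(output)
        if ok:
            return output      # self-corrected without human attention
    raise RuntimeError("retries exhausted; escalate to the outer loop")
```

Everything a human would have typed as a correction lives in `check_output`; the human only sees the escalation.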
Evaluators vs. validators
Validators check structural correctness (schema, syntax, types). Evaluators judge semantic correctness (does this answer the question? is this the right tone?). You need both, and they have very different cost profiles.
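One way to wire the two together, sketched in Python: a deterministic validator runs on every output, and the pricier semantic evaluator only runs once the structure is sound (the judge call here is a stand-in):

```python
import json

REQUIRED_KEYS = {"answer", "sources"}

def validate(raw: str) -> tuple[bool, str]:
    """Structural check (cheap, deterministic): valid JSON object with required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return False, "expected a JSON object"
    missing = REQUIRED_KEYS - data.keys()
    return (not missing, f"missing keys: {sorted(missing)}" if missing else "ok")

def evaluate(raw: str) -> tuple[bool, str]:
    """Semantic check (expensive): stand-in for a separate judge-model call."""
    return True, "judge approved"   # replace with a real rubric-based judge

def check(raw: str) -> tuple[bool, str]:
    ok, reason = validate(raw)       # run on every output
    if not ok:
        return False, reason
    return evaluate(raw)             # run only when structure is already sound
```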
Runtime invariants
Define properties that must hold at every step: "no PII in outputs," "tool calls only to allow-listed endpoints," "max 5 retries." Enforce them at runtime, not in review.
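A sketch of what runtime enforcement can look like; the allow-list, retry cap, and email regex below are illustrative stand-ins, and a real PII check would be more than a regex:

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.internal.example"}           # illustrative allow-list
MAX_RETRIES = 5
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # crude PII proxy, not production-grade

def enforce_invariants(output: str, tool_url: str, retries: int) -> None:
    """Raise immediately if any runtime invariant is violated."""
    if EMAIL_RE.search(output):
        raise ValueError("invariant violated: PII (email) in output")
    if urlparse(tool_url).hostname not in ALLOWED_HOSTS:
        raise ValueError(f"invariant violated: {tool_url} not on the allow-list")
    if retries > MAX_RETRIES:
        raise ValueError("invariant violated: retry budget exceeded")
```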
Self-correction with grounding
The agent only improves if its self-judgement is grounded in tests, schemas, citations, or external checks. Pure LLM-as-judge without grounding tends to drift.
Practical patterns
Schema-first outputs
Every agent output passes through a JSON schema or typed validator before being acted on.
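For example, using the `jsonschema` package (the schema below is illustrative); the validation message is exactly the kind of feedback you can hand back to the agent on the next turn:

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

TICKET_SCHEMA = {  # illustrative schema for one agent output type
    "type": "object",
    "required": ["title", "priority"],
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "additionalProperties": False,
}

def accept_output(raw: str) -> tuple[dict | None, str]:
    """Validate the agent's output before anything downstream acts on it."""
    try:
        data = json.loads(raw)
        validate(instance=data, schema=TICKET_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as exc:
        return None, str(exc)   # feed this message back to the agent
    return data, "ok"
```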
Test-execute-correct
For code agents: run the code, capture failures, feed them back as the next-turn prompt.
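A sketch of that cycle, assuming a `pytest` suite run via `subprocess`; the prompt template is illustrative:

```python
import subprocess

def run_tests(path: str = "tests") -> tuple[bool, str]:
    """Run the test suite and capture failure output for the next turn."""
    result = subprocess.run(
        ["pytest", path, "-q", "--maxfail=5"],
        capture_output=True, text=True, timeout=300,
    )
    return result.returncode == 0, result.stdout + result.stderr

def next_turn_prompt(task: str, failures: str) -> str:
    """Feed the captured failures back to the agent verbatim."""
    return (
        f"{task}\n\nYour previous attempt failed these tests:\n"
        f"{failures}\n\nFix the code so the suite passes."
    )
```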
Citation grounding
For RAG agents: every claim must cite a retrieved source; the citation is checked against the source.
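A minimal check, assuming the agent emits each claim as a (text, source_id, quote) triple; the data shapes are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source_id: str
    quote: str          # span the agent says supports the claim

def check_citations(claims: list[Claim], retrieved: dict[str, str]) -> list[str]:
    """Return a grounded failure message for every claim that doesn't check out."""
    problems = []
    for claim in claims:
        source = retrieved.get(claim.source_id)
        if source is None:
            problems.append(f"claim '{claim.text}' cites unknown source {claim.source_id}")
        elif claim.quote not in source:
            problems.append(f"quote for '{claim.text}' not found in {claim.source_id}")
    return problems   # empty list means every claim is grounded in a retrieved source
```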
LLM-as-judge with rubrics
Use a separate model with a structured rubric for semantic checks; calibrate against human labels weekly.
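A sketch of a rubric-driven judge; `call_judge_model` is a placeholder for whatever separate model you call, and the rubric fields are illustrative:

```python
import json

RUBRIC = """Score the answer 1-5 on each criterion and return JSON:
{"faithfulness": int, "completeness": int, "tone": int, "rationale": str}
- faithfulness: claims are supported by the provided sources
- completeness: the question is fully answered
- tone: matches the requested style
"""

def call_judge_model(prompt: str) -> str:
    """Placeholder for a call to a separate judge model (not the worker model)."""
    return '{"faithfulness": 5, "completeness": 4, "tone": 5, "rationale": "..."}'

def judge(question: str, answer: str, sources: str) -> dict:
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\nSources:\n{sources}\n\nAnswer:\n{answer}"
    verdict = json.loads(call_judge_model(prompt))
    # Log verdicts alongside human labels so the rubric can be recalibrated weekly.
    return verdict
```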
Pitfalls to avoid
- Looping forever: always cap retries.
- Using the same model as both worker and judge; without diversity of models or prompts, it tends to agree with itself.
- Treating the eval as the output. The eval scores a candidate; the candidate is still the deliverable.
- Overweighting LLM-as-judge results without periodic human calibration.
Key takeaways
1. Make the agent's feedback loop visible and measurable.
2. Combine cheap structural validators with selective semantic evaluators.
3. Cap retries and degrade gracefully.
4. Calibrate LLM judges against humans on a regular cadence.
Go deeper · external resources
Curated reading list to take you from primer to practitioner. All links are external and free to read.