Introduction
Every time you tell an agent it broke the layout, output the wrong schema, or violated an invariant, you are the feedback loop. The teams achieving the best outcomes right now are focused on building better systems: automated feedback that lets agents check their own work. Closing the loop means moving humans out of the inner cycle so they can spend their attention on the cases that matter.
Why this matters
- Human-in-the-inner-loop doesn't scale; you can't out-type the model.
- Automated evaluators turn quality into a measurable signal the agent can react to.
- Closed loops compound: each iteration improves; open loops oscillate.
- The shape of your feedback determines the shape of your agent's behaviour.
Core concepts
Inner loop vs. outer loop
Inner loop: the agent's self-correction within a single task. Outer loop: human review and improvement of the agent's system prompt, tools, and policies. Push as much as possible into the inner loop.
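A minimal sketch of that split, with the inner loop as a capped retry cycle; `run_agent` and `check_output` are illustrative stubs, not any particular framework's API:

```python
# Sketch of the inner loop: the agent retries against automated feedback;
# the human stays in the outer loop (revising prompts, tools, policies).
# run_agent and check_output are illustrative stubs.

MAX_ATTEMPTS = 5

def run_agent(task: str, feedback: str | None) -> str:
    return "candidate output"  # stand-in for a model call

def check_output(output: str) -> tuple[bool, str | None]:
    return True, None          # stand-in for a grounded check (tests, schema, citations)

def inner_loop(task: str) -> str:
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        output = run_agent(task, feedback)
        ok, feedback = check_output(output)
        if ok:
            return output      # self-corrected without human attention
    raise RuntimeError("retries exhausted; escalate to the outer loop")
```

Everything a human would have typed as a correction lives in `check_output`; the human only sees the escalation.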
Evaluators vs. validators
Validators check structural correctness (schema, syntax, types). Evaluators judge semantic correctness (does this answer the question? is this the right tone?). You need both, and they have very different cost profiles.
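One way to wire the two together, sketched in Python: a deterministic validator runs on every output, and the pricier semantic evaluator only runs once the structure is sound (the judge call here is a stand-in):

```python
import json

REQUIRED_KEYS = {"answer", "sources"}

def validate(raw: str) -> tuple[bool, str]:
    """Structural check (cheap, deterministic): valid JSON object with required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    if not isinstance(data, dict):
        return False, "expected a JSON object"
    missing = REQUIRED_KEYS - data.keys()
    return (not missing, f"missing keys: {sorted(missing)}" if missing else "ok")

def evaluate(raw: str) -> tuple[bool, str]:
    """Semantic check (expensive): stand-in for a separate judge-model call."""
    return True, "judge approved"   # replace with a real rubric-based judge

def check(raw: str) -> tuple[bool, str]:
    ok, reason = validate(raw)       # run on every output
    if not ok:
        return False, reason
    return evaluate(raw)             # run only when structure is already sound
```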
Runtime invariants
Define properties that must hold at every step: "no PII in outputs," "tool calls only to allow-listed endpoints," "max 5 retries." Enforce them at runtime, not in review.
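A sketch of what runtime enforcement can look like; the allow-list, retry cap, and email regex below are illustrative stand-ins, and a real PII check would be more than a regex:

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.internal.example"}           # illustrative allow-list
MAX_RETRIES = 5
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # crude PII proxy, not production-grade

def enforce_invariants(output: str, tool_url: str, retries: int) -> None:
    """Raise immediately if any runtime invariant is violated."""
    if EMAIL_RE.search(output):
        raise ValueError("invariant violated: PII (email) in output")
    if urlparse(tool_url).hostname not in ALLOWED_HOSTS:
        raise ValueError(f"invariant violated: {tool_url} not on the allow-list")
    if retries > MAX_RETRIES:
        raise ValueError("invariant violated: retry budget exceeded")
```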
Self-correction with grounding
The agent only improves if its self-judgement is grounded in tests, schemas, citations, or external checks. Pure LLM-as-judge without grounding tends to drift.
Practical patterns
Schema-first outputs
Every agent output passes through a JSON schema or typed validator before being acted on.
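For example, using the `jsonschema` package (the schema below is illustrative); the validation message is exactly the kind of feedback you can hand back to the agent on the next turn:

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

TICKET_SCHEMA = {  # illustrative schema for one agent output type
    "type": "object",
    "required": ["title", "priority"],
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "additionalProperties": False,
}

def accept_output(raw: str) -> tuple[dict | None, str]:
    """Validate the agent's output before anything downstream acts on it."""
    try:
        data = json.loads(raw)
        validate(instance=data, schema=TICKET_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as exc:
        return None, str(exc)   # feed this message back to the agent
    return data, "ok"
```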
Test-execute-correct
For code agents: run the code, capture failures, feed them back as the next-turn prompt.
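A sketch of that cycle, assuming a `pytest` suite run via `subprocess`; the prompt template is illustrative:

```python
import subprocess

def run_tests(path: str = "tests") -> tuple[bool, str]:
    """Run the test suite and capture failure output for the next turn."""
    result = subprocess.run(
        ["pytest", path, "-q", "--maxfail=5"],
        capture_output=True, text=True, timeout=300,
    )
    return result.returncode == 0, result.stdout + result.stderr

def next_turn_prompt(task: str, failures: str) -> str:
    """Feed the captured failures back to the agent verbatim."""
    return (
        f"{task}\n\nYour previous attempt failed these tests:\n"
        f"{failures}\n\nFix the code so the suite passes."
    )
```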
Citation grounding
For RAG agents: every claim must cite a retrieved source; the citation is checked against the source.
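A minimal check, assuming the agent emits each claim as a (text, source_id, quote) triple; the data shapes are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source_id: str
    quote: str          # span the agent says supports the claim

def check_citations(claims: list[Claim], retrieved: dict[str, str]) -> list[str]:
    """Return a grounded failure message for every claim that doesn't check out."""
    problems = []
    for claim in claims:
        source = retrieved.get(claim.source_id)
        if source is None:
            problems.append(f"claim '{claim.text}' cites unknown source {claim.source_id}")
        elif claim.quote not in source:
            problems.append(f"quote for '{claim.text}' not found in {claim.source_id}")
    return problems   # empty list means every claim is grounded in a retrieved source
```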
LLM-as-judge with rubrics
Use a separate model with a structured rubric for semantic checks; calibrate against human labels weekly.
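A sketch of a rubric-driven judge; `call_judge_model` is a placeholder for whatever separate model you call, and the rubric fields are illustrative:

```python
import json

RUBRIC = """Score the answer 1-5 on each criterion and return JSON:
{"faithfulness": int, "completeness": int, "tone": int, "rationale": str}
- faithfulness: claims are supported by the provided sources
- completeness: the question is fully answered
- tone: matches the requested style
"""

def call_judge_model(prompt: str) -> str:
    """Placeholder for a call to a separate judge model (not the worker model)."""
    return '{"faithfulness": 5, "completeness": 4, "tone": 5, "rationale": "..."}'

def judge(question: str, answer: str, sources: str) -> dict:
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\nSources:\n{sources}\n\nAnswer:\n{answer}"
    verdict = json.loads(call_judge_model(prompt))
    # Log verdicts alongside human labels so the rubric can be recalibrated weekly.
    return verdict
```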
Pitfalls to avoid
- Looping forever: always cap retries.
- Using the same model as both worker and judge; without diversity of models or prompts, it tends to agree with itself.
- Treating the eval as the output. The eval scores a candidate; the candidate is still the deliverable.
- Overweighting LLM-as-judge results without periodic human calibration.
Key takeaways
1. Make the agent's feedback loop visible and measurable.
2. Combine cheap structural validators with selective semantic evaluators.
3. Cap retries and degrade gracefully.
4. Calibrate LLM judges against humans on a regular cadence.
Go deeper · external resources
Curated reading list to take you from primer to practitioner. All links are external and free to read.