Introduction
Agent usage is exploding: autonomous, dynamic, decision-making programs are running at unprecedented volume. They do extraordinary things, and they also hallucinate, misunderstand, and occasionally cause real harm. Internet-scale agent observability isn't just APM with extra fields. It needs new primitives: trace shapes that match agent topologies, signals that capture semantic drift, and dashboards an on-call engineer can actually act on at 2am.
Why this matters
- You can't debug what you can't see; agent behaviour is largely invisible without instrumentation.
- Cost incidents (runaway loops, retry storms) hit faster than human reaction times.
- Quality drift is silent: yesterday's working agent quietly becomes today's broken one.
- Compliance and incident response require auditable traces of automated decisions.
Core concepts
Span semantics for agent runs
A single agent task is a tree, not a line: planning span, tool-call spans, sub-agent spans, retries. Each span carries token counts, latency, and (selectively) prompt + completion content.
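Here's a minimal sketch in Python with the OpenTelemetry SDK; the span names and llm.* attribute keys are illustrative, not a standard:

```python
# Sketch: one agent task as a span tree. Span names and the llm.* / agent.*
# attribute keys are illustrative choices, not a convention.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")

with tracer.start_as_current_span("agent.task") as task:
    task.set_attribute("agent.version", "2024-06-01")  # correlate traces to the version
    with tracer.start_as_current_span("agent.plan") as plan:
        plan.set_attribute("llm.input_tokens", 412)
        plan.set_attribute("llm.output_tokens", 128)
    for tool in ("search", "calculator"):
        with tracer.start_as_current_span(f"agent.tool.{tool}") as call:
            call.set_attribute("tool.retries", 0)
    with tracer.start_as_current_span("agent.subagent.summarizer"):
        pass  # a sub-agent run nests its own plan/tool spans here
```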
Three signal types
- Operational: latency, errors, tokens, cost.
- Behavioural: tool-call rate, retry rate, escalation rate.
- Semantic: output quality, factuality, sentiment, schema adherence.
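A rough sketch of how the three families might map to concrete fields per run (names are illustrative; adapt to your metrics backend):

```python
# Sketch: the three signal families for a single agent run.
from dataclasses import dataclass

@dataclass
class OperationalSignals:
    latency_ms: float
    error: bool
    total_tokens: int
    cost_usd: float

@dataclass
class BehaviouralSignals:
    tool_calls: int
    retries: int
    escalated: bool

@dataclass
class SemanticSignals:
    quality_score: float  # e.g. judge-scored on a 0-1 scale
    schema_valid: bool
    refusal: bool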
Drift detection
Compare today's output distribution to last week's. Look for shifts in length, tool-call frequency, refusal rate, and sample-judged quality. Alert on shifts, not absolutes.
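One way to operationalize this is the population stability index over an output feature such as response length; a sketch, using the common-but-arbitrary 0.2 "investigate" threshold:

```python
# Sketch: PSI between last week's and today's output-length distributions.
# One drift test among several (KS, chi-square would also work).
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """PSI over bins fitted to the baseline; >0.2 is a common alert threshold."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    o_pct = np.histogram(observed, bins=edges)[0] / len(observed)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) on empty bins
    o_pct = np.clip(o_pct, 1e-6, None)
    return float(np.sum((o_pct - e_pct) * np.log(o_pct / e_pct)))

last_week = np.random.default_rng(0).normal(350, 60, 5000)  # token lengths
today = np.random.default_rng(1).normal(410, 60, 5000)      # drifted upward
if psi(last_week, today) > 0.2:
    print("length distribution drifted - investigate before users notice")
```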
Privacy and redaction
Agent traces are PII goldmines. Redaction must happen at capture, not query, and you need a key-management story for the cases where you're forced to capture sensitive content.
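A minimal sketch of capture-time scrubbing; the regexes are a floor, not a ceiling, and production systems typically layer a dedicated PII detector on top:

```python
# Sketch: scrub prompt content before it ever reaches a span attribute.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.\w{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

prompt = "Contact jane.doe@example.com about SSN 123-45-6789"
# span.set_attribute("gen_ai.prompt", redact(prompt))  # at capture, inside a span
print(redact(prompt))  # Contact [EMAIL] about SSN [SSN]
```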
Practical patterns
OpenTelemetry for LLMs
Use OTel semantic conventions for GenAI; emit spans your existing APM stack already understands.
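A sketch of what that looks like; the gen_ai.* attribute names follow the OTel GenAI semantic conventions, which are still marked experimental upstream, so pin the semconv version you emit:

```python
# Sketch: an LLM call span with OTel GenAI semantic-convention attributes.
# Assumes a TracerProvider is configured (e.g. as in the earlier sketch).
from opentelemetry import trace

tracer = trace.get_tracer("gateway")

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 412)
    span.set_attribute("gen_ai.usage.output_tokens", 128)
```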
Sampled deep traces
Capture full content for ~1% of traces, headers + counts for the rest. Great for debugging without ballooning cost.
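A sketch of deterministic trace-level sampling, so every span in a run makes the same capture decision; the rate, attribute names, and helper are illustrative:

```python
# Sketch: gate content capture on a deterministic ~1% sample of trace IDs.
DEEP_SAMPLE_RATE = 0.01

def capture_content(trace_id: int) -> bool:
    # OTel trace IDs are uniformly distributed, so a modulus gives a stable sample.
    return (trace_id % 10_000) < DEEP_SAMPLE_RATE * 10_000

def record_llm_span(span, prompt: str, completion: str):
    ctx = span.get_span_context()
    # Counts always; word-split is a placeholder for a real tokenizer.
    span.set_attribute("llm.input_tokens", len(prompt.split()))
    span.set_attribute("llm.output_tokens", len(completion.split()))
    if capture_content(ctx.trace_id):  # full content only for sampled traces
        span.set_attribute("gen_ai.prompt", prompt)
        span.set_attribute("gen_ai.completion", completion)
```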
Shadow eval pipeline
Replay a sample of production prompts through your eval suite continuously; alarm when scores drop.
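A sketch of the loop; sample_production_prompts, run_agent, and judge_quality stand in for your trace store, agent endpoint, and eval suite, and the 0.05 drop threshold is arbitrary:

```python
# Sketch: one tick of a continuous shadow eval, run on a schedule.
import statistics

BASELINE_SCORE = 0.82  # rolling mean from the last known-good window

def shadow_eval_tick(sample_production_prompts, run_agent, judge_quality,
                     alert, drop_threshold=0.05):
    prompts = sample_production_prompts(n=50)
    scores = [judge_quality(p, run_agent(p)) for p in prompts]
    mean_score = statistics.mean(scores)
    if mean_score < BASELINE_SCORE - drop_threshold:
        alert(f"shadow eval dropped to {mean_score:.2f} "
              f"(baseline {BASELINE_SCORE:.2f})")
    return mean_score
```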
Cost guardrails
Per-tenant, per-task cost ceilings enforced at the gateway level, so a runaway agent loop can't bankrupt you.
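A sketch of the check; the budget, ledger, and helper names are illustrative, and a real gateway would back the ledger with a shared store (e.g. Redis) for atomicity across replicas:

```python
# Sketch: a per-tenant, per-task cost ceiling checked before each model call.
from collections import defaultdict

TASK_BUDGET_USD = 2.00
spend = defaultdict(float)  # (tenant_id, task_id) -> accumulated cost

class BudgetExceeded(Exception):
    pass

def charge(tenant_id: str, task_id: str, estimated_cost_usd: float):
    key = (tenant_id, task_id)
    if spend[key] + estimated_cost_usd > TASK_BUDGET_USD:
        # Fail the call, not the bank account: the agent gets an error
        # it can surface, and the loop is broken.
        raise BudgetExceeded(f"task {task_id} would exceed ${TASK_BUDGET_USD}")
    spend[key] += estimated_cost_usd
```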
Pitfalls to avoid
- Logging entire prompts and completions raw: PII exposure, IP leaks, and storage costs all spike.
- Treating an agent run as a single span; you lose all the structure.
- Alerting on absolute quality scores; thresholds drift, deltas are more informative.
- No correlation between traces and the prompt/agent version that produced them.
Key takeaways
1. Adopt OTel GenAI conventions early; the ecosystem is converging.
2. Capture structure, not just text. The tree of an agent run is the diagnostic.
3. Redact at capture, sample wisely, and store with a retention policy.
4. Watch for drift; absolute metrics lie.