
Agent Observability at Internet Scale

What to log, how to trace multi-step agents, and how to detect drift in production.

Introduction

Agent usage is exploding, with unprecedented volumes of autonomous, dynamic, decision-making programs. They do extraordinary things; they also hallucinate, misunderstand, and occasionally cause real harm. Internet-scale agent observability isn't just APM with extra fields. It needs new primitives: trace shapes that match agent topologies, signals that capture semantic drift, and dashboards an on-call engineer can actually act on at 2am.

Why this matters

  • You can't debug what you can't see; agent behaviour is largely invisible without instrumentation.
  • Cost incidents (runaway loops, retry storms) hit faster than human reaction times.
  • Quality drift is silent: yesterday's working agent quietly becomes today's broken one.
  • Compliance and incident response require auditable traces of automated decisions.

Core concepts

1. Span semantics for agent runs

A single agent task is a tree, not a line: planning span, tool-call spans, sub-agent spans, retries. Each span carries token counts, latency, and (selectively) prompt + completion content.
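
A minimal sketch of that tree using the OpenTelemetry Python API (the tracer is a safe no-op if no SDK is configured; span and attribute names here are illustrative, not a fixed convention):

    from opentelemetry import trace

    tracer = trace.get_tracer("agent-demo")

    def run_task(task_id: str) -> None:
        # Root span: one agent task is one trace tree.
        with tracer.start_as_current_span("agent.task") as task:
            task.set_attribute("agent.task.id", task_id)

            # Planning span carries token counts for the planning call.
            with tracer.start_as_current_span("agent.plan") as plan:
                plan.set_attribute("gen_ai.usage.input_tokens", 412)
                plan.set_attribute("gen_ai.usage.output_tokens", 96)

            # Each tool call, and each retry, is its own child span.
            for attempt in (1, 2):
                with tracer.start_as_current_span("agent.tool_call") as tool:
                    tool.set_attribute("tool.name", "web_search")
                    tool.set_attribute("retry.attempt", attempt)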

2. Three signal types

Operational (latency, errors, tokens, cost). Behavioural (tool-call rate, retry rate, escalation rate). Semantic (output quality, factuality, sentiment, schema adherence).
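
One way to wire these up, sketched with the OpenTelemetry metrics API; instrument names are hypothetical, and semantic scores are usually sampled rather than recorded on every run:

    from opentelemetry import metrics

    meter = metrics.get_meter("agent-signals-demo")

    # Operational: raw resource usage per request.
    token_counter = meter.create_counter("agent.tokens", unit="{token}")
    cost_counter = meter.create_counter("agent.cost", unit="usd")

    # Behavioural: how the agent is acting, not what it costs.
    tool_calls = meter.create_counter("agent.tool_calls")
    retries = meter.create_counter("agent.retries")

    # Semantic: sampled quality judgements, recorded as a histogram.
    quality = meter.create_histogram("agent.quality_score")

    def record_run(tokens: int, cost_usd: float, n_tools: int,
                   n_retries: int, model: str,
                   judged_score: float | None = None) -> None:
        attrs = {"gen_ai.request.model": model}
        token_counter.add(tokens, attrs)
        cost_counter.add(cost_usd, attrs)
        tool_calls.add(n_tools, attrs)
        retries.add(n_retries, attrs)
        if judged_score is not None:  # only the sampled runs get judged
            quality.record(judged_score, attrs)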

3. Drift detection

Compare today's output distribution to last week's. Look for shifts in length, tool-call frequency, refusal rate, and sample-judged quality. Alert on shifts, not absolutes.
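
A self-contained sketch of one such comparison, using the population stability index (PSI) over a numeric signal such as output length; the 0.2 threshold is a common rule of thumb, not a standard:

    import math

    def psi(baseline: list[float], current: list[float], bins: int = 10) -> float:
        # Population Stability Index between two samples of one signal.
        lo = min(min(baseline), min(current))
        hi = max(max(baseline), max(current))
        width = (hi - lo) / bins or 1.0  # guard against identical samples

        def dist(xs):
            counts = [0] * bins
            for x in xs:
                counts[min(int((x - lo) / width), bins - 1)] += 1
            # Laplace smoothing keeps the log term finite for empty buckets.
            return [(c + 1) / (len(xs) + bins) for c in counts]

        return sum((c - b) * math.log(c / b)
                   for b, c in zip(dist(baseline), dist(current)))

    last_week = [180, 210, 195, 240, 200, 188, 215, 205]  # output lengths
    today = [420, 390, 450, 410, 460, 430, 445, 400]
    if psi(last_week, today) > 0.2:  # alert on the shift, not the absolute
        print("drift alert: output-length distribution shifted")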

4. Privacy and redaction

Agent traces are PII goldmines. Redaction must happen at capture, not query, and you need a key-management story for the cases where you're forced to capture sensitive content.
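
Redaction at capture can be as simple as a single choke point that every content attribute passes through before it is set; the patterns below are illustrative and far from exhaustive:

    import re

    # Illustrative patterns only; production redaction needs real detectors.
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

    def redact(text: str) -> str:
        # The raw value is scrubbed before it ever leaves the process.
        return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

    def set_content_attribute(span, key: str, value: str) -> None:
        # Every prompt/completion attribute goes through this choke point.
        span.set_attribute(key, redact(value))

Where capturing sensitive content is unavoidable, encrypt the value with a per-tenant key at this same choke point, so deletion reduces to key destruction.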

Practical patterns

OpenTelemetry for LLMs

Use OTel semantic conventions for GenAI; emit spans your existing APM stack already understands.
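
For example, an LLM-call span carrying GenAI semantic-convention attributes; the conventions are still marked experimental, so pin the semconv version you emit, and the model name here is a placeholder:

    from opentelemetry import trace

    tracer = trace.get_tracer("llm-client")

    # Span name "{operation} {model}" and gen_ai.* attribute names follow
    # the OTel GenAI semantic conventions as of recent semconv releases.
    with tracer.start_as_current_span("chat example-model") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "example-model")
        span.set_attribute("gen_ai.usage.input_tokens", 1532)
        span.set_attribute("gen_ai.usage.output_tokens", 208)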

Sampled deep traces

Capture full content for ~1% of traces, headers + counts for the rest. Great for debugging without ballooning cost.
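
A sketch of a deterministic per-trace decision, so every span in a trace agrees on whether it carries content; span here is any object with a set_attribute method:

    import hashlib

    DEEP_SAMPLE_RATE = 0.01  # ~1% of traces carry full content

    def is_deep_sampled(trace_id: str) -> bool:
        # Hash the trace id so the decision is stable across services.
        h = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
        return h / 0xFFFFFFFF < DEEP_SAMPLE_RATE

    def capture(span, trace_id: str, prompt: str, completion: str) -> None:
        # Counts for every trace; content only for the deep sample.
        span.set_attribute("prompt.chars", len(prompt))
        span.set_attribute("completion.chars", len(completion))
        if is_deep_sampled(trace_id):
            span.set_attribute("prompt.content", prompt)
            span.set_attribute("completion.content", completion)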

Shadow eval pipeline

Replay a sample of production prompts through your eval suite continuously; alarm when scores drop.
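
A minimal loop body, assuming you supply run_agent and judge callables plus a rolling baseline score; the alarm fires on the delta from baseline, consistent with the drift advice above:

    import random
    import statistics

    def shadow_eval(prompts: list[str], run_agent, judge,
                    baseline_mean: float, sample_size: int = 50,
                    max_drop: float = 0.05) -> float:
        # Replay a random sample of production prompts through the evals.
        sample = random.sample(prompts, min(sample_size, len(prompts)))
        mean = statistics.mean(judge(p, run_agent(p)) for p in sample)
        if baseline_mean - mean > max_drop:
            # Wire this to your pager, not stdout, in production.
            print(f"shadow-eval alarm: {baseline_mean:.2f} -> {mean:.2f}")
        return mean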

Cost guardrails

Per-tenant, per-task cost ceilings enforced at the gateway level, so agent loops can't bankrupt you.
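
A gateway-side sketch; the ceilings and the in-memory dictionaries are illustrative stand-ins for whatever shared store your gateway uses. The point is that the check happens before the model call, outside the agent's own control loop:

    from collections import defaultdict

    class CostGuard:
        def __init__(self, task_ceiling_usd: float = 2.0,
                     tenant_ceiling_usd: float = 50.0):
            self.task_ceiling = task_ceiling_usd
            self.tenant_ceiling = tenant_ceiling_usd
            self.task_spend = defaultdict(float)    # (tenant, task) -> usd
            self.tenant_spend = defaultdict(float)  # tenant -> usd

        def charge(self, tenant: str, task: str, cost_usd: float) -> None:
            # Reject before the model call, so a loop stops within one step.
            if (self.task_spend[tenant, task] + cost_usd > self.task_ceiling
                    or self.tenant_spend[tenant] + cost_usd > self.tenant_ceiling):
                raise RuntimeError(f"cost ceiling reached: {tenant}/{task}")
            self.task_spend[tenant, task] += cost_usd
            self.tenant_spend[tenant] += cost_usd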

Pitfalls to avoid

  • Logging entire prompts and completions raw: PII exposure, IP leakage, and storage costs all spike.
  • Treating an agent run as a single span; you lose all the structure.
  • Alerting on absolute quality scores; thresholds drift, and deltas are more informative.
  • No correlation between traces and the prompt/agent version that produced them.

Key takeaways

  1. Adopt OTel GenAI conventions early; the ecosystem is converging.
  2. Capture structure, not just text. The tree of an agent run is the diagnostic.
  3. Redact at capture, sample wisely, and store with a retention policy.
  4. Watch for drift; absolute metrics lie.
