
Agent Observability at Internet Scale

What to log, how to trace multi-step agents, and how to detect drift in production.

Introduction

Agent usage is exploding, with unprecedented volumes of autonomous, dynamic, decision-making programs. They do extraordinary things; they also hallucinate, misunderstand, and occasionally cause real harm. Internet-scale agent observability isn't just APM with extra fields. It needs new primitives: trace shapes that match agent topologies, signals that capture semantic drift, and dashboards an on-call engineer can actually act on at 2am.

Why this matters

  • You can't debug what you can't see; agent behaviour is largely invisible without instrumentation.
  • Cost incidents (runaway loops, retry storms) hit faster than human reaction times.
  • Quality drift is silent: yesterday's working agent quietly becomes today's broken one.
  • Compliance and incident response require auditable traces of automated decisions.

Core concepts

1. Span semantics for agent runs

A single agent task is a tree, not a line: planning span, tool-call spans, sub-agent spans, retries. Each span carries token counts, latency, and (selectively) prompt + completion content.
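
A minimal sketch of that tree using the OpenTelemetry Python API (the tracer is a safe no-op if no SDK is configured; span and attribute names here are illustrative, not a fixed convention):

    from opentelemetry import trace

    tracer = trace.get_tracer("agent-demo")

    def run_task(task_id: str) -> None:
        # Root span: one agent task is one trace tree.
        with tracer.start_as_current_span("agent.task") as task:
            task.set_attribute("agent.task.id", task_id)

            # Planning span carries token counts for the planning call.
            with tracer.start_as_current_span("agent.plan") as plan:
                plan.set_attribute("gen_ai.usage.input_tokens", 412)
                plan.set_attribute("gen_ai.usage.output_tokens", 96)

            # Each tool call, and each retry, is its own child span.
            for attempt in (1, 2):
                with tracer.start_as_current_span("agent.tool_call") as tool:
                    tool.set_attribute("tool.name", "web_search")
                    tool.set_attribute("retry.attempt", attempt)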

2. Three signal types

Operational (latency, errors, tokens, cost). Behavioural (tool-call rate, retry rate, escalation rate). Semantic (output quality, factuality, sentiment, schema adherence).
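
One way to wire these up, sketched with the OpenTelemetry metrics API; instrument names are hypothetical, and semantic scores are usually sampled rather than recorded on every run:

    from opentelemetry import metrics

    meter = metrics.get_meter("agent-signals-demo")

    # Operational: raw resource usage per request.
    token_counter = meter.create_counter("agent.tokens", unit="{token}")
    cost_counter = meter.create_counter("agent.cost", unit="usd")

    # Behavioural: how the agent is acting, not what it costs.
    tool_calls = meter.create_counter("agent.tool_calls")
    retries = meter.create_counter("agent.retries")

    # Semantic: sampled quality judgements, recorded as a histogram.
    quality = meter.create_histogram("agent.quality_score")

    def record_run(tokens: int, cost_usd: float, n_tools: int,
                   n_retries: int, model: str,
                   judged_score: float | None = None) -> None:
        attrs = {"gen_ai.request.model": model}
        token_counter.add(tokens, attrs)
        cost_counter.add(cost_usd, attrs)
        tool_calls.add(n_tools, attrs)
        retries.add(n_retries, attrs)
        if judged_score is not None:  # only the sampled runs get judged
            quality.record(judged_score, attrs)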

3. Drift detection

Compare today's output distribution to last week's. Look for shifts in length, tool-call frequency, refusal rate, and sample-judged quality. Alert on shifts, not absolutes.
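
A self-contained sketch of one such comparison, using the population stability index (PSI) over a numeric signal such as output length; the 0.2 threshold is a common rule of thumb, not a standard:

    import math

    def psi(baseline: list[float], current: list[float], bins: int = 10) -> float:
        # Population Stability Index between two samples of one signal.
        lo = min(min(baseline), min(current))
        hi = max(max(baseline), max(current))
        width = (hi - lo) / bins or 1.0  # guard against identical samples

        def dist(xs):
            counts = [0] * bins
            for x in xs:
                counts[min(int((x - lo) / width), bins - 1)] += 1
            # Laplace smoothing keeps the log term finite for empty buckets.
            return [(c + 1) / (len(xs) + bins) for c in counts]

        return sum((c - b) * math.log(c / b)
                   for b, c in zip(dist(baseline), dist(current)))

    last_week = [180, 210, 195, 240, 200, 188, 215, 205]  # output lengths
    today = [420, 390, 450, 410, 460, 430, 445, 400]
    if psi(last_week, today) > 0.2:  # alert on the shift, not the absolute
        print("drift alert: output-length distribution shifted")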

4. Privacy and redaction

Agent traces are PII goldmines. Redaction must happen at capture, not query, and you need a key-management story for the cases where you're forced to capture sensitive content.
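
Redaction at capture can be as simple as a single choke point that every content attribute passes through before it is set; the patterns below are illustrative and far from exhaustive:

    import re

    # Illustrative patterns only; production redaction needs real detectors.
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

    def redact(text: str) -> str:
        # The raw value is scrubbed before it ever leaves the process.
        return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

    def set_content_attribute(span, key: str, value: str) -> None:
        # Every prompt/completion attribute goes through this choke point.
        span.set_attribute(key, redact(value))

Where capturing sensitive content is unavoidable, encrypt the value with a per-tenant key at this same choke point, so deletion reduces to key destruction.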

Practical patterns

OpenTelemetry for LLMs

Use OTel semantic conventions for GenAI; emit spans your existing APM stack already understands.
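
For example, an LLM-call span carrying GenAI semantic-convention attributes; the conventions are still marked experimental, so pin the semconv version you emit, and the model name here is a placeholder:

    from opentelemetry import trace

    tracer = trace.get_tracer("llm-client")

    # Span name "{operation} {model}" and gen_ai.* attribute names follow
    # the OTel GenAI semantic conventions as of recent semconv releases.
    with tracer.start_as_current_span("chat example-model") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "example-model")
        span.set_attribute("gen_ai.usage.input_tokens", 1532)
        span.set_attribute("gen_ai.usage.output_tokens", 208)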

Sampled deep traces

Capture full content for ~1% of traces, headers + counts for the rest. Great for debugging without ballooning cost.
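
A sketch of a deterministic per-trace decision, so every span in a trace agrees on whether it carries content; span here is any object with a set_attribute method:

    import hashlib

    DEEP_SAMPLE_RATE = 0.01  # ~1% of traces carry full content

    def is_deep_sampled(trace_id: str) -> bool:
        # Hash the trace id so the decision is stable across services.
        h = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
        return h / 0xFFFFFFFF < DEEP_SAMPLE_RATE

    def capture(span, trace_id: str, prompt: str, completion: str) -> None:
        # Counts for every trace; content only for the deep sample.
        span.set_attribute("prompt.chars", len(prompt))
        span.set_attribute("completion.chars", len(completion))
        if is_deep_sampled(trace_id):
            span.set_attribute("prompt.content", prompt)
            span.set_attribute("completion.content", completion)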

Shadow eval pipeline

Replay a sample of production prompts through your eval suite continuously; alarm when scores drop.
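
A minimal loop body, assuming you supply run_agent and judge callables plus a rolling baseline score; the alarm fires on the delta from baseline, consistent with the drift advice above:

    import random
    import statistics

    def shadow_eval(prompts: list[str], run_agent, judge,
                    baseline_mean: float, sample_size: int = 50,
                    max_drop: float = 0.05) -> float:
        # Replay a random sample of production prompts through the evals.
        sample = random.sample(prompts, min(sample_size, len(prompts)))
        mean = statistics.mean(judge(p, run_agent(p)) for p in sample)
        if baseline_mean - mean > max_drop:
            # Wire this to your pager, not stdout, in production.
            print(f"shadow-eval alarm: {baseline_mean:.2f} -> {mean:.2f}")
        return mean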

Cost guardrails

Per-tenant, per-task cost ceilings enforced at the gateway level, so agent loops can't bankrupt you.
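
A gateway-side sketch; the ceilings and the in-memory dictionaries are illustrative stand-ins for whatever shared store your gateway uses. The point is that the check happens before the model call, outside the agent's own control loop:

    from collections import defaultdict

    class CostGuard:
        def __init__(self, task_ceiling_usd: float = 2.0,
                     tenant_ceiling_usd: float = 50.0):
            self.task_ceiling = task_ceiling_usd
            self.tenant_ceiling = tenant_ceiling_usd
            self.task_spend = defaultdict(float)    # (tenant, task) -> usd
            self.tenant_spend = defaultdict(float)  # tenant -> usd

        def charge(self, tenant: str, task: str, cost_usd: float) -> None:
            # Reject before the model call, so a loop stops within one step.
            if (self.task_spend[tenant, task] + cost_usd > self.task_ceiling
                    or self.tenant_spend[tenant] + cost_usd > self.tenant_ceiling):
                raise RuntimeError(f"cost ceiling reached: {tenant}/{task}")
            self.task_spend[tenant, task] += cost_usd
            self.tenant_spend[tenant] += cost_usd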

Pitfalls to avoid

  • Logging entire prompts and completions raw: PII exposure, IP leakage, and storage costs all spike.
  • Treating an agent run as a single span; you lose all the structure.
  • Alerting on absolute quality scores; thresholds drift, and deltas are more informative.
  • No correlation between traces and the prompt/agent version that produced them.

Key takeaways

  1. Adopt OTel GenAI conventions early; the ecosystem is converging.
  2. Capture structure, not just text. The tree of an agent run is the diagnostic.
  3. Redact at capture, sample wisely, and store with a retention policy.
  4. Watch for drift; absolute metrics lie.
