Observability for AI Agents — Spans, Traces, Metrics
The Span-per-Tick Model
What is span-per-tick tracing?
Span-per-tick tracing is the practice of modelling every agent tick as a distributed-tracing span: one root span per task, one child span per tick, and one nested span per tool call inside the tick. The tree of those spans is the trace, and the trace is what you reach for when something breaks. The previous module made the harness survive crashes; this module makes the surviving run legible — durability without observability is a black box that recovers without telling you why it had to.
The shape is borrowed almost wholesale from the distributed-systems tracing world. An agent task is structurally a distributed system in miniature: one orchestrator (the harness), several stateless callees (the model API, each tool), and a request that fans out and joins back across them. The vocabulary that worked for microservices works here for the same reason — the same fan-out, the same need to follow a request across process boundaries.
Why distributed tracing maps cleanly onto agents
Distributed tracing was invented to debug requests that touch many services. A web request hits the API gateway, which calls auth, which calls user-service, which calls Postgres; if the request is slow, you want to know which hop ate the time. The answer is to thread a trace id through every hop, record a span per hop with start time, end time, and attributes, and visualize the tree.
An agent task has the same shape. The harness drives a tick loop; each tick calls the model, then dispatches one or more tool calls; each tool reaches out to its own downstream system. Threaded through it all is a single task id. Drop the same span machinery on top of the agent loop and you get the same payoff: which tick was slow, which tool failed, what the model saw at that tick, what it decided.
A reasonable mapping:
| Agent concept | Trace concept | Typical duration |
|---|---|---|
| Task (a multi-tick conversation) | Root span | Seconds to minutes |
| Tick (one model call + its tool calls) | Child span of the task | ~1–10 seconds |
| Model call | Child span of the tick | ~0.5–5 seconds |
| Tool call (Stripe, Postgres, search) | Child span of the tick | ~0.05–2 seconds |
| HTTP request inside a tool | Grandchild span | ~0.01–0.5 seconds |
The depth caps out around three or four levels in practice. Deeper trees usually mean the agent is calling other agents — Module 8 of this track deals with that case explicitly.
What goes on each span
OpenTelemetry's tracing spec is the lingua franca here, and most agent observability vendors export OTel-compatible spans because the ecosystem of collectors and storage backends already speaks it. The mechanical fields — trace id, span id, parent span id, start, end, status — are non-negotiable. The interesting part is the attribute set you attach to each span. A workable starting kit for tick spans:
- `agent.task_id` — the durable task identity from Module 1's checkpoint row.
- `agent.tick_number` — which tick this is in the task.
- `agent.model` — the exact model id (`claude-sonnet-4-5-20250929`, not just `claude`).
- `agent.tools_called` — list of tool names invoked.
- `agent.input_hash` / `agent.output_hash` — content-addressed hashes of the model input and output; the payloads themselves go to a separate blob store and are addressed by hash (Step 2 covers why).
- `agent.tokens_in` / `agent.tokens_out` — for cost attribution later, in Module 4.
- `status` — OK / ERROR / TIMEOUT.
Each child tool span gets its own narrower set: tool name, input hash, output hash, downstream status code, idempotency key (the same key from Module 1, Step 2, so a replay or retry can be matched to its original).
These agent.* names are an app-owned convention. OpenTelemetry's GenAI semantic conventions define a parallel gen_ai.* namespace — gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.operation.name — that most managed trace backends already index. Use the gen_ai.* keys for the model-call portion of each tick so vendor dashboards light up automatically, and keep the agent.* keys for the orchestrator-level fields (task id, tick number, idempotency key) that the OTel spec does not cover yet.
The attributes are the part you tune over time. The structure — root span, tick spans, tool spans, tied together by trace id — is the part you commit to early because it is expensive to change later.
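A minimal sketch of the emitting side, assuming the OpenTelemetry JS API and Node's crypto module; `runTick`, `callModel`, and the task shape are hypothetical harness pieces, not part of any library:

```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";
import { createHash } from "node:crypto";

const tracer = trace.getTracer("agent-harness");
const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Hypothetical stand-in for your model client.
declare function callModel(input: string): Promise<{
  model: string; text: string; tokensIn: number; tokensOut: number;
}>;

// Hypothetical harness hook: one tick of one task, wrapped in a span that
// carries the agent.* and gen_ai.* attributes described above.
async function runTick(task: { id: string; tick: number }, modelInput: string) {
  return tracer.startActiveSpan(`tick.${task.tick}`, async (span) => {
    try {
      span.setAttribute("agent.task_id", task.id);
      span.setAttribute("agent.tick_number", task.tick);
      span.setAttribute("agent.input_hash", sha256(modelInput));

      const out = await callModel(modelInput);

      // gen_ai.* keys follow the OTel GenAI semantic conventions, so managed
      // backends index them without custom configuration.
      span.setAttribute("gen_ai.request.model", out.model);
      span.setAttribute("gen_ai.usage.input_tokens", out.tokensIn);
      span.setAttribute("gen_ai.usage.output_tokens", out.tokensOut);
      span.setAttribute("agent.output_hash", sha256(out.text));

      span.setStatus({ code: SpanStatusCode.OK });
      return out;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      throw err;
    } finally {
      span.end();
    }
  });
}
```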
The trace IS the debugger
A durable harness without traces is a thing that survives crashes silently. Something went wrong at 03:14 UTC, the task ended in a failed state, and your only artifact is the final state row. You can guess what happened from the prompt and the last model response, but you cannot watch the run unfold.
With span-per-tick traces, the failed task is a tree you can walk. You see that tick 3 took 12 seconds when ticks 1 and 2 took 1.5 seconds each. You drill into tick 3 and see that the model call was 0.9 seconds (normal) and the lookup_order tool span was 11 seconds (not normal) and its child HTTP span was a timeout against the order service. You see that the model's next tick decided to retry, that retry succeeded, but the user-facing response that came out three ticks later cited the wrong order id because the retry returned a different order — and now you have the actual bug, not the symptom.
That hop from "the task failed" to "here is the exact sequence and the exact data each step saw" is the difference observability buys you. Step 3 of this module shows how to take that captured trace and replay it — re-run the same model call against the same recorded inputs, change one thing, watch the trace fork — which is the natural next move once you have spans worth replaying.
A concrete trace: task T-77
Continuing the customer-refund example from Module 1: a task T-77 asks the agent to refund order A-1234. The trace tree, sketched:
```
[task T-77] 5.2s OK
├─ [tick 1] model + lookup_customer 1.1s OK
├─ [tick 2] model + lookup_order 1.4s OK
├─ [tick 3] model + issue_refund 2.0s OK
│  ├─ model call (claude-sonnet-4-5) 0.6s OK
│  └─ tool: issue_refund 1.4s OK
│     └─ POST /v1/refunds (stripe) 1.3s 200
└─ [tick 4] model + send_email 0.7s OK
```
When something breaks at tick 3 — say the Stripe API returns a 503 — the tree carries the failure up:
```
[task T-77] 7.8s ERROR
├─ [tick 1] model + lookup_customer 1.1s OK
├─ [tick 2] model + lookup_order 1.4s OK
└─ [tick 3] model + issue_refund 5.3s ERROR
   ├─ model call (claude-sonnet-4-5) 0.6s OK
   └─ tool: issue_refund 4.7s ERROR
      ├─ POST /v1/refunds 1.2s 503
      ├─ POST /v1/refunds (retry 1) 1.1s 503
      └─ POST /v1/refunds (retry 2) 1.0s 503
```
The status propagates from the failing HTTP span up through the tool span, up through the tick span, up to the root. You see the retry attempts as siblings under the tool span (Module 1 Step 4's retry policy is what generated them). You see exactly where to look without reading a log file linearly.
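On the emitting side, that propagation is not automatic — OpenTelemetry does not roll status up the tree for you, so the harness marks the tool span (and, in turn, the tick span) as ERROR when the retry budget runs out. A sketch, with `issueRefund` as a hypothetical HTTP call and the span names mirroring the tree above:

```ts
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-harness");

// Hypothetical downstream call.
declare function issueRefund(input: unknown): Promise<{ status: number }>;

// Each attempt is its own child span, so retries show up as siblings under
// the tool span, exactly as in the sketched tree.
async function callToolWithRetries(name: string, input: unknown, maxAttempts = 3) {
  return tracer.startActiveSpan(`tool: ${name}`, async (toolSpan) => {
    try {
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const ok = await tracer.startActiveSpan(
          `POST /v1/refunds${attempt ? ` (retry ${attempt})` : ""}`,
          async (httpSpan) => {
            try {
              const res = await issueRefund(input);
              httpSpan.setAttribute("http.response.status_code", res.status);
              const success = res.status < 500;
              httpSpan.setStatus(success
                ? { code: SpanStatusCode.OK }
                : { code: SpanStatusCode.ERROR });
              return success;
            } catch (err) {
              httpSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
              return false; // treat transport errors like a 5xx
            } finally {
              httpSpan.end();
            }
          },
        );
        if (ok) {
          toolSpan.setStatus({ code: SpanStatusCode.OK });
          return;
        }
      }
      // Retry budget exhausted: mark the tool span ERROR; the harness does the
      // same for the tick span and the root, which is the propagation you see.
      toolSpan.setStatus({ code: SpanStatusCode.ERROR, message: "retries exhausted" });
      throw new Error(`${name} failed after ${maxAttempts} attempts`);
    } finally {
      toolSpan.end();
    }
  });
}
```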
Where Module 1 left off
A one-sentence recap: idempotency, checkpoints, and retry policy let the harness survive crashes; span-per-tick tracing lets you see inside the runs that survived. The two layers compose. The idempotency key from Module 1 Step 2 shows up as an attribute on the tool span here. The checkpoint write from Module 1 Step 3 is its own short span between ticks. The retry attempts from Module 1 Step 4 are sibling spans under the tool. Observability is not a separate stack from the harness — it is the same stack instrumented.
Where to put the trace store itself is a real choice. Vendor options — Honeycomb, Datadog APM, agent-specific products like LangSmith and Langfuse — all speak OTel and trade depth-of-features against price. Self-hosted options (Tempo, Jaeger) work, especially for teams that already run a tracing backend for their non-agent services. The choice mostly comes down to whether you want one trace store for everything you run or a specialized one for agent traces; both shapes work in practice.
Try it: Stage 1 of the right pane is the customer-refund trace for task T-77 — the same tree sketched above, rendered as a real waterfall. Press Inject error at tick 3 and watch the status propagate from the failing tool call up through the tick span and on to the root. That propagation is what makes a trace a debuggable artifact: you see the failure and the exact context it happened in without reading a single log line.
What to Log (and What Not To)
What should you log from an agent?
The per-tick fields that earn their place in the log: the model input (system prompt + history + the current user/tool messages), the model output (text and any tool calls it emitted), each tool call's inputs and outputs, latency for the tick and for each tool, and token counts (input and output). Everything else is either derivable from those or noise. Step 1 put the structural skeleton in place — spans tied together by trace id — and this step decides what data actually rides on those spans.
The good news: the list is short. The bad news: each item has a real "and what about when it's huge or sensitive?" footnote, which is most of what this step is about. Logging an agent naively — print(state) at every tick into a Postgres table — works fine for a demo and falls over hard the moment the system is real. The two failure modes are storage explosion and a PII incident, and both are avoidable with patterns that are well-known but easy to skip on a deadline.
How do you keep log size from exploding?
Agent conversations grow. By tick 10 of a tool-heavy task, the model input is the system prompt plus nine prior turns of messages and tool results. If you log the full input at every tick, you are writing roughly O(N^2) bytes for an N-tick task, because every tick re-logs every prior tick's content as part of its input. As a back-of-envelope, a 20-tick conversation with a ~4KB average per-tick payload can easily push tens of KB of duplicated text per tick into your log store, and the bill scales accordingly — exact numbers depend on your prompt and tool-result shapes, but the growth is quadratic in tick count no matter the constants.
The fix is the blob-hash pattern, borrowed wholesale from how content-addressed storage works elsewhere. The log row keeps a content hash, and the actual payload goes to cheap object storage:
```
log row:
  trace_id, span_id, tick, model_in_hash, model_out_hash,
  tool_calls = [{name, in_hash, out_hash, status, latency_ms}]

blob store:
  sha256("system prompt + 9 turns of history") -> the actual text
```
Three benefits fall out. First, the log row is small and structured — fast to query, cheap to retain. Second, the blob store deduplicates identical inputs across ticks and across tasks; the system prompt is stored once even if it appears in millions of tick inputs. Third, lifecycle gets easier — most teams in practice retain the small log rows for many months while expiring the heavy payload blobs on a shorter window (weeks, tuned to whatever your privacy and debugging policies require), and old traces still let you see that something happened even after the payloads have aged out.
A few notes from teams that have run this in production. The blob store should be content-addressed (the key is the hash) so writes are naturally idempotent — Module 1 Step 2's pattern, applied here. The hash should cover only the parts you actually need to reconstruct, not the timestamp or task id. And the read path — looking at a tick in the UI — has to fetch the blob, which adds a hop; a small in-memory or Redis cache in front of the blob store keeps the trace UI snappy.
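A sketch of that write path, assuming a hypothetical `BlobStore` interface over whatever object store you use; the log-row field names mirror the schema sketch above:

```ts
import { createHash } from "node:crypto";

// Hypothetical content-addressed blob store: the key is the hash, so writing
// the same payload twice is a no-op (idempotent, in the Module 1 Step 2 sense).
interface BlobStore {
  put(key: string, body: string): Promise<void>;
  get(key: string): Promise<string>;
}

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

async function storePayload(blobs: BlobStore, payload: string): Promise<string> {
  const key = sha256(payload);
  await blobs.put(key, payload); // overwrite-safe: same content, same key
  return key;                    // only the hash rides on the log row
}

// Build the small, structured log row for one tick.
async function logTick(blobs: BlobStore, tick: {
  traceId: string; spanId: string; n: number;
  modelIn: string; modelOut: string;
  toolCalls: { name: string; input: string; output: string; status: string; latencyMs: number }[];
}) {
  return {
    trace_id: tick.traceId,
    span_id: tick.spanId,
    tick: tick.n,
    model_in_hash: await storePayload(blobs, tick.modelIn),
    model_out_hash: await storePayload(blobs, tick.modelOut),
    tool_calls: await Promise.all(tick.toolCalls.map(async (t) => ({
      name: t.name,
      in_hash: await storePayload(blobs, t.input),
      out_hash: await storePayload(blobs, t.output),
      status: t.status,
      latency_ms: t.latencyMs,
    }))),
  };
}
```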
How do you handle PII?
This is the part that gets a team into the news if you skip it. Agent inputs and outputs are full of names, emails, phone numbers, payment details, address fragments, account numbers. By default, all of that lands in your trace store, which is touched by every engineer who debugs an issue, every dashboard that aggregates traces, every backup, and every analytics pipeline downstream.
The rule: redact at the logging boundary, not at read time. Once PII is in the store, every downstream tool that touches the store has inherited the obligation to protect it. Engineers reading the trace, the BI tool that summarizes errors by region, the embedding pipeline that scores traces for similarity — all of them now hold PII. Redact on the way in and you contain the surface to the agent runtime itself.
A workable starting kit covers the high-confidence classes:
- Emails — regex on the canonical pattern; replace with `<REDACTED:email>` or a deterministic hash if you need to correlate across ticks.
- Phone numbers — regex for the most common formats in your user base; international numbers are a longer tail you tackle by region.
- Government identifiers — SSN, national ID, tax id; regex per country.
- Payment instrument hints — strings that look like card PANs (Luhn-check before redacting to cut false positives), CVV-shaped numbers near the word "cvv", IBAN patterns.
- Authentication artifacts — anything that looks like an API key (high-entropy random strings), bearer tokens, OAuth codes.
Regexes alone are not perfect — a sentence like "the user said his email is jane at example dot com" sails past most patterns. The right escalation for teams that handle a lot of free-text PII is a model-based redactor as a second pass, accepting some latency and cost for higher recall. Common practice is to keep the regex as the floor (it is fast and catches the obvious) and run the model redactor on the spans flagged by the regex as likely PII-bearing.
Two things to log instead of the redacted payload, when correlation matters: a stable hash of the original value (so you can answer "did we see this email before?" without storing it), and the redaction's type (email, phone, payment_hint). The hash is enough to debug repeat-customer cases; the type lets dashboards count what is being redacted without exposing what was redacted.
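A regex-floor sketch of the boundary redactor. The email and card-PAN patterns are illustrative, and the `PEPPER` secret used for the correlation hash is an assumption — the point is the shape: replace in place, record the type, keep only a keyed hash for correlation:

```ts
import { createHash } from "node:crypto";

// Deterministic, non-reversible correlation hash: answers "did we see this
// email before?" without storing the value. PEPPER is a hypothetical secret
// that never lands in the trace store.
const PEPPER = process.env.REDACTION_PEPPER ?? "dev-only-pepper";
const correlate = (v: string) =>
  createHash("sha256").update(PEPPER + v.toLowerCase()).digest("hex").slice(0, 16);

// Luhn check to cut false positives on 13–19 digit runs.
function luhnValid(digits: string): boolean {
  let sum = 0, dbl = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = Number(digits[i]);
    if (dbl) { d *= 2; if (d > 9) d -= 9; }
    sum += d; dbl = !dbl;
  }
  return sum % 10 === 0;
}

export function redact(text: string): { text: string; redactions: { type: string; hash: string }[] } {
  const redactions: { type: string; hash: string }[] = [];

  // Emails: canonical pattern only; free-text obfuscations need the model pass.
  text = text.replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, (m) => {
    redactions.push({ type: "email", hash: correlate(m) });
    return "<REDACTED:email>";
  });

  // Card-PAN-shaped numbers, Luhn-checked before redacting.
  text = text.replace(/\b(?:\d[ -]?){13,19}\b/g, (m) => {
    const digits = m.replace(/\D/g, "");
    if (!luhnValid(digits)) return m;
    redactions.push({ type: "payment_hint", hash: correlate(digits) });
    return "<REDACTED:payment_hint>";
  });

  return { text, redactions };
}
```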
What about sampling?
Every trace your agent emits is a candidate write to the trace store. At low scale you keep all of them. At higher scale — a fleet handling thousands of tasks per minute — keeping every span is expensive, and the right move is to sample. The two main shapes:
Head-based sampling decides at the start of the trace whether to keep it, usually as a probability roll (e.g. 1-in-10 traces). Cheap to implement, easy to reason about; the downside is you make the keep-or-drop decision before you know whether the trace failed.
Tail-based sampling buffers spans for the duration of the trace, then decides at the end based on what happened. The pragmatic policy that most teams converge on: keep every trace that ended in a failure, keep a sampled fraction of successful traces, and within the successful sample, bias toward the slow or expensive ones. The implementation is harder — you need a buffer that holds a trace's spans until the root closes — but the data you get is dramatically more useful, because the trace store is now denser in the cases you actually want to debug.
OpenTelemetry's sampling concepts cover both modes. Vendor-side, most managed trace stores expose a tail-sampling layer that you configure rather than implement.
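The decision itself is small once the buffering is in place. A sketch of the keep-or-drop call a tail sampler makes when a trace's root span closes; the `TraceSummary` shape and the rate knobs are placeholders:

```ts
// Hypothetical summary of a completed trace, assembled once the root span closes.
interface TraceSummary {
  traceId: string;
  status: "OK" | "ERROR";
  durationMs: number;
  costUsd: number;
}

interface TailSamplingPolicy {
  keepAllErrors: boolean;   // every failed trace is kept
  successKeepRate: number;  // baseline sample of healthy traffic, e.g. 0.05
  slowThresholdMs: number;  // above this, bias the keep decision upward
  slowKeepRate: number;     // e.g. 0.5
}

export function shouldKeep(t: TraceSummary, p: TailSamplingPolicy): boolean {
  if (p.keepAllErrors && t.status === "ERROR") return true;
  const rate = t.durationMs > p.slowThresholdMs ? p.slowKeepRate : p.successKeepRate;
  return Math.random() < rate;
}
```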
What should you NOT log?
Three categories to keep out of the trace store entirely, even hashed.
Live credentials. Bearer tokens, API keys, OAuth secrets, session cookies. Hashes do not help here — if an attacker has the hash and a guessable rotation policy, they have leverage they would not otherwise have. Strip these at the boundary, do not record them.
Full PII when you have a redacted version. This is the inverse of the redaction rule: once the boundary has produced a redacted payload, the unredacted one should not exist anywhere in the trace pipeline. The temptation to "log the raw version for debugging, redact for display" is the temptation that gets a team paged.
Vector embeddings inline. A single embedding is 4–8 KB of floats; a tick with a few retrieved documents pulls in tens of KB of embeddings if you log them directly. They are also not human-debuggable. If you need to log "the agent retrieved doc id 42", log the id and the metadata — not the 1024-dimensional float vector.
A useful adjacent recap from Module 1: the idempotency key belongs in the log as an attribute on the tool span. It is the join key that lets you trace one logical refund across the original attempt, the retry, and the resumed run after a crash. What does not belong in the log is the customer payload that uniquely identifies the user — log the key, store the payload separately under the redaction rules above.
How does this fit with the offline evals you already have?
Track A Module 7 covered offline evals: pre-shipped golden cases scored against a candidate version. Production logging is the source data for the next generation of those golden cases — every failed trace you keep is a candidate to be cleaned up, redacted, and folded back into the offline set. The two layers are complements, not duplicates: offline evals catch regressions before users see them; production logs catch the failures the offline set did not predict. Module 5 of this track closes that loop explicitly.
Try it: Stage 2 of the right pane is a single tick's log entry, rendered as the structured row plus the underlying blob. Toggle PII redaction on and watch the email and order id swap to <REDACTED:email> and <REDACTED:order_id>. Toggle large blob → hash and watch the inline tool result shrink from ~18 KB to an 84-byte hash pointer, with the original content moved to a separate addressable store. The choices here are the choices you make in code.
Trace Replay
What is trace replay?
Trace replay is a debugger built on the trace store: given a failing task's recorded trace, re-run the same model call with the same captured inputs against a side environment, optionally change one thing — the model id, the system prompt, the temperature — and watch the new trace fork from the old one. Steps 1 and 2 built the trace store; this step uses it. The captured trace was an artifact; replay turns it into an experiment.
The reason this is interesting for agents specifically is that the alternative — "re-run the failing task and see if it fails again" — is unreliable in ways unit tests are not. Agent runs are stochastic. The model API, even at temperature 0, can return different tokens between two calls because of routing changes, prompt-cache warming, and minor SDK or backend updates. Re-running a failing task naively gives you a different run, possibly successful, and you have learned nothing about the original failure.
Why is replay different from re-running?
A unit test re-runs deterministic code against fixed inputs. A failed agent task is the output of stochastic code (the model) against inputs that have themselves drifted (tool results depend on the world; the world has moved on). Re-running the task without pinning either side gives you a different sample from the same distribution — sometimes informative, often not.
Replay pins both sides. The captured trace already has:
- The exact model input at the failing tick — system prompt, every prior message, every prior tool result, in the same byte-for-byte order the original run saw.
- The exact model id and parameters the original call used.
- The exact tool results that the model had seen up to that point.
Replay reissues the model call with that input and those parameters against a side gateway. Because you have pinned everything the original run saw, you get something much closer to a deterministic reproduction: the model's stochasticity is the only remaining axis, and at temperature 0 against the same model build, that axis is narrow enough to be useful. You can now vary one thing — swap the model id, edit the system prompt, change a single user message — and attribute the difference in the resulting trace to that change.
What do the mechanics actually look like?
The shape of a replay call:
```ts
const original = await traceStore.getTrace("T-77");
const failingTick = original.spans.find(s => s.name === "tick.3");

const replay = await replayClient.run({
  trace: original,
  upToTick: 3,                             // run prefix from logged tool results
  override: { model: "claude-opus-4-7" },  // try a stronger model on the failing tick
  toolMode: "mock-from-trace",             // reuse the captured tool results, do not call real tools
});
```
Three pieces are doing work. upToTick says how far into the trace to replay: from tick 1 if you want to re-run the whole task, from the failing tick if you want to isolate the model call that broke. override is the one thing you change — model, prompt, parameter — and the replay UI is built around making that change cheap. toolMode decides whether tool calls hit the real downstream systems (almost never what you want) or get mocked from the captured tool results in the trace.
The captured trace itself is the system of record. The blob-hash pattern from Step 2 makes this work: the model input at tick 3 is addressed by hash, and replay reconstitutes it from the blob store. The idempotency keys from Module 1 Step 2 mean that even if a replay accidentally hits a real tool, the downstream system deduplicates against the original call — replay-as-experiment is an order of magnitude safer in a harness that already has idempotency in place.
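Reissuing the failing tick is then mostly plumbing: fetch the pinned input by its hash, apply the override, call the model. A sketch assuming the hypothetical blob store from Step 2 and the Anthropic TypeScript SDK for the model call; the JSON shape of the stored input is an assumption:

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Hypothetical: the tick span's agent.input_hash points at a blob holding the
// exact system prompt and message array the original run sent, as JSON.
async function replayTick(
  blobs: { get(key: string): Promise<string> },
  tickSpan: { attributes: Record<string, string> },
  override: { model?: string; system?: string } = {},
) {
  const pinned = JSON.parse(await blobs.get(tickSpan.attributes["agent.input_hash"]));

  return anthropic.messages.create({
    model: override.model ?? tickSpan.attributes["gen_ai.request.model"],
    system: override.system ?? pinned.system,
    messages: pinned.messages, // byte-for-byte what the original tick saw
    max_tokens: 1024,
    temperature: 0,            // narrow the remaining stochasticity
  });
}
```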
What can you use it for?
Two main shapes show up.
Incident response. A failed task came in from support. You pull the trace, replay it, see the failure reproduce. Now you have a hypothesis — "the model is misinterpreting the order id format" — and replay lets you test the hypothesis in seconds: swap the system prompt to clarify the format, replay, watch the trace succeed. The same trace replayed against the same model produces (close to) the same failure; against the edited prompt produces a different decision; the difference attributes cleanly to the edit. Module 7 of this track frames the broader incident-response flow that wraps around this primitive.
Prompt and model A/B. You have a candidate prompt change. Before shipping it, replay a batch of recent traces against both the old and the new prompt and compare the fork rates: how often does the new prompt take the same action, how often does it diverge, and when it diverges, which version was right. This is offline eval in disguise — Module 5 of this track formalizes it as a production-evals workflow on top of the replay primitive.
Anthropic's Building Effective Agents frames agents as systems whose internal decisions are observable in principle — replay is what makes that "in principle" practical, because you can interrogate any decision the agent made by re-running it.
Side cluster vs production
The non-negotiable rule: never replay against production tools. Replay is the operation that lets you run an experiment on yesterday's failed task. If "the experiment" calls the real Stripe API, the customer gets refunded again. If it calls the real email tool, the user gets a second copy of last week's notification. The dollar cost is bad; the trust cost — sending real users real artifacts from a debugging session — is worse.
Two practical setups keep replay safe:
Mock tools from the captured trace. The replay run reads tool results from the trace store instead of calling real tools. Cheapest setup, works for the majority of cases (incident reproduction, prompt experiments), and has zero risk of side-effecting a real downstream. The model call still hits a real provider — there is no useful way to mock a model — but tool calls are pinned to their captured results.
A sandboxed tool gateway. A separate set of credentials and endpoints for the same downstream systems, pointed at test accounts. Stripe test mode, an email provider sandbox, a side Postgres with seeded data. More work to set up, and only worth it if you are doing replay-with-fresh-tools (e.g. testing how the agent reacts to different tool outputs than the original run saw). Most teams pick the mocked path first and reach for the sandbox only when they hit its limits.
Either way, replay traffic should be tagged so it never lands in your production metrics. The dashboards from Step 4 should filter out replay traces by default — a successful replay should not look like a successful real user run, or your success-rate dashboard quietly inflates.
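A sketch of what `toolMode: "mock-from-trace"` means mechanically — the replay's tool dispatcher answers from the captured spans and refuses to touch anything real. The lookup key (tool name plus input hash) and the record shape are assumptions:

```ts
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Hypothetical captured tool span, as stored by Step 2's logging.
interface ToolSpanRecord {
  toolName: string;
  inputHash: string;
  outputHash: string;
}

// Replay-side dispatcher: answers tool calls from the trace, never from prod.
function mockToolsFromTrace(
  spans: ToolSpanRecord[],
  blobs: { get(key: string): Promise<string> },
) {
  return async (toolName: string, input: string): Promise<string> => {
    const hit = spans.find(
      (s) => s.toolName === toolName && s.inputHash === sha256(input),
    );
    if (!hit) {
      // The model diverged and asked for something the original run never did.
      // Fail loudly rather than silently hitting a real downstream.
      throw new Error(`no captured result for ${toolName}; refusing to call production`);
    }
    return blobs.get(hit.outputHash);
  };
}
```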
Where replay runs out of road
Replay is powerful but it has real limits, and being honest about them avoids the false-confidence failure mode where a successful replay reassures you about a bug that is not actually fixed.
Tool side effects you cannot mock. Some tool outputs depend on the live state of the world: "what is the user's current calendar availability", "what is the current account balance". A captured trace from yesterday has yesterday's availability. Replaying with that captured tool result tests how the model would have decided given that data; it does not tell you how it will decide today against the live calendar. The replay is useful for prompt-engineering work; it is not a substitute for shipping the candidate behind shadow traffic (Module 5).
Data drift. The order database has moved on. The customer record has been updated. If the replay's hypothesis involves the model interacting with the world as it is now, captured-from-trace tool results will mislead you. The fix is to be deliberate about which axis you are testing — model behavior on captured inputs, or model behavior on fresh inputs — and pick the replay mode accordingly.
Model-provider non-determinism. Even with temperature pinned to zero and the same model build, providers reserve the right to route differently or update the model under the hood. Anthropic and OpenAI both pin model behavior more tightly than they did a couple of years ago, but the floor is not zero. In practice, a replay that succeeded once and fails once is usually telling you the prompt is on the edge of the model's competence — which is itself useful signal, but distinct from "the bug is fixed."
The honest reading: replay is the highest-leverage debugging tool the trace store unlocks, and most failed agent tasks are tractable through it. It is not a deterministic test. It is a faster way to form and discard hypotheses.
Try it: Stage 3 of the right pane is the same failed customer-refund trace from Step 1. Click Replay (same model, same prompt) — the failure reproduces against the captured inputs. Click Swap model to claude-opus-4-7 or Edit system prompt — the replay forks and succeeds. The point: ten seconds of experimentation per hypothesis, against a real captured run.
Metrics That Matter
Which five metrics define agent health?
The five that earn the top of the dashboard: success rate on a held-out eval set, p95 ticks-per-task, p95 latency-per-task, dollars-per-task, and policy-filter block rate. Almost every dashboard you actually need points at one of those, and almost every dashboard you do not need points at something else. Steps 1–3 gave you spans, log data, and the ability to replay any captured run; this step decides which numbers, derived from those spans, you stare at when nothing is on fire.
The hard part is not adding metrics — every observability vendor will happily compute hundreds for you. The hard part is committing to a short list that defines "the agent is healthy" and letting everything else be supporting evidence. Teams that skip this step end up with 40 dashboards and an on-call rotation that does not know which one to look at first.
Why "vanity" metrics mislead
A vanity metric is one that looks healthy in normal conditions and stays looking healthy when something is going wrong. They are easy to dashboard — they come straight out of any HTTP or LLM SDK — and easy to feel proud of, which is the trap.
- Raw QPS to the agent endpoint. It tells you traffic is arriving. It does not tell you the agent is doing the job correctly.
- Tokens per second out of the model. It tells you the model is producing text. It does not tell you the text is right, or that the agent stopped at the right moment.
- Average latency. A 50/50 mix of 200 ms responses and 30 s responses has the same average as a steady stream of 15 s responses — and the user experience of those two worlds is wildly different.
- Mean tool-call duration. The fast tools dilute the slow ones. The slow one is the bug.
The diagnostic case to keep in mind: a model-quality regression silently doubles the number of ticks the agent takes to finish a task (because it keeps getting things wrong and the retry loop from Module 1 Step 4 cleans up after it), which doubles cost-per-task, and lowers user-perceived success rate. Vanity metrics during this regression: QPS unchanged, tokens-per-second slightly up. The agent looks fine. The actual situation: success rate dropped, cost is up, and the only reason latency does not also scream is because the per-tick latency of the cleanup retries is fast. Module 1 Step 4 said retries amplify outages if you are not careful; this is what "not careful" looks like at the metrics layer.
Why per-task aggregation beats per-call
An agent's unit of work is a task, not a model call. A task can take one tick or twenty. Aggregating at the per-call level treats those twenty ticks as twenty equally-weighted observations and dilutes the signal from the few important ones (the failing task that took twenty ticks counts the same as twenty cheap successful ticks from twenty different healthy tasks).
The fix is mechanical: every metric the dashboard cares about is computed per task. p95 ticks-per-task. p95 latency-per-task. dollars-per-task. success-per-task. The tick spans from Step 1 give you a clean way to do this — the root span knows when the task started, when it ended, how many child ticks it had, and the cost rolls up from the per-tick model-call attributes.
A useful side effect: per-task metrics line up directly with the offline eval framing from Track A Module 7. The eval set scores success per task. The production dashboard scores success per task. Same unit of analysis, and Module 5 of this track wires them together explicitly.
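A sketch of that rollup, assuming the root spans and their tick children have already been joined into a small per-task summary; the percentile helper is the usual nearest-rank cut:

```ts
// Hypothetical per-task summary, rolled up from the root span and its ticks.
interface TaskSummary {
  taskId: string;
  ticks: number;
  durationMs: number;
  costUsd: number;
  success: boolean;
}

function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

export function goldenMetrics(tasks: TaskSummary[]) {
  return {
    successRate: tasks.filter((t) => t.success).length / tasks.length,
    p95TicksPerTask: percentile(tasks.map((t) => t.ticks), 95),
    p95LatencyMsPerTask: percentile(tasks.map((t) => t.durationMs), 95),
    dollarsPerTask: tasks.reduce((s, t) => s + t.costUsd, 0) / tasks.length,
    // The policy-filter block rate rolls up from Module 3's filter spans, not shown here.
  };
}
```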
Why p95 and p99, never just the mean
Means hide tail problems, and tail problems are most of what users complain about. A typical agent latency distribution is bimodal-ish — a fat cluster of fast tasks that finished in two or three ticks, a long tail of slow tasks that hit a retry or wandered for ten ticks. The mean sits in the middle of nowhere; the p95 is on the slow shoulder where the real users are; the p99 is the worst-case behavior an oncall engineer wants to know about.
Pair every latency-shaped metric with at least p95, and prefer p95 over the mean when there is a single number on the dashboard. Google's SRE Workbook on alerting on SLOs makes the case for percentile-based service-level objectives over mean-based ones, and the same argument applies at the agent layer with one additional twist: agents have multi-tick variance baked in, so the tail is wider than for a normal request-response service.
What "five golden metrics" looks like
A concrete dashboard, with realistic anchor numbers from the kind of customer-service agent we have been using as the running example. The numbers are illustrative and team-specific in reality — use them to calibrate the shape of the dashboard, not the absolute targets.
| Metric | Healthy baseline | Pages you when |
|---|---|---|
| Success rate (held-out evals, replayed daily) | ~94% | Drops below ~88% |
| p95 ticks per task | ~4 | Climbs above ~7 |
| p95 latency per task | ~7 s | Climbs above ~14 s |
| Dollars per task | ~$0.04 | Climbs above ~$0.08 |
| Policy-filter block rate | ~0.4% | Climbs above ~1.5% |
Two things to notice. First, "healthy baseline" is per-team — your numbers come from running the agent for a week and reading the dashboard, not from a textbook. Second, every threshold above is an alerting threshold, not a target. The target is the baseline; the threshold is when something is wrong enough to interrupt a human. Step 5 of this module goes deep on alerting; here the point is just that the dashboard contains the right five numbers to put an alert on in the first place.
A worked regression example, using the same anchors. A bad model deploy lands at T = 10 h. Over the next hour:
- Success rate drops 94% → 78%. The model is making more wrong calls; the retry loop catches some but not all.
- p95 ticks-per-task climbs 4 → 9. The agent is retrying more, looping more, and burning ticks on recovery.
- p95 latency-per-task climbs 7 s → 18 s. Direct consequence of more ticks per task.
- Dollars-per-task climbs $0.04 → $0.11. Direct consequence of more model calls per task.
- Policy-filter block rate climbs 0.4% → 1.6%. The worse model produces more output that trips the safety filters from Module 3.
All five golden metrics scream. The vanity dashboard — QPS, tokens-per-second, average tool latency — shows roughly nothing. This is what "the metric exists because retries hide cost" means in practice: without the golden five, the only signal you would have is the slow tide of customer complaints arriving over the next day.
How does this map onto SRE-style SLOs?
If your team already speaks the SLO vocabulary from the Google SRE Book, the mapping is direct: success-rate and policy-block-rate are availability-shaped indicators; the percentile latency and ticks-per-task are latency-shaped indicators; dollars-per-task is the agent-specific addition that does not have a clean SRE-book analog because traditional services do not have a per-request marginal cost the way an LLM call does. Anthropic's Building Effective Agents makes a related point that agent KPIs need to fold cost into the definition of "working" — a successful but $5/task agent is not actually working for most product economics.
The "first dashboard" pattern
Pick the five golden metrics, put them at the top of the dashboard, and let nothing else share that real estate. Sub-dashboards beneath can break down by route, by tool, by model, by user cohort — useful for diagnosis once you know something is wrong. But the top dashboard answers exactly one question: is the agent healthy? Five numbers, baselines underneath each, percentiles for the latency-shaped ones. That dashboard is what an oncall engineer looks at first when paged, and is the input to Step 5's alerting policy.
Try it: Stage 4 of the right pane is two dashboards side by side — the vanity dashboard on the left, the five-golden-metrics dashboard on the right. Drag the timeline past T=10 h, when a bad model deploys, and watch which dashboard shows the regression. Vanity barely flinches; the golden five all scream. The dashboard you alert on is the one on the right.
Alerting on Agent Behavior
How should you alert on agent behavior?
Agents are stochastic and bursty. Their per-task metrics jitter for honest reasons — a slightly harder mix of incoming questions, a brief tool-API slowdown, the weekly cohort whose accents tend to push the model into longer conversations. A pure absolute-threshold alerting policy fires on every one of those, trains the on-call to silence pages, and then misses the real regression when it arrives. A pure shape-change policy catches slow drift but takes longer to notice a sharp, narrow spike. The honest answer is both shapes, each on the metrics where it earns its keep — and a routing layer that decides whether a given anomaly pages a human, files a ticket, or just colors a dashboard.
This is the last step of the observability module, and it pulls the previous four together. Step 1 gave you spans. Step 2 gave you log data and redaction. Step 3 gave you the replay primitive. Step 4 picked the five metrics. Step 5 is where those metrics start interrupting humans — which is the most expensive thing observability can do, so it pays to be deliberate about when.
The two alert classes
Absolute-threshold alerts fire when a metric crosses a fixed line. "p95 ticks-per-task > 7." "Success rate < 88%." "Dollars-per-task > $0.08." Simple to reason about, simple to explain to a new oncall engineer, and they fire reliably when something genuinely sudden goes wrong (a bad deploy, an API outage). The weakness is slow drift — a 1% degradation per day for a week never trips the threshold, but the cumulative seven-day drop is real and user-visible.
Shape-change alerts fire when a metric deviates from a known-good reference baseline by a statistical margin. "p95 ticks-per-task is 3× the reference baseline." "Dollars-per-task is 2.5σ above the reference mean." Teams pick the baseline window from a healthy stretch of past traffic — Stage 5 of the right pane uses an early window from the run as a stand-in for that frozen reference. These catch drift because the baseline does not chase the metric: a steady 1%-per-day slide eventually clears the σ band whether or not it ever crosses any absolute number. (A truly rolling baseline that re-fits over a recent window has the opposite failure mode — it drifts with the metric and masks slow degradation, which is exactly the regression you want shape alerts to catch.) The weakness is the inverse: real steady-state shifts (Black Friday traffic, a viral mention) look like shape changes too, and the on-call gets paged for things that are not actually broken.
Pair them. Absolute thresholds catch the cliff. Shape changes catch the slope. Each one's weakness is the other's strength.
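The two detectors are a few lines each once the reference baseline exists. A sketch, with the frozen baseline (mean and σ from a known-good window) assumed to be computed elsewhere and the sustain-window bookkeeping elided:

```ts
interface ReferenceBaseline {
  mean: number;   // from a frozen known-good window, not a rolling one
  stddev: number;
}

// Absolute threshold: catches the cliff (bad deploy, downstream outage).
function absoluteBreach(current: number, threshold: number, above = true): boolean {
  return above ? current > threshold : current < threshold;
}

// Shape change: catches the slope — drift that never crosses the absolute line.
function shapeBreach(current: number, ref: ReferenceBaseline, sigmas = 2.5): boolean {
  return Math.abs(current - ref.mean) > sigmas * ref.stddev;
}
```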
Page vs ticket vs dashboard
Not every anomaly deserves to wake someone up. The discipline that prevents alert fatigue is routing the alert by user impact, not by metric importance.
- Page — interrupts a human in real time. Reserve for things the user can already feel: success-rate drop, p95 latency spike, policy-filter block-rate spike (your safety net is over-triggering and the user is being told no when they should be told yes). The implicit contract is "this is bad enough to be worth a 3 a.m. wake-up."
- Ticket — files a JIRA / Linear / GitHub issue for tomorrow morning. The right destination for slow-drift anomalies caught by shape-change alerts. The drift is real but the user is not yet visibly impacted, so the response can wait until business hours.
- Dashboard — just colors the panel red. The right destination for anomalies that are noisy by nature (a small per-route success-rate dip is noise on a small route, signal on a big one), and the on-call engineer should see them only when also paged for something else.
The split prevents the most common failure mode: an oncall engineer who is paged repeatedly through the week for things that turn out to be fine, and who has now learned to silence the page reflexively. When the real page arrives, it is silenced along with the others. Google's SRE Workbook on alerting on SLOs is the canonical reference for this discipline; the agent layer inherits it cleanly because the underlying argument — alerts cost human attention, attention is finite — does not depend on the workload.
Alert on the golden, not the vanity
Step 4 picked the five metrics that define "healthy." The same five are the ones that earn alerts. Alerting on QPS gets you paged for a successful product launch. Alerting on tokens-per-second gets you paged for a model upgrade that produces slightly longer responses for honest reasons. The metrics that signal "the agent stopped doing the job" are the ones that should signal "wake someone up."
A workable starting policy:
| Metric | Absolute (page) | Shape (ticket) |
|---|---|---|
| Success rate | < ~88% sustained 10 min | > ~5 pp below reference baseline |
| p95 ticks per task | > ~7 sustained 10 min | > 2× reference baseline for 1 h |
| p95 latency per task | > ~12 s sustained 10 min | > 1.5× reference baseline for 1 h |
| Dollars per task | > ~$0.08 sustained 30 min | > 1.5× reference baseline for 4 h |
| Policy-filter block rate | > ~3% sustained 10 min | > 2× reference baseline for 1 h |
The exact numbers are placeholders — your team's healthy baseline (Step 4) is the input. The structure is the point: every golden metric has an absolute that pages and a shape-change that files a ticket, the page is gated on a sustain window so a 30-second blip does not interrupt anyone, and the ticket is gated on a longer window because slow drift earns slower response.
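The same structure reads naturally as config. A placeholder sketch covering a few of the metrics — the field names and values are stand-ins for your own baselines, and the remaining golden metrics follow the same absolute-pages / shape-tickets pairing:

```ts
type Destination = "page" | "ticket" | "dashboard";

interface AlertRule {
  metric: string;
  kind: "absolute" | "shape";
  threshold?: number;            // absolute rules: fixed line
  direction?: "above" | "below";
  baselineMultiple?: number;     // shape rules: multiple of the frozen reference
  sustainMinutes: number;        // how long the breach must hold before firing
  routeTo: Destination;
}

const policy: AlertRule[] = [
  { metric: "success_rate", kind: "absolute", threshold: 0.88, direction: "below",
    sustainMinutes: 10, routeTo: "page" },
  { metric: "p95_ticks_per_task", kind: "absolute", threshold: 7, direction: "above",
    sustainMinutes: 10, routeTo: "page" },
  { metric: "p95_ticks_per_task", kind: "shape", baselineMultiple: 2,
    sustainMinutes: 60, routeTo: "ticket" },
  { metric: "dollars_per_task", kind: "shape", baselineMultiple: 1.5,
    sustainMinutes: 240, routeTo: "ticket" },
];
```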
The on-call exhaustion failure mode
The temptation, when an alert misses something, is to tighten its threshold. "Ticks > 7 missed the regression last week, set it to 6." A month of tightening later, the on-call is being paged every other day for things that turn out to be normal, page fatigue sets in, and the real next incident is silenced.
The fix is rarely "tighter thresholds." The fixes that work:
- Add a shape-based alert for the same metric. The thing that missed the regression was usually a slow drift, and shape-based alerts catch slow drift by construction.
- Lengthen the sustain window on the absolute threshold. A 10-minute window catches a real regression and ignores a 90-second blip; a 60-minute window is fine for cost-shaped alerts where the response can be slower.
- Split by cohort. A success-rate dip that is invisible on the global metric can be loud on a per-route or per-customer-segment metric. Alert on the cohort metric instead of the global mean, and the signal is preserved without making the global metric noisier.
The honest target is not "no missed incidents." It is "few enough pages per week that an oncall engineer reads each one." The two are correlated — an alerting policy the on-call respects is one that catches real problems quickly.
Where the three primitives from Module 1 land in this picture
A brief recap to close the durability ↔ visibility loop. Module 1's three primitives — idempotency, checkpoints, retry policy — are what keep the harness running when individual processes fail. Observability is how you tell whether those primitives are doing their job or papering over a quality regression. Specifically:
- Idempotency keys show up as attributes on tool spans (Step 1). A jump in the rate of "deduped against original" responses is a signal that the harness is retrying more than it used to, which is itself a signal worth alerting on.
- Checkpoint writes are spans between ticks (Step 1). A jump in their latency or error rate is a hint that the durable storage layer itself is degrading, which deserves a separate alert.
- Retry attempts are sibling spans under tool spans (Step 1). A jump in the ratio of retries-to-successes is the classic "model-quality regression is hiding inside the retry budget" signal — which is exactly the regression Step 4 showed the golden five catching when the vanity dashboard missed it.
Without observability, you cannot tell whether retries are recovering blips (the system is healthy) or papering over a regression (the system is silently expensive and worse). The metrics-and-alerts layer is what makes the durability primitives accountable.
Where Module 2 ends and Module 3 begins
The three layers of a production agent so far: durability (Module 1) lets the agent survive process kills, network failures, and partial writes; visibility (Module 2, this module) lets you see inside the runs that survived; safety (Module 3, Layered Guardrails) prevents the failure from being shown to the user in the first place. Each layer assumes the one before it: guardrails without observability are unaccountable; observability without durability is a debugger on a black box that keeps disappearing.
Module 3 picks up the thread by asking the next question: when the model is about to do something the product cannot allow — emit PII, take a destructive action a user with this role cannot take, produce output that fails a policy filter — what is the structural defense, and how do you compose multiple defenses so no single filter is the whole story.
Try it: Stage 5 of the right pane runs two scenarios — a Sudden spike in p95 ticks-per-task (a deploy regression) and a Slow drift (gradual model degradation over days). Watch when each threshold (absolute vs shape) fires under each scenario. The sudden spike is what the absolute threshold is for; the slow drift is the one where shape-based alerts buy you the hours you need to debug before the user notices.