Incident Handling for AI Agents
The First 15 Minutes
What is the first-15-minutes playbook?
The first-15-minutes playbook is a fixed sequence of actions an on-call engineer executes the moment an agent-fleet alert fires: acknowledge the page, confirm scope, stop the bleed, capture a failing trace, post a status update, then — only then — start investigating root cause. The whole point of fixing the order is to take the highest-leverage action (stop-the-bleed) while the engineer's adrenaline is highest and judgment is lowest, before curiosity ("why is this happening?") competes with damage control ("how do I make this stop?"). Google's SRE workbook on incident response makes the same mitigation-before-root-cause point in classical distributed systems; this module adapts it to agent fleets, where the bleed-stop primitive is the rollback from Module 6.
Module 6 ended at rollback discipline — one command, sub-second, drilled monthly. This module begins where that one ended: the rollback is fired during an incident, and the surrounding choreography (page → ack → scope → rollback → trace → status → investigate) is what determines whether the rollback's sub-second latency translates into a sub-2-minute user-impact window or a sub-20-minute one. The deploy-side primitives are necessary; the incident-handling discipline is what makes them sufficient.
Why bleed-stop comes before investigation
A regression that fires at 14:03 UTC is, by the time the on-call is paged at 14:05, already burning users at some rate. Call that rate R (users-affected-per-minute). The total damage is R × T, where T is the time between the regression starting and the on-call hitting the rollback. Two choices in that minute determine T:
- Bleed-stop-first. Page → ack → 30-second scope check → fire rollback. T ≈ 4 minutes. Damage = 4R.
- Investigate-first. Page → ack → 5 minutes of trace-reading → "oh, it's the refund tool" → fire rollback. T ≈ 10 minutes. Damage = 10R.
The math is unforgiving: the second playbook costs 2.5× the damage for the same investigation depth, just out-of-order. The investigation is not skipped — it happens after the bleed stops, with the same trace, the same evidence, and zero ongoing damage. The trace is more useful after rollback because the on-call can spend an hour on it without users being affected during that hour.
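The same arithmetic as a sketch (the value of R is hypothetical; the ratio is what matters, and it holds for any R):

```python
# Damage = R * T: R = users affected per minute, T = minutes from
# regression start to rollback. R's value here is hypothetical.
R = 50

T_bleed_stop_first = 4    # page -> ack -> 30s scope check -> rollback
T_investigate_first = 10  # page -> ack -> 5 min of trace-reading -> rollback

damage_fast = R * T_bleed_stop_first    # 200 users
damage_slow = R * T_investigate_first   # 500 users
print(damage_slow / damage_fast)        # 2.5, independent of R
```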
The cognitive trap is real: an experienced engineer looks at a failing trace and wants to know why, and rolling back without knowing why feels like giving up. The discipline is recognizing that rollback is not a verdict. It's a tourniquet. The investigation is the surgery; you don't operate while the patient is bleeding out.
The six phases, in order
The named phases are not arbitrary — each one closes a specific failure mode that comes up in postmortems:
- Page. The alerting system catches the SLO breach (refund-amount, error rate, eval pass rate). If this step takes longer than 30 seconds, the signals are wrong (see Module 2 on observability), not the response.
- Acknowledge. The on-call confirms they are on it. This is a 5-second action that prevents the secondary on-call from being paged 5 minutes later because nobody owned the alert.
- Confirm scope. One user vs many. One route vs the fleet. One region vs all. This is the 60-second triage that determines whether to fire the company-wide rollback or just throttle the affected pool. Scope-creep is the most common reason rollbacks fire late.
- Stop the bleed. Fire the rollback (or kill switch, or feature flag off). The bundle from Step 1 of Module 6 is the unit; the rollback command from Step 5 is the verb. After this point, no new users hit the regression.
- Capture trace. Take a failing trace and put it somewhere the postmortem will find it. This is the evidence; without it the investigation is theological.
- Post status. A one-line update to the company status channel: what broke, what we did, ETA on writeup. This is the cheapest goodwill purchase in the business — it costs 30 seconds and buys hours of patience from product, support, and the comms team.
After step 6, the on-call can investigate. The investigation itself is the subject of the next step in this module.
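A minimal sketch of what enforcing that order in tooling could look like (the phase names mirror the list above; the class and its API are illustrative, not a prescribed implementation):

```python
PHASES = ["page", "acknowledge", "confirm_scope", "stop_the_bleed",
          "capture_trace", "post_status", "investigate"]

class Playbook:
    """Refuses any phase until every earlier phase has completed."""
    def __init__(self) -> None:
        self.done: list[str] = []

    def complete(self, phase: str) -> None:
        if len(self.done) == len(PHASES):
            raise RuntimeError("playbook already complete")
        expected = PHASES[len(self.done)]
        if phase != expected:
            raise RuntimeError(f"out of order: next is {expected!r}, not {phase!r}")
        self.done.append(phase)

pb = Playbook()
pb.complete("page")
pb.complete("acknowledge")
# pb.complete("investigate")  # raises: next is 'confirm_scope'
```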
Why scope confirmation is not optional
The temptation in a real incident is to skip step 3 because step 2 already proved "the agent broke." Two stories from real postmortems where skipping scope cost a team:
- The "fleet-wide rollback for a single-tool regression." A canary alert fires for
lookup_account. The on-call reflexively rolls back the whole bundle to v1.4 — but the regression was only inrefund_tool, a different code path, and v1.4 has its own known regression inpayment_capturethat they'd been trying to deprecate. They traded one regression for another, on a wider blast radius. The cost was a second incident on top of the first. - The "I thought it was fleet-wide so I escalated and woke up three teams." An on-call sees error rate spike and pages the entire reliability team. Five minutes later, scope check shows it's one customer's webhook integration that returned malformed JSON. The fix was a single config change. Three engineers ate their breakfast standing up because the on-call's first instinct was "page everyone, sort it out later."
Scope is a 60-second discipline that prevents 60 minutes of cleanup. The on-call's mental script is literally: one user or many? one route or all? one region or all? — three questions, ten seconds each, before any action.
When to deviate
The playbook is a default, not a law. Three honest deviations:
- Customer-perceived severity is asymmetric. A refund-amount regression that overcharges customers is more urgent than an output-style regression that uses too many bullet points. The first deserves rollback at scope=1; the second can wait for scope check.
- The bleed is not what you think it is. A "regression" that turns out to be downstream API drift doesn't get fixed by rolling back the agent — the bleed-stop in that case is rate-limiting the failing tool, not the rollback. The investigation question is "what would rolling back actually accomplish?" — if the answer is "nothing, because the bug isn't in the agent," the playbook step is "kill the affected tool path," not "rollback."
- The data is being corrupted. Rolling back the agent does not unwrite bad database rows. The on-call's first action becomes "stop further writes" (kill switch), and the rollback is a secondary action after data integrity is preserved. (Module 6's rollback step covers this case.)
In all three cases, the on-call still moves through the playbook. They just adjust the bleed-stop step to fit the situation. The acknowledge / scope / capture / status order remains.
The on-call notebook
A working on-call carries a one-page notebook (or its Notion/runbook equivalent) with three things on it:
- The rollback command for each service they cover. Verbatim. Not "look it up in the runbook." Verbatim.
- The page-to-ack-to-rollback path, with the buttons or commands at each step.
- The "who do I tell" line for status posts, including the channel name and a one-line template.
The notebook is what makes the playbook executable under pressure. An on-call who has to grep through three repos for the rollback syntax has already lost 4 minutes of the 15. The discipline is to compress the playbook into muscle memory and the muscle memory into a single page. Charity Majors' "On-call shouldn't suck" makes this point at the team level: the difference between a fast team and a slow team is usually not skill, it's pre-decided playbooks and an on-call experience that the team trusts enough to actually rehearse.
Try it: Stage 1 of the right pane is the page-to-investigate clock. Each playbook button is only enabled when its turn comes — the playbook order is enforced by the UI, not just suggested. Pick "Bleed-stop first" (top-left), then click ▶ Start clock — the simulated T-clock advances at 30× wall-speed. At T=00:15 the page auto-fires; the next-correct button lights up at each step: 1. Acknowledge → 2. Confirm scope → 3. 🛑 Rollback → 4. Capture trace → 5. Post status → 6. Investigate. The "users affected" bars stop growing the moment you click rollback. Now click "Investigate first" and reset: the same six actions are reordered to 1. Acknowledge → 2. Confirm scope → 3. Capture trace → 4. Investigate → 5. 🛑 Rollback → 6. Post status. The rollback button is intentionally locked until trace+investigate fire, so the bleed runs all through the investigation phase. The user-affected counter climbs steeply during steps 3 and 4; the same regression costs ~2.5× the users by the time you finally rollback. Same evidence, same rollback command, different cost — driven entirely by the order the playbook puts them in.
Trace Replay in Anger
What is trace replay?
Trace replay is the operational practice of taking a captured failing trace — every span, every input, every tool call, every model output — and re-running it against a side cluster (staging, a sandbox, a dedicated replay environment), with one variable swapped, to see whether the symptom moves. It is the closest thing an agent system has to a debugger: you cannot set a breakpoint inside a model's forward pass, but you can hold the prompt constant and swap the model snapshot, hold the model constant and swap the schema, hold both constant and swap the downstream tool stub. Each swap is an experiment. Each replay is an answer.
In a classic distributed system, an incident debugger has print statements, a profiler, and a tracing UI. In an agent system, the trace + replay is the equivalent of all three — the trace is the print statement (input → output at every span), the replay is the profiler (where does the symptom live?), and the side cluster is the sandbox (where you can rerun without affecting users). Module 2 on observability introduced replay as a capability of the trace store; this step is about using it under incident pressure.
What replay actually does
A trace store with replay support exposes one primitive that looks roughly like this:
```
$ trace replay --trace-id 5c4e... --override model=gpt-5.4-2026-01-10 \
    --against staging-side-cluster
```
What the system does on receiving that call:
- Pulls the trace identified by trace-id from the trace store.
- Hydrates the original inputs at each span (user prompt, system prompt, tool-call arguments) and the original recorded outputs (the model's responses, the tool's responses) — the recorded outputs are kept around so the new run's outputs can be diffed against them, not because they're "replayed" verbatim.
- Re-runs the agent loop in the staging cluster: the model is re-executed (live model call, with the override applied if --override model=...), and tool calls are either re-executed against staging endpoints (if --override downstream=...) or replayed from the recorded responses (the default — the trace's recorded tool response is returned in place of a live tool call). All inputs are held identical to the original trace except the one variable named in --override.
- Returns a new trace, side-by-side with the original.
The output is binary: the symptom reproduces, or it doesn't. That's it. The replay is not the fix. It's a single experiment that narrows the hypothesis space by one.
The mental model: model calls always re-execute (you want to know what the new variant emits); tool calls replay-from-recording by default (so you don't depend on a live downstream that may have changed since the incident); both can be overridden. This is what makes replay deterministic enough to be useful — the only variable that changes between original and replay is the one you intentionally swapped.
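A sketch of those semantics, assuming a minimal span shape (the Span fields and the staging client's methods are assumptions for illustration, not a real trace-store API):

```python
from dataclasses import dataclass

@dataclass
class Span:               # minimal stand-in for a trace-store span
    kind: str             # "model" or "tool"
    name: str             # model snapshot or tool name
    input: dict
    recorded_output: dict

def replay(spans: list[Span], override: dict, staging) -> list[tuple]:
    """Re-run one trace with exactly one variable swapped.

    Model spans always re-execute (we want the new variant's output);
    tool spans return the recorded response unless the downstream is
    explicitly overridden to hit live staging endpoints.
    """
    results = []
    for span in spans:
        if span.kind == "model":
            model = override.get("model", span.name)   # the one swap
            output = staging.call_model(model, span.input)
        elif span.kind == "tool" and "downstream" in override:
            output = staging.call_tool(span.name, span.input)
        else:
            output = span.recorded_output   # deterministic replay from recording
        results.append((span.name, output, output != span.recorded_output))
    return results   # the diff column answers: did the symptom move?
```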
The hypothesis swap menu
In a typical agent incident, four variables are worth swapping, in roughly this order of investigative payoff:
- Swap the model. Most likely to be the culprit when the regression coincides with a model snapshot bump. If claude-opus-4-7-20260108 ships a new version and the regression appears the next day, swapping to the previous snapshot and replaying is the cheapest first experiment. If the symptom reproduces under the previous snapshot, the model is not the cause. (Note: version pinning from Module 6 is what makes "the previous snapshot" a concrete artifact you can swap to.)
- Swap the prompt. Most likely when the regression coincides with a prompt PR landing. The atomic bundle from Module 6, Step 1 means every prompt change is a versioned artifact you can roll back to in the replay.
- Swap the schema. Most likely when the regression coincides with a tool-schema change, or when the bug looks like a model bug but the offending field is actually being emitted with a malformed type, unit, or precision. (This is the case the simulation's example illustrates — the model "got the wrong answer" but the schema didn't enforce the unit.)
- Swap the downstream. Most likely when nothing on the agent side changed but a downstream tool's behavior shifted (API drift, rate-limit change, data quality regression). Replay against a staging downstream — if the symptom disappears, the bug is not in the agent.
The discipline is one swap at a time. Swapping two variables simultaneously turns "the bug is here" into "the bug is somewhere in this 2D space." Swap, replay, observe, swap again. This is the agent equivalent of bisecting a regression range — slow if you've never done it, fast once you've drilled it.
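The bisection loop, sketched; `reproduces` stands in for a wrapper around the replay primitive (stubbed here), and the candidate artifacts are illustrative names:

```python
def reproduces(trace_id: str, override: dict) -> bool:
    """Hypothetical wrapper around `trace replay --override ...` that
    returns True when the replayed trace shows the symptom."""
    ...  # shell out to the trace store; stubbed for illustration

def bisect(trace_id: str) -> str:
    # One swap per replay, in rough order of investigative payoff.
    swaps = {
        "model": "previous-model-snapshot",
        "prompt": "previous-prompt-bundle",
        "schema": "schema-with-unit-field",
        "downstream": "staging-endpoints",
    }
    for variable, candidate in swaps.items():
        if not reproduces(trace_id, {variable: candidate}):
            return variable   # symptom moved: the bug touches this surface
    return "input"            # reproduces under every swap: suspect the trace itself
```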
When replay clears but doesn't fix
Some of the most useful replays don't identify the root cause directly; they only narrow the surface area. A few patterns:
- Cleared in staging, fails in prod. The replay swap was "swap downstream to staging tools." The symptom disappears. But staging is not prod; the tool in staging may emit different units, different shapes, different rate-limit behavior. The replay told you "the bug touches the downstream surface," not "the bug is fixed." Do not ship "we ran it against staging and it cleared" as the postmortem conclusion. Make the next replay go after the actual differing field.
- Reproduces under every swap. The replay swaps model, prompt, schema, downstream — the symptom appears in every case. This is the "bug is in the trace itself" pattern: the input contains something pathological, and the agent (regardless of model / prompt / schema / downstream) does the same thing with it. The investigation moves to the input — this is exactly the trace-shrinker pattern in Step 3.
- Cleared once, fails on second replay. Almost always a stochastic-system manifestation. The model emitted a different token this time; the bug isn't deterministic, it's probabilistic. The fix is a prompt-side guard or a regression eval that exercises the probability, not a one-shot patch.
The shape to internalize: replay produces evidence, not verdicts. Each replay is one piece of evidence. The full postmortem (Step 4) is what synthesizes the evidence into a fix.
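For the stochastic case in particular, the evidence is a rate, not a bit. A sketch, assuming a seeded replay call exists (stubbed here; all names are illustrative):

```python
def reproduces_with_seed(trace_id: str, override: dict, seed: int) -> bool:
    """Hypothetical seeded replay against the trace store; stubbed."""
    ...

def repro_rate(trace_id: str, override: dict, n: int = 50) -> float:
    hits = sum(bool(reproduces_with_seed(trace_id, override, s)) for s in range(n))
    return hits / n

# A prompt-side guard "works" when the rate drops across seeds,
# not when a single replay happens to come back green:
#   repro_rate("5c4e...", {})               -> e.g. 0.22
#   repro_rate("5c4e...", {"prompt": "v6"}) -> e.g. 0.01
```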
What if you can't replay?
A team that hits an incident and discovers their trace store doesn't expose a replay primitive is in trouble. Not because the incident is unsolvable — the team will solve it the slow way, by reading logs and guessing — but because every future incident will be solved the slow way too. The trace store is a dependency for incident response; if it isn't queryable in the right shape, incident response is permanently slower than it should be.
The honest reading: if your team's first replay-in-anger attempt during this incident fails because the trace store doesn't support overrides, the postmortem's top action item is not the bug fix. It is "make replay work by next month, before the next incident." Module 2's replay step covers what the trace store needs to expose; this module covers using what it exposes.
A second honest reading: even with a good trace store, the team that hasn't practiced replay under non-incident conditions will fumble it under incident pressure. The drill step covers why scheduled replay drills are the difference between a 4-minute replay and a 14-minute one.
The replay-first instinct
A few real-world heuristics from teams that drill this:
- Replay before you theorize. A team's first impulse during an incident is to brainstorm hypotheses ("maybe the model regressed, maybe the schema, maybe..."). The faster instinct is "fire a replay against the previous bundle and see if it clears." Two minutes of replay beats two hours of theorizing.
- Keep a "known-good" replay artifact. Before every deploy, the last green replay against the previous bundle is preserved. At incident time, that artifact tells you whether any known prior bundle would have handled the same input — useful when the regression bisecting is non-obvious.
- Replay is a one-way ratchet on observability. Every time the team replays and discovers a missing data point ("the trace doesn't record the tool's HTTP status code"), the next observability PR adds it. Six months of incident-driven instrumentation produces a trace store that is exactly as informative as your incident history demands.
Try it: Stage 2 of the right pane is a captured failing trace from a refund-agent incident — three spans (plan_refund, lookup_order, write_refund) with the last one in red (wrote $4000 instead of $40.00). Click each of the four hypotheses in turn (Swap model, Swap prompt, Swap schema, Swap downstream), then ▶ Replay against side cluster for each. Watch which replays keep the trace red and which turn it green. Swap model stays red — the model isn't the cause. Swap prompt clears, but for a workaround reason (v4 had a cents/dollars guard that v5 dropped). Swap schema clears for the right reason — adding a unit field forces the tool to emit "cents" and the agent has no ambiguity. Swap downstream clears in staging but for an irrelevant reason — staging happens to emit dollars natively. Each replay's footer note explains the trap.
Root-Cause Discipline
What is root-cause discipline for a stochastic system?
Root-cause discipline is the practice of asking "why" until each branch of the question terminates at something the team can act on — even when one of those branches lands at "the model emitted this token," which in a deterministic system would look like a cop-out and in a stochastic system is a perfectly valid terminal. The shape is borrowed from Toyota's five-whys and the Google SRE postmortem template — the modification for agent systems is that some branches end at deterministic fixes (a schema field, a missing guard, a malformed JSON) and others end at probabilistic ones (a prompt regression eval, a few-shot example, a sampling temperature change). Both are valid root causes; treating only the deterministic ones as "real" is the failure mode this step is about.
The investigation cannot start until the bleed is stopped and a failing trace has been replayed. The replay narrows the hypothesis space; root-cause discipline takes the narrowed space and reduces it to a single concrete change — usually a PR — that prevents the regression class from recurring.
Five-whys in the agent context
The five-whys mechanic, adapted to agents:
- State the symptom. "Refund agent returned $4000 instead of $40.00 for order A21."
- Ask "why?" "Because the tool returned the value 4000 and the agent wrote $4000 in the response."
- Ask "why?" again. "Because the tool returned 4000 cents, and there was no unit field, so the agent treated it as dollars."
- Ask "why?" "Because the
lookup_orderschema'samountfield has nounitcompanion. The schema design from 18 months ago assumed cents implicitly." - Ask "why?" "Because there was no eval case that tested cents/dollars ambiguity, so the schema design was never stress-tested. The first time it broke was in production."
The 5th "why" is the operational root cause: there was no eval case for this failure mode. The fix is two PRs: one to add the unit field, one to add the eval. The two PRs together cover both the immediate bug and the gap that let it ship. (Module 5's pre-registered metric discipline is what makes "add the eval" a load-bearing fix, not a fig leaf.)
The classical objection: isn't this just kicking the can? — "There was no eval" is itself a why. Why was there no eval? Because the team didn't anticipate the failure mode. Why didn't they? Because the schema was authored before the team had a cents/dollars-aware eval culture. At some point the chain bottoms out at "we didn't know this could fail," which is not actionable in retrospect. The discipline is to stop at the deepest layer that is actionable — usually the eval gap, sometimes the schema, sometimes the prompt.
The two terminal kinds — they ship together
A real agent incident usually has both a deterministic terminal and a stochastic one, and the complete fix bundle covers both. The cents/dollars example above has a contract gap (Branch A: schema lacks a unit field) and a residual + eval gap (Branch B: even with the contract fixed, the model could still produce a malformed output under a future prompt edit unless a guard + eval lock the behavior down). Treating these as competing explanations — "we just need the schema fix" or "we just need a prompt guard" — is the failure mode this step is about.
Branch A — deterministic terminal. The fix has a single PR. The schema gets a new field, the validator rejects the malformed call, the test asserts the new contract. There is no probability in the fix; the code now does the right thing every time.
Branch B — stochastic terminal. The fix has two PRs: a guard (prompt change, output validator, post-hoc check) and an eval (a regression test that would have caught this, exercised over many seeds to capture the probability). The eval is not optional — without it, the fix is a one-off patch that future prompt edits will silently undo. The guard reduces the probability; the eval prevents future regressions of the probability budget.
The reason this distinction matters for postmortems: deterministic fixes are easy to write and easy to review. Stochastic fixes are easy to write and hard to review — a reviewer can't tell whether the prompt guard actually reduced the failure rate without seeing the eval pass/fail across seeds. The postmortem's "specific changes" section (Step 4) must name both terminals' PRs (the schema PR + the prompt guard PR + the eval PR), or the review is incomplete.
A useful heuristic for which branch you're walking: if the bug's mechanism can be described without using the words "model," "prompt," "probability," or "sampling," you're on the deterministic branch. If those words are unavoidable, you're on the stochastic branch. Most real incidents have both — the postmortem's job is to walk both, not to pick one.
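A sketch of what Branch A's deterministic terminal might look like once fixed; the field names, allowed units, and conversion helper are assumptions for illustration (Branch B's guard and eval ship alongside this, per the text above):

```python
# Branch A, sketched: the schema gains an explicit unit and the
# validator makes the old ambiguity a hard error. Names illustrative.
ALLOWED_UNITS = {"cents", "dollars"}

def validate_refund_call(args: dict) -> None:
    if "unit" not in args:
        raise ValueError("amount requires an explicit unit (cents|dollars)")
    if args["unit"] not in ALLOWED_UNITS:
        raise ValueError(f"unknown unit {args['unit']!r}")

def to_cents(args: dict) -> int:
    validate_refund_call(args)
    return args["amount"] if args["unit"] == "cents" else round(args["amount"] * 100)

assert to_cents({"amount": 4000, "unit": "cents"}) == 4000    # $40.00
assert to_cents({"amount": 40.00, "unit": "dollars"}) == 4000
# to_cents({"amount": 4000})  # raises: the old ambiguity is now a hard error
```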
The trace shrinker
A trace at incident time is the failure under realistic conditions: 12 fields in the input, 4 tool calls, 8 model turns. The shrinker discipline is to peel the trace down until you find the smallest input that still reproduces the symptom — that is the bug's actual surface area. Everything you could remove without losing the symptom is incidental; what's left is essential.
The mechanical procedure:
- Start with the full failing input.
- Pick one field, remove it, replay.
- If the symptom still reproduces, the field was incidental — leave it removed.
- If the symptom stops reproducing, restore the field — it's load-bearing for the bug.
- Repeat for each field. The bug's surface area is the set of fields you could not remove.
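That procedure is mechanical enough to script. A sketch, with `reproduces` passed in as a predicate (hypothetical, standing in for a replay against the side cluster):

```python
def shrink(full_input: dict, reproduces) -> dict:
    """Peel fields off the failing input one at a time; keep a removal
    only if the symptom survives it. What's left is the bug's surface."""
    essential = dict(full_input)
    for field in list(essential):
        candidate = {k: v for k, v in essential.items() if k != field}
        if reproduces(candidate):   # symptom survives: field was incidental
            essential = candidate
        # else: field is load-bearing; leave it in place
    return essential

# On the incident's trace, shrink() would return {"amount": 4000}:
# the smallest input that still reproduces, and the eval's test case.
```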
This is the same technique the C-Reduce paper formalized for compiler bug reports: shrink the input until you have the smallest witness. The same trick works on agent traces because traces are structured the same way — a tree of inputs, each one independently removable.
The output of the shrinker is two things at once: the test case for the regression eval (the smallest input that reproduces), and the concrete surface area the fix has to cover (the set of fields that matter). Without the shrinker, the postmortem ends up writing "be careful with refund amounts" — which is not a fix.
The why-tree that's actually one why
A short anti-pattern worth naming. Sometimes the why-tree collapses to a single "why" with five different surface variations. Example:
- Why did refund overcharge happen? → Because amount was 4000 cents.
- Why was amount 4000 cents? → Because the tool returned cents.
- Why did the tool return cents? → Because cents was the schema's implicit unit.
- Why was cents the implicit unit? → Because the schema was authored without a unit field.
This is a real five-whys, but only superficially — every step is restating the same surface in different language. The actual depth is one step ("the schema has no unit field"). When this happens, stop. Don't pad the chain artificially. The discipline is honesty about how deep the question actually goes, not depth for its own sake.
The same anti-pattern in reverse: a chain that looks like four whys but is actually four different bugs in disguise. If your "why" steps jump from "tool returned cents" to "the auth boundary leaked" to "the load balancer rate-limited," you have multiple independent bugs and need multiple five-whys, one per bug. Single chain, single bug. Two bugs, two chains.
The five-whys output is the eval set
The most useful single output of root-cause work is the regression eval that gets added to the golden set from Module 5. Each incident, walked all the way down to terminal, deposits one new case into the bench. Three years of disciplined root-cause work produces an eval set that is exactly tuned to the failure modes your agent has actually exhibited — which is the best possible regression bench, because no one else has those failure modes.
This is the closed-loop logic the deployment-and-rollout module ended on: the eval set grows with each incident, and the canary's job shrinks back to "catch the residual" — the failures the bench doesn't yet have. Over time, the bench catches the named failure modes at PR-time, and the canary is left handling the unanticipated ones. The team's anticipation improves; the canary's responsibility narrows. That's compounding reliability.
Try it: Stage 3 of the right pane has two interactive surfaces. Upper panel — five-whys ladder: start at the top with "Refund agent returned $4000 instead of $40.00." Two branches descend: Branch A ("tool returned an ambiguous number") ends at the deterministic terminal — the schema needs a unit field. Branch B ("model emitted '$4000' given an ambiguous input") ends at the stochastic terminal — a prompt guard + a regression eval. Walk both; the closing note explains that they are not alternatives — a complete root-cause for this incident ships the schema PR, the prompt guard, and the eval case as one bundle. Lower panel — trace shrinker: five input fields, all present, "still reproduces." Click "drop" on each field one at a time. Dropping customer_email, currency, reason, or order_id keeps reproducing — those are incidental. Dropping amount clears the symptom — that's the bug's surface area, and the smallest test case for the regression eval that ships in Branch B.
Postmortem Pattern
What is a useful postmortem?
A postmortem is the written artifact a team produces after an incident, structured so the next team — including future-you, including the engineer who joins six months from now — can understand what happened, why, and what specifically changes as a result. The structure is a fixed five-section template (timeline, what broke, why, what kept it broken longer than it should have, specific changes), and the quality of a postmortem is essentially the specificity of each section: timestamped events in UTC, mechanical descriptions of failure, named root causes, named detection gaps, and a list of merged PRs with linked numbers — not "we should add more monitoring" with no concrete signal named. Google's SRE Workbook calls the latter the most common postmortem anti-pattern; it shows up in nearly every team's first few postmortems and never quite stops appearing in their fiftieth.
The previous step (root-cause discipline) produces a chain of "whys" terminating at a deterministic or stochastic fix. The postmortem is what renders that chain into a public document — public to the team, public to the company, sometimes public to customers — so the failure mode joins the team's institutional memory rather than evaporating with the on-call's shift ending.
The five sections
The template most teams converge on, in order:
- Timeline. UTC. Every minute that mattered. Page → ack → first action → rollback → resolution → final all-clear. Each line: HH:MM UTC — what happened — who did it. Vague entries like "around 2pm" or "before lunch" are how you discover, three months later, that the timeline is unreconstructable.
- What broke. The mechanical description of the failure, with numbers. "Tool returned cents (4000) but agent formatted as $4000 instead of $40.00. 8 refunds overcharged totaling $312.40 before rollback." Not "the agent had a bug with refunds." Numbers > adjectives.
- Why. The root cause from the previous step, in one or two sentences. "Schema for amount lacks a unit field; the v5 prompt dropped the cents-guard that v4 had." Both terminals (deterministic + stochastic) named if both apply.
- What kept it broken longer than it should have. Detection lag, response lag, escalation lag. "Eval suite had no cents/dollars case (detection); on-call hesitated 4 minutes on rollback because last drill was 5 months ago (response); secondary on-call had to be paged because primary's pager battery died (escalation)." This is the section where most team improvements compound — every detection gap named becomes a future alert.
- Specific changes. PRs, alert thresholds, eval cases, runbook edits — with links. "PR #2341: add unit field. PR #2342: 3 new eval cases. PR #2343: refund-amount alert at >$1000 in 5 minutes." Every action item is a merged-or-mergeable artifact. Aspirational items ("be more careful") do not belong here.
The order is not arbitrary. Timeline first because everyone wants to know "when did it happen and what did we do." What-broke next because that's what readers need to understand the rest. Why third because root cause is meaningless without the mechanical context. Detection-gap fourth because it's the section that drives long-term improvement, not the immediate fix. Specific-changes last because they're the contract — what the team commits to ship.
The "add more monitoring" trap
The single most common anti-pattern, repeated almost verbatim across teams: the specific changes section contains some version of "we should add more monitoring" or "improve observability" with no specific signal named. This is not a change. This is an aspiration. A specific change names which metric, what threshold, over what window, paged to which channel. If you can't write that line, you don't have a change — you have a feeling.
The shape of a good monitoring action item:
PR #2343: Alert when refund_amount > $1000 occurs more than 5 times in a rolling 5-minute window. Routes to #agent-on-call. Threshold derived from historical p99 of $87, which means any spike above $1000 is automatically anomalous.
The shape of a bad one:
Add better refund monitoring.
The first is two days of work that produces a load-bearing alert. The second is two months of meetings that produce no alert. The discipline is to write postmortems whose action items a different engineer could execute without further discussion. If the change requires another meeting to specify, the postmortem isn't done.
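To make "executable without further discussion" concrete, here is the good item's logic as a minimal sketch; in practice this lives in the monitoring system's config, and the evaluator here is purely illustrative:

```python
from collections import deque
import time

WINDOW_SECONDS = 5 * 60
AMOUNT_THRESHOLD_CENTS = 1000_00   # $1000; historical p99 is $87
COUNT_THRESHOLD = 5

recent: deque[float] = deque()     # timestamps of anomalous refunds

def observe_refund(amount_cents: int, now: float | None = None) -> bool:
    """Record one refund; return True when the alert should page."""
    now = time.time() if now is None else now
    if amount_cents > AMOUNT_THRESHOLD_CENTS:
        recent.append(now)
    while recent and now - recent[0] > WINDOW_SECONDS:
        recent.popleft()           # evict events outside the rolling window
    return len(recent) > COUNT_THRESHOLD
```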
A second variant of the trap: "Add documentation." Documentation is a real thing, but "add documentation" is not. Which document, what section, what should it say, who maintains it. Same shape as monitoring — without the specifics, it's an aspiration.
Blameless does not mean detail-less
Most teams that adopt postmortem culture adopt it under the banner of blameless — the principle that the writeup focuses on systems and processes, not on naming the engineer who pushed the bad commit. This is correct and important; postmortems that name a culprit produce defensive writers who omit detail.
But blameless is not the same as vague. A blameless postmortem should still say exactly what happened: "At 14:11 UTC, the on-call ran deploy rollback from the wrong directory and the command silently no-op'd because the rollback script's working-directory check was missing." That sentence names a system gap (the script's working-directory check), not a person. It is specific without assigning blame. The on-call's name is in the timeline as the actor, not as the culprit.
Vague postmortems often hide behind "blameless" — "there was some confusion about the rollback command" — which is neither blameless nor useful. The discipline is specificity in the systemic frame: what happened, which tool/script/process failed, what about it failed. The person is a participant, not a defendant.
The cross-incident pattern recognition
The single highest-value output of a year of disciplined postmortems is not any individual writeup. It is the meta-pattern the team sees when they read six months of them together. The patterns that emerge from this aggregation:
- The same root cause keeps appearing. Three out of the last five postmortems trace back to schema fields without unit metadata. The fix is not a sixth one-off — it's a systemic change (a schema linter, a required unit-field convention).
- The same detection gap keeps appearing. Four postmortems name "the eval suite didn't have a case for this." That's not four eval misses; it's a process gap in how eval cases get authored. The fix is to embed eval-case-authoring into the PR template.
- The same response lag keeps appearing. Five postmortems show 3+ minute lag between page and rollback. That's not five slow on-calls; that's a drill-frequency problem. (Drills, the next step.)
Most teams skip this aggregation step. They write good individual postmortems and never re-read them as a corpus. The teams that do read their incident history as a corpus are the teams whose incident rate drops over time; the teams that don't are the teams that repeat the same incident class for years.
A practical cadence: once a quarter, a senior engineer reads the quarter's postmortems back to back, looking for repeated root causes, repeated detection gaps, repeated response lags. The output is a single "patterns we're seeing" document, which becomes input to the next quarter's reliability roadmap. This is the engineering-management equivalent of the drift detection from Module 5 — except the drift is happening in your incident profile, not your model behavior.
The audience problem
A subtle issue: the postmortem has at least three audiences, and they want different things.
- The engineering team wants the mechanical "what broke, why, what changes" — the technical core.
- Product and support want "what did users see, for how long, what should we tell them."
- Leadership wants "is the team learning from this, or is the same thing going to happen again."
Most teams write a postmortem for one audience and then re-explain it three times for the others. The mature shape is a single document with three sections clearly labeled — the technical core (sections 1–5 above), a user-impact summary at the top, and an "improvement commitments" section at the bottom that maps the specific changes to a calendar. The same document serves all three audiences without three rewrites.
The user-impact summary deserves named numbers: how many users hit the regression, how long it lasted, what the financial / trust / UX impact was. "Eight refunds were overcharged totaling $312.40. Refunds were issued by 18:00 UTC." Concrete in user terms. Engineering-internal terms ("8 traces failed") are not user-facing.
Try it: Stage 4 of the right pane is a postmortem assembly tool. The five sections are empty; the phrase library below has 15 candidate phrases tagged [concrete] (green), [vague] (amber), or [anti-pattern] (red). Click phrases to add them to a section; click again to remove. Try assembling a bad postmortem first — pick the vague and anti-pattern phrases. The overall pill says "Anti-patterns present." Now reset and pick only concrete phrases — pill flips to "Specific enough to act on." The per-section pills show how the same section quality reads from "empty" → "sparse" → "specific" → "anti-pattern" depending on which phrases you drop in. The lesson: the same incident produces a different postmortem depending on whose phrases the author chooses.
Incident Drills
What is an incident drill?
An incident drill is a scheduled, low-stakes execution of a real incident-response procedure — rollback, trace replay, page routing, kill switch — in a staging or low-traffic production environment, with a real on-call, on a real day, against the real procedure, without an actual incident in progress. The drill's purpose is to verify that the procedure still works, that the on-call still remembers it, that the tooling has not drifted, and that the time-to-completion is decreasing over time. The teams that drill these procedures monthly recover from real incidents in minutes; the teams that don't recover in tens of minutes — through no fault of the on-call, who is doing what the runbook says, only to discover the runbook is stale. Charity Majors' "Friday deploy freezes are exactly like murdering puppies" makes the same point about deploys: a procedure you avoid using is a procedure you can no longer trust.
This is the final step of the deployment / observability / incident loop the module started with. Module 6's rollback drill named the cadence (monthly in staging, quarterly in prod); this step explains why the cadence is the load-bearing piece, what specifically drifts when it lapses, and what a working drill calendar looks like across the year.
The three drill kinds
Most teams converge on three drill kinds, each exercising a different operational surface:
- Rollback drill. Pick a recent deploy, fire the rollback command, time it, then roll forward. Monthly in staging, quarterly in production. Exercises: the deploy system's rollback path, the warm-window for the previous version, the trace store's ability to attribute traffic to the right bundle, the on-call's muscle memory for the command.
- Replay drill. Pick a trace from the last 30 days, fire a replay against the side cluster with one override (model / prompt / schema / downstream). Exercises: the trace store's replay primitive, the side cluster's parity with prod, the on-call's hypothesis-swap muscle memory, the latency of getting a "did it reproduce" answer.
- Page-storm drill. Synthetically fire 5–10 alerts in a 2-minute window. Exercises: page routing (do the right people get paged?), de-duplication (does the on-call see 5 alerts or 1?), escalation (does the secondary get paged when the primary doesn't ack?), Slack / channel routing, ack/escalate/resolve state transitions.
Each drill kind exercises a different layer of the response stack. A team that drills only rollbacks discovers, during a real incident, that their replay path has drifted. A team that drills only paging discovers their rollback command has changed syntax. The mature shape is to rotate across all three, with a cadence per quarter that touches each at least once.
Why drills decay
A team builds an incident-response system, deploys it, and then never uses it again because nothing breaks. Six months later, something breaks at 2am, and the team discovers:
- The rollback command syntax changed. Some platform upgrade renamed vercel rollback to vercel deployments rollback, or the team's wrapper script's CLI changed. The on-call types the old command, gets an error, has to read the runbook.
- The on-call rotation has rotated. The engineer who set up the incident-response procedure left the team four months ago. The current on-call has never used the rollback command in anger. They read the runbook for the first time at 2am.
- The trace store's replay path has drifted. A schema migration added a required field that the trace store doesn't populate, so replay errors out. The on-call has to debug the replay tooling before they can debug the incident.
- The warm-window expired. The previous version was decommissioned 4 weeks ago because the platform's default warm-window is 30 days. Rollback now requires a re-deploy of v1.4 — a 20-minute operation instead of a 30-second one.
- The pager rotation config has drifted. The Datadog/PagerDuty config was edited 3 months ago; the team didn't notice because nothing paged. At 2am, only secondary gets paged because primary's rotation expired.
Each of these is a slow drift; none of them is detectable without exercising the procedure. The drill is the only mechanism that catches them before a real incident does.
The drill cadence that catches drift
A workable cadence for a team running an agent fleet in production:
- Monthly: a rollback drill in staging. A specific engineer is named the drill owner. They pick a recent deploy, fire the rollback, time it, document the duration. The document goes in a shared doc that compares duration over time. The drill is 10 minutes of clock time and 0 minutes of preparation if the system is working; an hour of preparation and 30 minutes of debugging if the system has drifted.
- Quarterly: a rollback drill in production. Same shape, real prod traffic. The team picks a low-risk recent deploy and rolls it back during business hours, with the comms team notified in advance ("we're testing rollback at 14:00 UTC, expect a brief blip"). Quarterly in prod is enough to catch drift; more often is operationally expensive.
- Quarterly: a replay drill. A trace from the last 30 days is picked at random; the on-call fires a replay with each of the four overrides (model / prompt / schema / downstream); time-to-answer is logged.
- Quarterly: a page-storm drill. Synthetic alerts fire; the on-call walks the ack/scope/rollback/trace/status path against the simulated incident. The drill ends when the on-call hits "all clear" within the 15-minute window from Step 1.
- Per-incident exemption: if a real incident fired in the last 30 days, the next scheduled drill is skipped. Real incidents are drills.
The cadence is not negotiable on a per-quarter basis. A team that says "we're too busy to drill this quarter" is the team that will discover their procedures don't work the next quarter. The drill is the cheapest way to surface drift; skipping it doesn't save time, it just defers the cost.
What gets measured during a drill
A drill is not just an execution. It is an evaluation of the execution against the previous one. The team tracks at minimum:
- Time-to-rollback. Page → ack → fire rollback command. Should be under 2 minutes for a well-drilled team; under 30 seconds for the page-to-ack portion alone.
- Time-to-replay-result. Open trace store → identify failing trace → fire replay with one override → see "reproduces or not." Should be under 3 minutes.
- Pager routing correctness. Did the right person get paged? Did the secondary get paged when the primary didn't ack within the timeout? Did the channel get the right notification?
- Runbook freshness. Did the on-call have to deviate from the runbook? Did they discover an out-of-date step? Was the rollback command in the runbook actually the right one?
A useful artifact: a single internal dashboard that shows, per drill kind, the median duration over the last 8 drills. The trend is the teaching — flat or rising means the team is drifting; declining means the muscle memory is compounding. This is essentially the serving-metrics percentile discipline from the LLM Serving track applied to the team's own operational reflexes.
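The dashboard's core computation is small; a sketch, with a hypothetical drill log (durations in minutes):

```python
from statistics import median

drill_log = [
    ("rollback", 9.0), ("replay", 14.0), ("rollback", 7.5),
    ("page-storm", 11.0), ("rollback", 6.0), ("replay", 9.5),
    ("rollback", 4.5), ("replay", 6.0),
]

def trend(log, kind, window=8):
    """Median duration of the last `window` drills of one kind."""
    durations = [d for k, d in log if k == kind][-window:]
    return median(durations) if durations else None

for kind in ("rollback", "replay", "page-storm"):
    print(kind, trend(drill_log, kind))
# Declining medians = compounding muscle memory; flat or rising = drift.
```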
When the drill goes wrong
The honest part of drill culture: the first few drills will go wrong. The team will discover the rollback command syntax has changed, the replay tooling errors out, the pager goes to the wrong channel. This is the point of drilling — discovering these failures on a Tuesday afternoon rather than during a real incident.
A drill that surfaces a broken procedure is a successful drill. The output is a PR fixing the procedure, not a postmortem about the broken drill. The team's response should be relief ("we found this before a customer did"), not embarrassment ("our procedure was broken"). The teams that internalize this stay drilling; the teams that don't go through a cycle of "drilling looks bad → stop drilling → real incident reveals broken procedure → drilling looks good again → start drilling" — at the cost of a real incident.
A useful inversion: every broken drill should produce one PR that fixes the procedure and one PR that adds the gap as a regression test for future drills. "The rollback command's working-directory check was missing" produces both a fix to the script and a test that the next drill executes from a non-trivial working directory. This compounds: each drill makes future drills more robust.
The end of the module — and how it connects back
This module has been the response counterpart to Module 6's preparation discipline. The deployment module is the operating practice that minimizes the rate of incidents — atomic bundles, canary, rolling release, version pinning, rollback discipline. The incident module is the operating practice for when an incident happens anyway — the first 15 minutes, replay, root cause, postmortem, drills.
The two modules together form one closed loop:
```
PR + eval gate (Module 5)
 └► Bundle (Module 6, S1)
  └► Canary (S2)
   └► Rolling release (S3)
    └► Pinned snapshot (S4)
     └► Rollback discipline (S5)
      └► First 15 minutes (Module 7, S1)
       └► Trace replay (S2)
        └► Root cause (S3)
         └► Postmortem (S4)
          └► Drills (S5)
           └► back to PR + eval gate
```
Each incident, walked all the way down to its terminal, deposits a regression case into the eval set, an alert into the monitoring config, a step into the runbook, a fix into the procedure. The teams that close this loop see their incident severity drop over years even as their traffic grows; the teams that don't see incidents grow with traffic linearly. Module 8 (Agent Teams) picks up the next thread: when one well-instrumented agent isn't enough and the right answer is a team of agents — with its own coordination overhead, its own failure modes, and its own version of every discipline this module covered.
Try it: Stage 5 of the right pane is a 12-month drill calendar. Each quarter (Q1–Q4) has slots for + Rollback drill, + Replay drill, + Page-storm drill. A 🚨 fires at month 9 (Q3) — a real incident. The MTTR formula is kind-aware: each unique drill kind exercised before the incident saves a different amount (rollback −6 min, replay −5 min, page-storm −4 min); duplicates of the same kind add only a small repetition bonus. First, click "Fire incident" with no drills scheduled — MTTR reads 22 min, drift = severe, the failure list shows three red ✗ rows (rollback path stale, replay tooling broken, paging config drifted). Reset, stack 6 rollback drills across Q1–Q3 (no replay, no page-storm), fire again — MTTR only drops to ~14 min because the rollback path is well-worn but the replay and paging surfaces are still untested; an amber ⚠ only 1/3 kinds covered warning fires. Reset, schedule one of each kind (3 total) in Q1–Q3 — MTTR drops to ~7 min and drift = none, all three failure rows are green. The teaching: drill kind matters more than drill count. Six rollbacks don't substitute for one replay drill; each procedure has its own decay surface that only its own drill catches.
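The stage's MTTR arithmetic, reconstructed as a sketch; the base MTTR and per-kind savings are from the stage text, and the repetition-bonus value is an assumption (the text only says "small"):

```python
BASE_MTTR = 22                      # minutes, with no drills at all
SAVINGS = {"rollback": 6, "replay": 5, "page-storm": 4}
REPEAT_BONUS = 0.5                  # assumed; the stage only says "small"

def mttr(drills: list[str]) -> float:
    saved = 0.0
    for kind in set(drills):
        saved += SAVINGS[kind]                        # first drill of a kind
        saved += REPEAT_BONUS * (drills.count(kind) - 1)  # repeats of same kind
    return max(BASE_MTTR - saved, 0.0)

print(mttr([]))                                    # 22.0: drift = severe
print(mttr(["rollback"] * 6))                      # 13.5: one well-worn kind
print(mttr(["rollback", "replay", "page-storm"]))  # 7.0: kind coverage wins
```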
Further Reading
- Google SRE — Postmortem Culture — the canonical blameless postmortem template.
- Google SRE — Incident Response — the page-to-resolution playbook for production incidents.
- Charity Majors — On-call shouldn't suck — what an on-call rotation needs to be sustainable, including the drill cadence that keeps procedures sharp.
- Charity Majors — Friday deploy freezes are exactly like murdering puppies — deploy velocity as a reliability signal, and why avoiding the procedure is what breaks it.
- Toyota — Five Whys — the root-cause technique adapted in Step 3.
- C-Reduce paper (Regehr et al., PLDI 2012) — the input-shrinking discipline applied to compiler bugs, transferable to agent traces.
- Anthropic — Building Effective Agents — the operating model this module's discipline ladders into.