Capstone — Reliability Operations for an Agent Fleet
SLOs for an Agent Fleet
What is an SLO for an agent?
An SLO (Service Level Objective) is a measurable target on a user-felt metric, set explicitly so the team can be graded against it: "95% of refund tasks complete successfully within 10 seconds, measured weekly." The SLO is the contract you make with your users in the language of telemetry — not the marketing language of "feels fast" or "works reliably." Everything else in this capstone (error budgets, on-call rotations, runbooks, drills) is mechanics for honoring SLOs. Without them, the whole apparatus has nothing to defend.
This module is the operating-model conclusion of the Agent Engineering track. The prior eight modules built the artifact — a harness, observability, guardrails, cost and latency engineering, evals, deployment discipline, incident response, multi-agent topology. This capstone names the contract the artifact is run against. The shape is borrowed from Google's SRE book, and the borrowing is principled: agents are stochastic systems running in production, and the SRE community spent 15 years building the operating model for stochastic systems. We are not reinventing it; we are translating it.
The four parts of every SLO
A well-formed SLO has exactly four parts. Drop any one and you have a wish, not a contract.
- Metric (the SLI — Service Level Indicator). Task success rate, p95 latency, cost per task, p99 tool-call latency. The SLI has to be measurable from your production telemetry — exactly what Module 2 built into your span pipeline. If you cannot graph it, you cannot SLO it.
- Target. A specific number: 95%, 92%, $0.04 per task, 1.5s p95. Not "low" or "acceptable." A number the on-call can compare a dashboard reading to without judgment.
- Window. Rolling 7-day, calendar week, rolling 30-day. The window decides how fast you grade yourself. Shorter windows are louder (you alert more often) but noisier (one bad afternoon ruins the week). Longer windows are quieter and more honest at scale.
- Scope. "All tasks" vs "production-tagged tools only" vs "enterprise tier customers." SLOs apply to populations, not to "the agent." Most production teams run multiple SLOs at different scopes — a strict SLO for paid traffic, a looser one for free.
The right pane runs you through six candidate statements. Three are real SLOs by these four criteria, three are not — click each to reveal which and why.
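One way to keep the four parts honest is to store them as data rather than prose, so that a missing part is visible as a missing field. A minimal sketch in Python, assuming a hypothetical in-house format; the field names are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    """A well-formed SLO: all four parts are required fields."""
    sli: str          # the metric, measurable from production telemetry
    target: float     # a specific number, not "low" or "acceptable"
    window_days: int  # how fast you grade yourself (7, 30, ...)
    scope: str        # the population the SLO applies to

# "95% of refund tasks complete successfully, rolling 30 days, paid traffic only."
refund_slo = SLO(
    sli="refund_task_success_rate",
    target=0.95,
    window_days=30,
    scope="paid traffic, production-tagged tools only",
)

# A wish ("the agent should feel fast") has nothing to put in target,
# window_days, or scope -- which is exactly the tell.
```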
Why 100% is not an SLO
The single most common mistake when teams first write SLOs is targeting 100%. "All tasks should succeed." "Every refund should complete." "Zero failed tool calls." This feels like the responsible target — it isn't.
Every stochastic system fails some fraction of the time. The LLM has a non-zero hallucination rate; the retrieval pipeline has a non-zero recall failure; the downstream tool has a non-zero 5xx rate; the model provider has a non-zero outage rate. A 100% target means every one of those failures is an SLO breach, which means you are always breaching, which means SLOs no longer steer anything — the alert is permanent and the team learns to ignore it.
The honest move is the move Google's SRE practice made: pick a target below 100%, accept that the remaining percent is the budget you will spend, and use that budget to ship features. The SLO is "95% succeed," the budget is "5% are allowed to fail." This is Step 2 — and it is the single most counterintuitive shift this module asks for.
What makes agent SLOs different from API SLOs
The SRE model was built for HTTP services where success is binary (2xx vs 5xx) and latency is bounded by a single backend call. Agents break both assumptions:
- Success is fuzzy. "Did the agent answer correctly?" is not a status code. Most teams settle on a judged SLI — a sample of traffic graded against the golden case suite from Foundations Module 7 or an LLM-as-judge calibrated against human labels. The judging itself is part of the SLO infrastructure.
- Latency has a wide distribution. An agent that completes 80% of tasks in 2s and 5% in 60s has a different operational profile than one that completes everything in 8s. SLOs almost always run on percentiles — p50, p95, p99 — never on means. The mean hides the long tail; the long tail is where the user pain lives.
- Cost is a first-class SLI. API SLOs rarely include cost because the unit cost of an HTTP call is small and known. Agent unit cost varies 10×+ by task and is a real failure mode — a task that "succeeds" for $4 of model spend is a different failure than a task that fails for $0.04. Production agent SLOs usually include a p95-cost-per-task SLO alongside success and latency.
The translation from SRE works because the shape of the operating model survives the differences. SLO → error budget → burn-rate policy → on-call → runbook → drill is the same loop whether the system is a CDN or a refund agent. The specifics change; the structure does not.
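All three differences show up directly in how the SLIs are computed. A sketch of the aggregation, assuming the Module 2 span pipeline emits one record per task with judged-success, latency, and cost fields; the record shape and numbers here are illustrative:

```python
import statistics

# Hypothetical per-task records emitted by the span pipeline (Module 2).
tasks = [
    {"judged_success": True,  "latency_s": 2.1,  "cost_usd": 0.03},
    {"judged_success": True,  "latency_s": 7.9,  "cost_usd": 0.05},
    {"judged_success": False, "latency_s": 61.0, "cost_usd": 0.42},
    # ...thousands more per window in production
]

def pctile(values, p):
    """p-th percentile (inclusive method); SLOs run on percentiles, never on means."""
    return statistics.quantiles(values, n=100, method="inclusive")[p - 1]

success_rate = sum(t["judged_success"] for t in tasks) / len(tasks)  # judged SLI
p95_latency = pctile([t["latency_s"] for t in tasks], 95)            # latency SLI
p95_cost = pctile([t["cost_usd"] for t in tasks], 95)                # cost SLI

print(f"success={success_rate:.1%}  p95 latency={p95_latency:.1f}s  p95 cost=${p95_cost:.2f}")
```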
What you actually do with an SLO
The SLO does three jobs at once:
- It tells the on-call when to act. A page that fires on "the SLO is being burned faster than the budget refills" is a page that fires on user pain, not on backend noise. See Step 3.
- It tells the team when to stop shipping. The error-budget policy in Step 2 names the burn rate at which feature work pauses and reliability work takes over. This is the SLO doing its hardest job — overruling the org chart in favor of the contract.
- It tells the product when to renegotiate. If the SLO is consistently met and the budget is consistently full, the target was too loose — tighten it. If the SLO is consistently breached without anyone changing user behavior to reduce load, either the system needs investment or the SLO needs loosening. SLOs are renegotiated, not fixed forever.
A team that writes an SLO and then never references it again has not written an SLO. They have written a wish in different syntax.
Try it: Stage 1 of the right pane lists six candidate statements. Three are real SLOs (metric + target + window + scope); three are wishes dressed up in SLO language. Click each statement to reveal whether it qualifies and, when it doesn't, what would fix it. The pattern is consistent: anything missing one of the four parts — a number, a timescale, a population, a measurable metric — fails the test, no matter how confidently it's written. The bottom callout is the four-part rubric you can apply to your own draft SLOs.
Error Budgets
What is an error budget?
An error budget is the operational consequence of an SLO: if your SLO target is 95%, your error budget is the remaining 5% — the failure rate you have explicitly given the team permission to spend. The budget converts an SLO from an abstract contract into a decision rule. When the budget is healthy, the team ships features. When the budget is burning fast, the team is, by pre-committed policy, off feature work and onto reliability work. The brilliance of the construction is that no one has to renegotiate the trade-off in the middle of an incident — the policy already did, when everyone was calm.
The construction comes straight from Google's SRE book and applies cleanly to agent fleets. The only translation work is in the burn-rate alerting — agents have noisier success metrics than HTTP services, so the alerting thresholds tend to be coarser and the windows longer, but the policy shape is identical.
How the math actually works
If your SLO is "95% of tasks succeed over a rolling 30-day window," your error budget is 5% of tasks over that window. On 10,000 tasks per day, that's 500 failures/day × 30 = 15,000 failures of slack you have to spend. The questions the budget answers are:
- How fast are we burning it? "In the last hour we've already burned 30% of this month's budget" is a much louder signal than "the success rate dipped to 91%." Burn-rate framing turns a near-meaningless percentage delta into a clear time-to-exhaustion estimate.
- What happens when it's exhausted? The policy names the consequence. Most teams settle on a two-tier policy: soft-pause (reliability work goes on-deck, low-risk features still ship) when half the budget is gone well ahead of schedule, and hard-pause (no rollouts, eng focus is entirely on reliability) when the budget is near zero. The right pane in Stage 2 visualizes both states.
- When does it refill? At the end of the rolling window. A 30-day window refills continuously; a calendar-week window refills on Monday. The choice affects how punitive a bad day feels.
The math is not the hard part. The hard part is the policy.
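Writing the easy part down anyway keeps the burn-rate number unambiguous for everyone reading the dashboard. A sketch of the arithmetic, using the same illustrative numbers as the example above (95% target, 10,000 tasks/day, 30-day window); the hourly figures are made up:

```python
# SLO: 95% success over a rolling 30-day window.
target, window_days, tasks_per_day = 0.95, 30, 10_000

allowed_failure_rate = 1 - target                                   # 5%
budget_total = allowed_failure_rate * tasks_per_day * window_days   # 15,000 failures of slack

# Observed over the last hour (illustrative): 90 failures out of 450 tasks.
failures_last_hour, tasks_last_hour = 90, 450
observed_failure_rate = failures_last_hour / tasks_last_hour        # 20%

# Burn rate: how many times faster we are spending budget than it refills.
burn_rate = observed_failure_rate / allowed_failure_rate            # 4.0x

# Time to exhaustion if this rate holds, assuming 40% of the budget is already spent.
budget_remaining = budget_total * 0.60
hours_to_exhaustion = budget_remaining / failures_last_hour         # 100 hours, roughly 4 days

print(f"burning {burn_rate:.1f}x faster than refill; budget gone in ~{hours_to_exhaustion / 24:.1f} days")
```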
The policy is the contract with the team
A burn-rate policy says, in writing, before any incident: at burn rate X, we soft-pause. At burn rate Y, we hard-pause. The policy is signed by both engineering leadership and product leadership. This is what makes the SLO load-bearing — without the policy, every burn-rate alert becomes a negotiation between the on-call (who wants to slow down rollouts) and the product manager (who wants the feature shipped), and the negotiation always trends toward shipping.
Three real-world policy shapes:
- Soft-pause at 50% burn ahead of schedule. "If we've burned half this month's budget in the first quarter of the month, low-risk rollouts continue but high-risk rollouts (model changes, prompt rewrites) pause."
- Hard-pause at 100% burn. "If the budget is exhausted, no rollouts of any kind. Eng team is entirely on reliability work until the next window refills, or until the team explicitly renegotiates the SLO with product."
- Auto-revert on fast burn. "If a single deployment burns >20% of the monthly budget in the first 60 minutes after rollout, the canary harness from Module 6 auto-reverts without human intervention." This is the burn-rate policy fused with the rollback discipline.
The exact numbers are team-specific. The structure — pre-committed, signed, automatic where possible — is universal.
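Because the thresholds are pre-committed numbers, the policy state can be a pure function of the burn measurements, with no judgment call left in the loop. A minimal sketch; the thresholds and state names mirror the examples above and the Stage 2 verdicts, and are assumptions rather than a standard:

```python
def policy_state(budget_fraction_spent: float, window_fraction_elapsed: float) -> str:
    """Pre-committed error-budget policy: same inputs, same answer, at 3pm or 3am.

    budget_fraction_spent   -- share of this window's error budget already burned (0..1)
    window_fraction_elapsed -- share of the window that has elapsed (0..1)
    """
    if budget_fraction_spent >= 1.0:
        return "HARD-PAUSE"   # budget exhausted: no rollouts, reliability work only
    # Soft-pause trigger: at least half the budget gone, and burning ahead of schedule
    # (e.g. 50% of the budget spent in the first quarter of the month).
    if budget_fraction_spent >= 0.5 and budget_fraction_spent > 2 * window_fraction_elapsed:
        return "SOFT-PAUSE"   # low-risk rollouts continue; model/prompt changes pause
    return "SHIP FREELY"

assert policy_state(0.55, 0.25) == "SOFT-PAUSE"   # half the budget gone, a quarter into the window
assert policy_state(1.02, 0.60) == "HARD-PAUSE"
assert policy_state(0.20, 0.50) == "SHIP FREELY"
```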
Why the burn-rate framing beats raw-percentage alerts
Two alerts on the same SLO:
- Raw-percentage alert: "Success rate dropped below 95% for 5 minutes." → fires constantly, including when nothing is wrong (just statistical noise on a 95% target).
- Burn-rate alert: "The current failure rate is burning the 30-day budget 4× faster than it refills." → fires only when the budget is genuinely at risk, regardless of short-term noise.
The burn-rate framing is what makes high-percentile SLOs alertable at all. A 99% target with a raw-percentage alert is paging hell — every 5-minute statistical wobble crosses the threshold. The same 99% target with a burn-rate alert is silent until the burn is genuinely faster than the budget refills. Google's SRE workbook calls this multi-window, multi-burn-rate alerting; the same pattern is what production agent teams reach for once raw-percentage alerts have proven too noisy.
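A sketch of that alert condition, following the multi-window, multi-burn-rate shape: page only when a long window and a short window both agree the burn is fast, so a single bad five minutes cannot page on its own. The 14.4x and 6x thresholds are the SRE Workbook's commonly cited starting points, not numbers your team must adopt:

```python
def burn_rate(failures: int, total: int, slo_target: float) -> float:
    """How many times faster than allowed the budget is being spent in a window."""
    failure_rate = failures / total if total else 0.0
    return failure_rate / (1 - slo_target)

def should_page(counts: dict[str, tuple[int, int]], slo_target: float = 0.99) -> bool:
    """Multi-window, multi-burn-rate page condition.

    counts maps a trailing-window name to (failures, total) over that window.
    Page on fast burn (>=14.4x over both 1h and 5m) or sustained burn (>=6x over both 6h and 30m).
    """
    br = {w: burn_rate(f, t, slo_target) for w, (f, t) in counts.items()}
    fast = br["1h"] >= 14.4 and br["5m"] >= 14.4
    sustained = br["6h"] >= 6.0 and br["30m"] >= 6.0
    return fast or sustained

# A noisy blip: terrible 5 minutes, healthy hour -- no page.
print(should_page({"5m": (30, 100), "30m": (32, 600), "1h": (35, 1200), "6h": (60, 7000)}))   # False
# A real burn: both the 5m and 1h windows are hot -- page.
print(should_page({"5m": (30, 100), "30m": (170, 600), "1h": (300, 1200), "6h": (60, 7000)}))  # True
```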
What the policy does that on-call cannot
The error-budget policy is doing one specific job that no other mechanism in this track does: it removes the on-call's authority to negotiate. When the page fires for budget burn, the on-call does not get to ask "is this important enough to slow down the release?" — the policy answered that already. The on-call's job becomes purely tactical: stop the bleed, follow the runbook, open the postmortem.
This matters because the on-call is, structurally, the worst person to renegotiate priority. They are tired, under stress, and accountable for an outcome they did not cause. The policy protects them from being asked to make a strategic decision at 3am. It also protects the team from the opposite failure — the heroic on-call who silently absorbs incident after incident because the work seems "low priority" and never escalates the underlying reliability debt.
A team without a burn-rate policy is one Slack message away from learning the wrong lesson from any given incident. A team with a policy learns the same lesson every time: the budget is the contract, the policy is the enforcement, the on-call executes.
When the policy needs to bend
Three legitimate reasons to override the policy, all of which must be logged:
- The SLO itself is wrong. A consistently breached SLO with no user complaints means the SLO is too strict. Renegotiate it (with product, not in an incident). A consistently met SLO with full budget means it's too loose. Tighten it.
- A one-off catastrophic event. Your model provider had a 4-hour global outage. The budget will burn regardless of anything your team did. Document the override, do not let it become the precedent.
- A regulatory or safety-driven exception. A compliance deadline that requires a deploy regardless of budget state. Rare. Logged. Reviewed at the next retro.
Everything else, the policy stands. The temptation to override is the strongest signal that the policy is doing real work — if it never blocked anything, it wasn't a constraint.
Try it: Stage 2 of the right pane has two controls. The slider sets your SLO target — watch the error budget shrink as the target tightens (a 99% SLO has 1/10 the budget of a 90% SLO). The three scenario buttons (Healthy / Rising / On Fire) inject different production failure rates and animate the cumulative burn curve over a 30-day window. The verdict box names which policy state the burn rate triggers — SHIP FREELY (slow burn), SOFT-PAUSE (rising burn), HARD-PAUSE (budget gone in days). The dashed line at 100% is the exhaustion threshold; the dashed vertical bar (when present) is the day on which you ran out of budget. The teaching is that the policy decision is visible — you can point at the chart and the verdict together, instead of arguing about it.
On-Call Rotations
What does on-call look like for an agent product?
On-call is the structural decision of who is woken up when an agent SLO is being burned faster than the budget refills, and what they are expected to do in the first 30 minutes. The structure has the same shape as any production-services on-call: a named primary, a named secondary, an explicit ack window, a paging route, an escalation policy, a handoff ritual at shift change. The work each on-call does is shaped by the first-15 playbook from Module 7 — bleed first, root-cause second. This step is about the organizational scaffolding that makes the playbook executable at 3am.
The honest framing: most teams shipping their first production agent skip on-call until the first overnight incident teaches them they need it. By then the team has already absorbed one or two incidents' worth of avoidable user pain. The lesson generalizes: the operational scaffolding is cheap to set up before you need it and expensive to retrofit after you do.
The four-part contract
A rotation that works is a four-part contract, all of which need to be on disk somewhere your engineers can find:
- Ack window. "The primary acknowledges the page within 5 minutes during business hours, 15 minutes overnight." The number itself is team-dependent; what matters is that the number exists and the pager-duty system enforces it (auto-escalating to secondary on un-acked pages).
- Escalation path. Primary → secondary → eng manager → director, with a time on each transition. The path exists so the on-call does not have to decide who to escalate to under stress.
- Handoff ritual at shift change. A 10-minute sync at the start of each shift covering: open incidents, ongoing investigations, deploys that landed in the last shift, alerts that fired but were silenced. Without the handoff, a shift change becomes an information cliff and the new on-call inherits problems they can't see.
- A runbook on disk for every alert that fires. This is Step 4. An alert with no runbook is an alert that asks the on-call to invent a procedure under stress — which is exactly the wrong moment to invent anything.
Drop any one and the rotation degrades from operations to heroism. Heroism scales to one or two people for a few months and then breaks — those people burn out, leave, and take the operational knowledge with them. Operations scales indefinitely because it lives in artifacts (runbooks, rituals, paging configs), not in individuals.
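All four parts are boring enough to live in one versioned file next to the alert rules, which is where they belong: on disk, findable, and identical for every on-call. A sketch of what that artifact might contain; every name and number here is an illustrative assumption:

```python
# Hypothetical on-call contract, versioned alongside the alert definitions.
ONCALL_CONTRACT = {
    "rotation": "weekly, primary + secondary, handoff Monday 10:00",
    "ack_window_minutes": {"business_hours": 5, "overnight": 15},
    "escalation_path": [            # (who, minutes before the page auto-escalates onward)
        ("primary", 0),
        ("secondary", 15),
        ("eng_manager", 30),
        ("director", 60),
    ],
    "handoff_ritual": [             # the 10-minute sync at shift change covers exactly this
        "open incidents",
        "ongoing investigations",
        "deploys landed in the last shift",
        "alerts that fired but were silenced",
    ],
    "runbook_index": "https://wiki.example.com/runbooks",  # every pageable alert links into it
}
```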
What the on-call is not expected to do
The single most consequential clarity in any rotation: the on-call's job is to stop the bleed, not to find root cause. This is the same principle as the first-15 minutes playbook elevated to a structural commitment. Two consequences:
- Mitigation always precedes investigation. If the runbook says "roll back the prompt" before "diagnose why it failed," the on-call rolls back. Investigation is the next-day postmortem's job (see Module 7 step 4).
- Heroic debugging is a failure mode, not a virtue. An on-call who fights an incident for 4 hours alone instead of escalating is performing the team's reliability debt as personal effort. The escalation path exists for the same reason the retry budget in Module 1 exists — a hard ceiling on heroics so the system fails fast and routes to recovery rather than burning indefinitely.
Teams that internalize the "stop the bleed" framing recover faster and learn more, because the postmortem the next day is operating on a calm system rather than a still-burning one.
Sev levels: how an alert decides whether to page
Three pragmatic severity levels common in production agent teams:
- Sev1 — page anyone, any time. User-facing outage, budget-burn rate that will exhaust this month's budget within a day, security event. The page is loud and routes to the rotation.
- Sev2 — page during business hours, escalate to sev1 if unresolved by EOD. Significant SLO burn that is not yet catastrophic, important-but-bounded customer impact, regression caught by shadow mode that hasn't reached real users.
- Sev3 — alert, do not page. Routes to a Slack channel with a clear owner. Offline eval regression, slow cost creep, drift detected by Module 5's drift monitoring. Picked up during business hours during normal triage.
The discipline is treating the sev assignment as part of the alert definition, not as the on-call's judgment call. Every alert has a pre-assigned sev in the same artifact that defines the alert rule. Otherwise the on-call ends up triaging severity instead of triaging the underlying issue.
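Concretely, that means the alert rule, its pre-assigned sev, its paging route, and its runbook link are fields of the same record. A sketch of that artifact; the field names, conditions, and routes are illustrative assumptions, not any particular vendor's schema:

```python
# Hypothetical alert definitions: severity and routing are decided here, not by the on-call.
ALERTS = [
    {
        "name": "error_budget_fast_burn",
        "condition": "burn_rate_1h >= 14.4 and burn_rate_5m >= 14.4",
        "sev": 1,                               # page anyone, any time
        "route": "pager:agent-oncall",
        "runbook": "runbooks/budget-burn.md",
    },
    {
        "name": "tool_5xx_storm_charge_create",
        "condition": "tool_error_rate('charge.create', '15m') > 0.10",
        "sev": 2,                               # page in business hours; sev1 if unresolved by EOD
        "route": "pager:agent-oncall-business-hours",
        "runbook": "runbooks/tool-5xx-storm.md",
    },
    {
        "name": "offline_eval_regression",
        "condition": "golden_suite_pass_rate < 0.90",
        "sev": 3,                               # alert, do not page; triaged in business hours
        "route": "slack:#agent-reliability",
        "runbook": "runbooks/eval-regression.md",
    },
]
```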
What rotations look like at different team sizes
The structure of the rotation should match the team size, not aspire to a structure too large for the team to fill:
- 2–4 engineers. Daily rotation, primary only, no formal secondary. The primary's manager is the de facto escalation. This is fragile but honest — there are not enough people for a real rotation, so don't pretend.
- 5–10 engineers. Weekly rotation, primary + secondary, formal escalation to a tech lead or staff engineer. The classic small-team SRE pattern.
- 10+ engineers. Weekly rotation per service or per surface area (an agent fleet might split into "model-serving on-call," "tooling on-call," "eval on-call"). At this scale, the cross-team handoff ritual matters as much as the within-team one.
The wrong move at any scale is fake rotation — a rotation that exists in a config file but is not actually staffed or honored. A fake rotation is worse than no rotation because it gives leadership a false sense of coverage. Better to admit "we have one engineer on point and it's their manager when they sleep" than to pretend you have a 24/7 follow-the-sun rotation that you don't.
Compensation, hour limits, and the human cost
Production on-call has an irreducible human cost. The mature teams price it in:
- Explicit compensation for being on-call (a flat per-shift rate, time-off-in-lieu, or both). Treating on-call as "part of the job" without acknowledgment burns people out and disincentivizes signing up.
- Hour limits. No engineer covers more than N weeks out of M (commonly 1-in-4 or 1-in-6). On-call exhaustion is real and accumulates; rotation depth is the structural defense.
- Recovery after major incidents. A sev1 that ate the on-call's sleep earns them recovery time the next day. This is not generosity; it is risk management — a tired on-call is a more dangerous on-call.
These aren't soft skills; they are operational decisions that compound. A rotation that doesn't price in the human cost has a half-life of months before it starts shedding engineers.
Try it: Stage 3 of the right pane runs three real alert classes — a sev1 budget-burn, a sev2 tool-error storm, a sev3 eval regression. For each, flip between "with structure" (runbook + rotation + ack discipline) and "without structure" (slack-only, no runbook). The page fires either way. The timeline shows where the two paths diverge: with structure, the on-call follows the script and reaches MITIGATED or RESOLVED in minutes; without it, the on-call is investigating under stress, the secondary gets dragged in late, and the alert escalates. The teaching is that the structural decisions — runbook on disk, ack window, escalation path — are MTTR levers, not paperwork. The bottom callout names the four parts of the contract; if any one is missing, your rotation degrades to the "without structure" column.
Runbooks That Actually Work
What is a runbook?
A runbook is a step-by-step procedure for handling one specific class of incident — "budget-burn alert," "tool 5xx storm on charge.create," "offline eval suite regressing" — where every step is concrete enough that the on-call can execute it without inventing anything under stress. A vague runbook ("check the dashboards, see if anything looks off, roll back if needed") is decoration: it provides no leverage and makes the on-call do all the cognitive work anyway. A concrete runbook ("open Grafana panel /d/agent-budget-burn, read the 60-minute burn rate, if >20% open the deploys panel and find the prompt change in the last 4 hours, run lav-cli rollback prompt-version --to=previous, wait 5 minutes, recheck") is leverage: it converts a confusing 3am incident into a script to follow. The MTTR (Mean Time To Recovery — wall-clock minutes from page-fired to mitigated) delta between the two is the entire teaching of this step.
Runbooks are the operational counterpart to the playbooks from Module 7's first-15-minutes step. A playbook says how to approach an incident generally (bleed-stop-first vs investigate-first). A runbook says what to type into which terminal for one specific alert. You need both; they answer different questions.
The three fidelity levels
In practice runbooks land at one of three fidelities, with sharply different operational value:
- Vague. "Check the dashboards. If something looks wrong, escalate." This is what gets written when someone is told to "document the on-call procedure" and they have one afternoon. It looks like a runbook but it is just an instruction to do your job with extra steps.
- Concrete. Every step names a specific dashboard URL, query, command, or threshold. The on-call follows the script. No invention required.
- Concrete + drilled. Same concrete script, but every step has been executed by every primary at least once in a drill (see Module 7 step 5). The drilling is what catches stale dashboard links, missing permissions, and steps that have drifted out of sync with the system.
The MTTR impact of each level is real and roughly multiplicative: vague → concrete typically halves MTTR; concrete → drilled typically halves it again, in the experience of teams that have published their numbers. The values in the right pane are illustrative — your team's ratios will differ — but the direction (vague is the worst, drilled is the best, by meaningful margins) is consistent across SRE-mature shops.
What a concrete runbook step actually looks like
A useful litmus test: a runbook step is concrete enough if a brand-new on-call (joined the team last week) can execute it without asking anyone. That requires every step to name:
- The exact artifact to open. Not "the dashboard" but https://grafana.lav.com/d/agent-budget-burn. Links rot — see Stage 4's point about drilling — but linkless steps rot worse.
- The exact query, command, or button. Not "check the burn rate" but "the panel labeled 'burn over last 60 min' — read the number in the top-right."
- The exact threshold for the next branch. Not "if it's bad" but "if the number is greater than 20%."
- The exact command to mitigate. Not "roll back the prompt" but lav-cli rollback prompt-version --to=previous. The command works in the on-call's terminal; they do not have to figure out which CLI tool to use.
Every step that violates one of these specificities is a step that asks the on-call to invent under stress. Stress reduces working memory; invention requires working memory. Concrete runbooks free working memory for the things that genuinely cannot be scripted — the judgment calls about whether to escalate, the customer communication, the moment of pattern-matching that says "wait, this incident looks like that one from March."
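One way to make the litmus test mechanical is to store the runbook as structured steps, where each step has to name an artifact, a reading, a threshold, or a command; a step that cannot fill any of those fields is too vague by construction. A sketch of the budget-burn runbook from earlier in this step in that shape (the second dashboard URL is a made-up example):

```python
# Hypothetical budget-burn runbook stored as data, so vagueness shows up as a missing field.
BUDGET_BURN_RUNBOOK = [
    {"step": 1, "open": "https://grafana.lav.com/d/agent-budget-burn",
     "read": "panel 'burn over last 60 min', number in the top-right"},
    {"step": 2, "branch": "if the number is greater than 20%, continue; otherwise downgrade and stop"},
    {"step": 3, "open": "https://grafana.lav.com/d/agent-deploys",
     "read": "prompt or model changes in the last 4 hours"},
    {"step": 4, "run": "lav-cli rollback prompt-version --to=previous"},
    {"step": 5, "wait_minutes": 5,
     "recheck": "panel 'burn over last 60 min' is back under 20%"},
    {"step": 6, "escalate": "secondary, if the burn is still above 20% after the recheck"},
]
```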
Why drilling is the runbook's single biggest lever
A concrete-but-undrilled runbook decays predictably. Three failure modes that drilling catches:
- Stale links. The Grafana dashboard URL changed when the team migrated to a new monitoring vendor. Step 1 of the runbook now points at a 404. The on-call discovers this at 3am.
- Missing permissions. Step 3 says run lav-cli rollback prompt-version. The new on-call has not been added to the IAM group that grants rollback permission. They discover this at 3am.
- Steps that subtly drifted. Step 2 says "look at the panel labeled X." The panel was renamed to Y six weeks ago. The on-call now hunts for a panel that doesn't exist, at 3am.
A drill — running the runbook end-to-end against a staging incident, or replaying a past production incident with the runbook in hand — catches all three. The drill is not a quality improvement to the runbook; the drill is part of the runbook. A runbook that hasn't been executed in 6 months is not in production; it is in someone's wiki, slowly drifting away from the system it claims to operate.
This is the same principle as the drill schedule from Module 7 step 5: the cadence of drilling sets the reliability ceiling for your on-call. Quarterly drills on the top-5 runbooks is a reasonable starting cadence; monthly is better; ad hoc is not drilling.
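Part of the drill can even be automated: a scheduled job that walks every runbook step and fails loudly on the decay modes above (dead links, missing CLI tools) before the on-call meets them at 3am. A minimal sketch, assuming the structured runbook format sketched in the previous callout; it complements a human drill rather than replacing one, and it cannot catch missing IAM permissions:

```python
import shutil
import urllib.request

def check_runbook(steps: list[dict]) -> list[str]:
    """Return the problems a human drill would otherwise discover at 3am."""
    problems = []
    for step in steps:
        url = step.get("open")
        if url:
            try:
                # Stale-link check: does the dashboard still answer?
                with urllib.request.urlopen(url, timeout=5) as resp:
                    if resp.status >= 400:
                        problems.append(f"step {step['step']}: {url} returned {resp.status}")
            except Exception as exc:
                problems.append(f"step {step['step']}: {url} unreachable ({exc})")
        cmd = step.get("run")
        if cmd and shutil.which(cmd.split()[0]) is None:
            # Missing-tool check: is the CLI on the on-call's PATH at all?
            problems.append(f"step {step['step']}: '{cmd.split()[0]}' not found on PATH")
    return problems

# Run monthly or quarterly in CI; a non-empty result fails the drill job.
# failures = check_runbook(BUDGET_BURN_RUNBOOK)
```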
Which runbooks to write first
A team starting from zero cannot write runbooks for every possible incident. The honest prioritization:
- The top-3 alerts by frequency. Whatever fires most often. Those are the runbooks where MTTR savings compound fastest.
- The sev1s that have happened in the last 6 months. Recent sev1s are the incidents your team has the freshest reasoning about. Capture the procedure before the memory decays. (This is also the postmortem follow-up output from Module 7 — most postmortems should produce a new or updated runbook.)
- The catastrophic-but-rare incidents. Model provider outage, full eval-suite regression, data-pipeline corruption. These rarely fire, but when they do the team needs the runbook because nobody has the procedure memorized.
Everything else can wait. Writing 40 runbooks for incidents that may never happen is a procrastination ritual; writing 5 runbooks for the incidents that do happen is operations.
The runbook–postmortem loop
Runbooks and postmortems form a loop. A postmortem (from Module 7 step 4) is the structured record of what happened, what was tried, what worked, what didn't. Most postmortems should end with one of three follow-up actions:
- New runbook. This class of incident didn't have one; now it does.
- Updated runbook. The runbook existed but had a stale link, a missing branch, or a step that didn't actually work. Patch it.
- Updated drill schedule. This class of incident is now important enough to drill regularly.
A postmortem that produces no runbook change is a postmortem that learned nothing operational. The loop is what compounds the team's on-call quality over time — each incident leaves the next on-call slightly better-equipped than the one before. Without the loop, the team relives the same incident every quarter.
What runbooks cannot do
A useful negative: runbooks are not a substitute for judgment in three classes of incident:
- First-of-its-kind incidents. The runbook cannot exist for an incident you've never seen. The first such incident is handled by the first-15 playbook from Module 7 — general principles, not specific scripts. The postmortem the next day is what creates the runbook for next time.
- Incidents that span multiple alerts. A budget-burn that is also a tool 5xx storm and a model provider outage is three runbooks at once. The on-call uses judgment to pick which one is the load-bearing fix first. Runbooks compose; they do not pre-compose.
- Customer communication. What to tell users, when to status-page, when to escalate to comms — this is judgment, not script. The runbook can list who decides but cannot decide for them.
Runbooks are leverage for the cases they cover. The judgment cases are why you still need senior on-calls and a clear escalation path.
Try it: Stage 4 of the right pane runs the same incident class (a budget-burn page) through three runbook fidelities. Toggle between VAGUE, CONCRETE, and CONCRETE + DRILLED and read the actual runbook steps the on-call would see. The vague version is generic instructions to do your job; the concrete version names specific dashboard panels, specific commands, specific thresholds; the drilled version is the same concrete script but every step is annotated with the drill history that validates it. The MTTR comparison bars at the bottom make the ratio visible — same incident, ~3.8× MTTR difference between vague and drilled. The bottom callout names the single biggest lever: drilling. A concrete-but-undrilled runbook decays predictably; a drilled runbook is the operating contract that holds.
From Foundations to Engineering
What does the full stack look like, from Foundations to Operations?
The full agent-engineering stack is three layers, each answering a different question and producing a different artifact: Foundations asks "does the agent work?" and produces a loop that handles a representative task; Engineering asks "does the agent ship?" and produces the harness, observability, guardrails, eval pipelines, rollout discipline, incident response, and topology that survive contact with production; Operations asks "does the agent stay up?" and produces SLOs, error budgets, on-call rotations, runbooks, and drills. Each layer assumes the one below it. None of the three is optional for a real production agent — but they ship in order, and most teams that fail in production fail because they shipped the bottom layer without the middle, or the middle without the top.
This capstone closes a curriculum arc that began with "What is an agent loop?" in Foundations Module 1 and ends here. The stack is not a metaphor — it is a literal stack of dependencies. The structure of this step is to name each layer's output, name the gates between layers, and give you the readiness checklist that tells you which layer your team is actually on.
Layer 1 — Foundations
The AI Agents — Foundations track gave you the agent itself: a loop that calls tools, retrieves from a corpus, plans, reflects, evals on golden cases, defends against the lethal trifecta, and picks between three topology designs at the capstone. The output is an agent that works on a representative task.
This is necessary and not sufficient. An agent that works in a demo is the premise of production-agent engineering, not the outcome. The teams that shipped a Foundations-grade agent and stopped are the teams that learned, three months later, that production traffic surfaces failure modes their golden eval set didn't cover and their loop wasn't durable enough to survive.
Layer 2 — Engineering (this track, Modules 1–8)
The eight modules before this one converted "an agent that works" into "an agent that ships":
- Module 1 — Production Harness Architecture: durable execution, idempotency, checkpointing, retry budgets. The agent survives process kills, deploys, and network failures without corrupting state.
- Module 2 — Observability for Agents: span-per-tick traces, what to log, replay, metrics, alerting. The agent is visible; failures are reconstructable.
- Module 3 — Layered Guardrails: defense-in-depth, input filters, output filters, policy, failure behavior. The agent has a structured response to malformed or hostile traffic.
- Module 4 — Cost & Latency Engineering: cost profile, prompt cache, result cache, parallel tools, batching. The agent fits the unit economics it has to.
- Module 5 — Production Evals & Shadow Mode: online vs offline, shadow mode, A/B harness, drift, rollout gates. The agent improves with evidence, not vibes.
- Module 6 — Deployment & Rollout: prompt-as-code, canary, rolling release, version pinning, rollback. The agent ships changes without breaking the existing surface.
- Module 7 — Incident Handling: first-15-minutes, trace replay, root-cause, postmortem, drills. The agent fails recoverably.
- Module 8 — Agent Teams: when teams beat a single agent, supervisor/worker, parallel-and-vote, handoffs, coordination tax. The agent scales topologically when one well-instrumented agent is not enough.
The output of all eight is an agent that survives contact with production traffic. This is the engineering bar. It is what you have to ship before you can credibly run an on-call rotation against the agent — because if the agent doesn't have observability, the on-call has nothing to look at; if it doesn't have rollback discipline, mitigation has nothing to roll back to; if it doesn't have an eval pipeline, the postmortem has nothing to validate against.
Layer 3 — Operations (this capstone)
The four steps before this one built the operating model:
- Step 1 — SLOs: the contract with users, in the language of telemetry.
- Step 2 — Error Budgets: the contract with the team, in the language of burn rate and rollout policy.
- Step 3 — On-Call: the four-part rotation contract that makes the SLO defensible 24/7.
- Step 4 — Runbooks: the leverage that converts an incident from invention to script.
The output is an agent that stays up. This is the operational bar. It is what you have to ship before you can credibly take the page in the middle of the night without losing a senior engineer to attrition every quarter. The right pane has a readiness checklist that tests whether all five artifacts are actually live on your team — not aspirational, but actually shipped.
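The checklist verdict is itself just a threshold function, and writing it down makes the gate explicit rather than a feeling. A sketch matching the Stage 5 thresholds in the right pane; the five items are this capstone's artifacts:

```python
READINESS_CHECKLIST = {
    "written SLOs (all four parts, on disk)": False,
    "error-budget policy (signed by eng and product)": False,
    "live on-call rotation (staffed, ack windows enforced)": False,
    "runbooks for the top-3 alerts": False,
    "postmortem ritual (every sev1/sev2 produces one)": False,
}

def readiness_verdict(checklist: dict[str, bool]) -> str:
    shipped = sum(checklist.values())        # count only artifacts that actually shipped
    if shipped <= 2:
        return "NOT YET: build the gates before scaling traffic"
    if shipped <= 4:
        return "SOFT-LAUNCH ONLY: internal users, no unattended overnight pages"
    return "PRODUCTION-READY"

print(readiness_verdict(READINESS_CHECKLIST))   # NOT YET, until the boxes are honestly true
```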
The gates between layers — and what happens if you skip them
The layers are stackable but not skippable. Three failure modes from teams that tried:
- Skipping Engineering, going Foundations → Ops. The team has SLOs and an on-call rotation but no observability, no rollback discipline, no eval pipeline. The on-call cannot diagnose, cannot mitigate, cannot learn. Every incident is a stress event with no leverage. Burnout within months.
- Skipping Ops, staying Foundations + Engineering. The team has a beautifully engineered agent and no operational contract. Every incident is a re-negotiation: is this important enough to roll back? Should we page? Whose responsibility is this? The agent works until the first overnight incident, at which point the team learns they needed Ops two months ago.
- Skipping a sub-module within Engineering. Most commonly, observability (Module 2) or evals (Module 5). The agent ships, runs for a while, drifts, and the team has no instrumentation to detect the drift. By the time customer complaints surface the problem, the eval suite to validate a fix doesn't exist either. The fix takes 6× longer than it should because the engineering scaffolding to support it wasn't built.
The gates are not bureaucratic — they are dependency-driven. Each layer's artifacts are inputs to the next layer's work. Skipping a layer means doing the next layer's work without the inputs it needs.
The honest framing of where most teams are
A real-world taxonomy of where teams shipping production agents in early 2026 actually sit:
- The majority are Foundations-grade with partial Engineering. They have an agent loop, some observability, ad-hoc evals, no formal rollout discipline, no SLOs. They are one incident from learning what Module 7 teaches.
- A meaningful minority are Foundations + most of Engineering, working on Ops. They have observability, evals, and rollback discipline; they are writing their first SLO and policy. This is the audience this capstone is built for.
- A small set are running the full stack. SLOs, error budgets, on-call rotations, runbooks, drills. They look like SRE-mature shops because they are SRE-mature shops — they took the SRE playbook, translated it to the agent setting, and shipped.
Your job over the next 12 months is to move up the taxonomy. The order matters: shore up Engineering before adding Ops, then add Ops layer by layer (SLOs → error budgets → on-call → runbooks → drills). Each shipped layer compounds the value of the ones below it.
What this track ends on
The Agent Engineering track is the operating manual for a Foundations-grade agent in production. Nine modules; eight on the engineering itself; one (this one) on the operating model. The bridge it builds is the bridge from "we have an agent that works in a demo" to "we run an agent fleet as a real product, with real users, real on-call, real SLOs."
There is no clean next track. Production-agent operations does not have a finish line — the operational frontier is moving constantly as models evolve, traffic grows, and your team learns what your specific incidents look like. The point of this capstone is to leave you with the shape of "what does production reliability look like for an agent fleet" that you can deepen against your own incidents, your own SLOs, your own customer base. The shape is portable; the specifics will be yours.
The most useful next move, for almost every team finishing this track, is to write the first SLO. Not the perfect SLO — the first one. The four parts, the policy, the gate. Ship it, watch it burn for a quarter, renegotiate. The operational discipline this track is teaching does not get installed all at once; it gets installed one SLO, one runbook, one drill at a time, and compounds from there.
Further Reading
- Google — SRE Book (free) — the canonical reference for SLOs, error budgets, on-call, and postmortems. Long but skimmable; the SLO and error-budget chapters are the load-bearing ones.
- Google — SRE Workbook (free) — the practical companion. The multi-window, multi-burn-rate alerting chapter is the production pattern for SLO-based alerting.
- Charity Majors — The Sociotechnical Path to High-Performing Teams — on-call sustainability and the human cost of rotation, applied beyond just SRE.
- Anthropic — How we built our multi-agent research system — operational discussion of running an ambitious agent product at production scale.
- OpenAI — Production best practices — vendor-side guidance on running LLM-based products against SLOs.
- Increment — On Being On-Call — full issue dedicated to on-call as a discipline; many of the patterns translate cleanly to agent products.
Try it: Stage 5 of the right pane shows the three-layer stack side-by-side (Foundations / Engineering / Ops) and below it a five-item readiness checklist — written SLOs, error-budget policy, live on-call rotation, runbooks for the top-3 alerts, postmortem ritual. Tick each box only if your team has actually shipped that artifact (not just talked about it). The verdict updates as you tick: 0–2 boxes is NOT YET (build the gates before scaling traffic), 3–4 boxes is SOFT-LAUNCH ONLY (enough for internal users, not enough for unattended overnight pages), all 5 boxes is PRODUCTION-READY. The teaching is that the readiness gate is structural, not aspirational — every missing box is a category of incident your team is structurally unable to handle. The closing callout names what you finish the track with: an agent that works, the engineering to ship it, and the operating model to run it. That is the whole stack.