Deployment & Rollout for AI Agents
The Prompt-as-Code Problem
What is prompt-as-code?
Prompt-as-code is the discipline of treating an agent's prompt, model snapshot ID, and tool schemas as a single versioned artifact — checked into the same repo as the code, reviewed in PRs, tested by the same CI, and rolled out (and back) by the same deploy primitive. The opposite is prompt-as-config: a textbox in a vendor console where someone hot-edits the live prompt and hopes the on-call team is awake. The same engineering discipline you apply to a function body — version control, peer review, automated tests, deploy gates — is the discipline this module is about applying to the prompt.
Module 5 covered the evaluation substrate — what to measure and what gates the measurements feed. This module covers the deployment substrate that executes the verdict: the artifact you ship, how you ship it, how you pin it, and how you take it back. The two modules form a pair; eval-driven rollout from Module 5 is inert without a deployment system that can actually move the version atomically and revert it on demand.
Why the bundle is the unit
A working agent's behavior is not determined by the prompt alone, nor the model alone, nor the tool schemas alone. It is determined by the combination of all three. Three concrete failure modes that follow from forgetting this:
- Prompt edited, schemas not updated. The prompt now instructs the model to call `refund_order(order_id, reason, amount, currency)` but the tool schema still only takes `(order_id, reason, amount)`. Every refund call now fails validation at the tool boundary. Caught in seconds if the agent has a tool-call success metric; caught in days if it doesn't.
- Model snapshot rolled forward, prompt unchanged. The provider ships a new snapshot under the same `claude-opus-4-7` alias. The new snapshot is 30% more verbose and re-introduces a hedging tic the prompt was specifically engineered to suppress. Two prompts were authored for two different models; only one of them was on the wire.
- Tool schema migrated, prompt and model unchanged. The schema for `lookup_account` changed the field name from `account_id` to `accountId`. The model still emits `account_id` because that's what its system prompt and few-shot examples say. Every lookup fails until someone notices the drift.
In each case the symptom is "agent broke for no reason." The root cause is that one of the three components changed independently of the others. The bundle is the unit. The discipline is to make that fact unforgeable.
What sits in a bundle
A minimal production agent deploy bundle has three load-bearing parts plus a small amount of metadata:
```
bundle: v1.5
├── prompt:   prompts/refund-agent/v5.txt   (verbatim text + few-shot examples)
├── model:    claude-opus-4-7-20260108      (pinned snapshot, not "latest")
├── tools:    schemas/refund-agent-v3.yaml  (JSON Schema for each tool)
└── metadata:
    ├── author:   alex@team
    ├── commit:   9af3c21
    ├── eval_set: golden-v3
    └── reviewer: jordan@team
```
In storage this is a directory in the repo (most teams) or a versioned artifact in an object store keyed by commit SHA (some teams). What matters is that it has one version number and that a deploy moves the whole directory atomically. Individual files inside the bundle do not have independent version numbers — that is the whole point. Versioning each file separately is the world where they drift apart.
The metadata is not decoration. Author and reviewer let you ask "who shipped this" without grep. The commit SHA lets you reproduce the build months later. The eval set name lets you re-run the same bench against a future candidate to compare apples to apples.
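To make the single-version property concrete, here is a minimal sketch of a bundle manifest as code. The field names and validation rules are illustrative, not a standard format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bundle:
    """One immutable deploy unit: prompt + model snapshot + tool schemas + metadata."""
    version: str         # e.g. "v1.5"; the only version number that exists
    prompt_path: str     # e.g. "prompts/refund-agent/v5.txt"
    model_snapshot: str  # e.g. "claude-opus-4-7-20260108"; a pinned snapshot, never an alias
    tools_path: str      # e.g. "schemas/refund-agent-v3.yaml"
    commit: str          # repo SHA the bundle was built from
    author: str
    reviewer: str
    eval_set: str        # golden set this bundle was gated against

def validate(bundle: Bundle) -> None:
    """Reject bundles that are not fully pinned or are missing load-bearing metadata."""
    suffix = bundle.model_snapshot.rsplit("-", 1)[-1]
    if not (len(suffix) == 8 and suffix.isdigit()):
        raise ValueError(f"model must be a date-suffixed snapshot, got {bundle.model_snapshot!r}")
    for name in ("prompt_path", "tools_path", "commit", "author", "reviewer", "eval_set"):
        if not getattr(bundle, name):
            raise ValueError(f"bundle {bundle.version} is missing {name}")
```

The deploy system only ever promotes or reverts a whole `Bundle` value; there is no code path that edits one field of a live bundle in place.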
What hot-edits actually cost
The argument for hot-editing prompts is always the same: it's fast, the change is small, and the team needs to ship a fix now. The cost lands later, in three places:
- Investigation overhead. When the metric drops at 3pm, the team's first question is "what changed?" Without a diff, the answer is a Slack thread. With a diff, the answer is a single `git log` line.
- Rollback is impossible. "Roll back the prompt" requires knowing what the prompt was. A hot-edit leaves no copy of the previous version — the new prompt overwrote the old one in the textbox. The only path back is finding someone who remembers the exact wording.
- Cross-environment drift. The prompt that's running in production is now different from the prompt in staging, dev, and the repo. Every subsequent change has to merge three or four different "trunks" of the same prompt manually.
The price of being able to hot-edit a prompt for an hour of urgency is usually paid as multiple hours of investigation, plus the slow corrosive cost of "we don't really know what's running in prod." Most teams who hot-edit do not realize this is the deal they made until the second or third incident.
The CI pipeline a bundle goes through
A clean prompt-as-code pipeline looks identical to a clean code pipeline because there is no reason for it to look different:
- PR opens. Author commits the bundle change. The PR template asks: what changed, what bench is expected to move, what is the success threshold? (This is the pre-registration moment from Module 5's Step 5.)
- CI fires. Tool schemas validate. Prompts pass any structural lints (length limits, presence of required sections, no forbidden phrases). Determinism tests on tools that should be deterministic.
- Offline evals run. The bundle runs against the pinned golden set; pass/fail by pre-registered thresholds.
- Reviewer signs. A human reads the diff. Prompt diffs are read as carefully as code diffs — this is the unsung discipline most teams underweight.
- Merge. The bundle is built and stored as an immutable artifact keyed by commit SHA.
- Rollout. Canary (Step 2) and rolling release (Step 3) handle exposure.
What you should not see in this pipeline: a console-edit step, a "deploy prompt only" button, or an authentication boundary that lets one team member change behavior on the live fleet without leaving a trail.
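A sketch of what that gate can look like as one script. The file paths mirror the bundle example above, `run_offline_eval` is a stub for whatever eval harness Module 5 gave you, and the specific lints (length budget, required section) are examples rather than requirements:

```python
import sys
import yaml                                   # pip install pyyaml jsonschema
from jsonschema import Draft202012Validator

EVAL_PASS_THRESHOLD = 0.95  # pre-registered in the PR alongside the bundle change

def check_tool_schemas(path: str) -> None:
    """Every tool schema in the bundle must itself be a valid JSON Schema."""
    tools = yaml.safe_load(open(path))
    for schema in tools.values():
        Draft202012Validator.check_schema(schema)   # raises if a schema is malformed

def check_prompt_lints(path: str, max_chars: int = 20_000) -> None:
    """Structural lints on the prompt text; these two are examples only."""
    text = open(path).read()
    if len(text) > max_chars:
        raise ValueError("prompt exceeds length budget")
    if "Tools" not in text:
        raise ValueError("prompt is missing its tools section")

def run_offline_eval(golden_set: str) -> float:
    """Stub: call your eval harness here and return the pass rate on the golden set."""
    raise NotImplementedError

def main() -> int:
    check_tool_schemas("schemas/refund-agent-v3.yaml")
    check_prompt_lints("prompts/refund-agent/v5.txt")
    pass_rate = run_offline_eval(golden_set="golden-v3")
    if pass_rate < EVAL_PASS_THRESHOLD:
        print(f"eval gate FAILED: {pass_rate:.1%} < {EVAL_PASS_THRESHOLD:.0%}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```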
When hot-edit is the right call
Not never — but rarely, and with a specific shape. The honest exceptions:
- Demo / sandbox environments. Hot-editing is the whole point of a playground. Don't put it on the same control plane as production.
- Break-glass during an active incident. When the bleed-stop is "change the system prompt to add a refusal," a hot-edit may save real money. The discipline is to immediately follow with a PR that codifies the change, and to have a separate audit log that captures who hot-edited what during the incident. (Module 7 covers the playbook.)
- A/B prompt experiments behind an experiment flag. If your A/B harness reads the prompt variant from a flag service, the flag service is the unit being versioned — and it has its own promotion pipeline.
In each case, hot-edit is the temporary exception, not the default workflow. A team whose mental model is "hot-edit by default, PR sometimes" is the team that finds itself, six months in, with no idea what's running where.
Try it: Stage 1 of the right pane is two parallel worlds. Left (hot-edit console): three free-floating sticky notes for prompt, model, tools — each with an independent "Edit" button. Click any one and the version pill flips; click a second and the panel tilts the notes and turns the badge to ⚠ drift. Right (atomic bundle): one wrapped artifact at v4. Click "Open clean PR" to start a new bundle authoring flow — the pipeline stepper lights Author → PR; click "Run eval gate" then "Deploy bundle" to advance to Eval gate → Deploy. Now hit "Inject regression (both)": the left silently bumps its prompt and ticks an incident counter ~2 seconds later; the right opens a PR whose eval gate fails (DANGER), so the bundle stays at v4 and the "Caught at PR" counter ticks up. The contrast is the teaching: the same regression takes minutes to surface on the left and never reaches production on the right.
Canary Rollouts
What is a canary rollout for an agent?
A canary rollout routes a small percentage of live traffic to the new bundle version, watches a short list of operational health metrics — error rate, latency, eval pass rate — and increases the share only when those metrics stay inside their pre-registered bounds. If the canary's metrics breach the bound, the rollout halts (or auto-rolls back) before the regression is visible to most users. Canary is the deploy-time safety net; it sits between the offline gate that ships at PR review and the user-experience gate that A/B testing measures over days.
The name is from coal mines — a small bird sent into the tunnel ahead of the workers, sensitive enough that it dies before the methane builds up to a level that would kill a human. In a software canary, the small slice of traffic plays the bird's role: it experiences the new version first and fails first, so the rest of the fleet doesn't have to. The whole pattern depends on (a) the canary metric being faster than the user complaint, and (b) the rollback action being reflexively cheap. Both are engineering decisions, not given truths.
Canary vs A/B harness — same plumbing, different signal
A common confusion: isn't a canary rollout just an A/B test? Mechanically the plumbing is similar — a traffic splitter routes some fraction to v1.5 and the rest to v1.4. But the signal they read and the decision they make are different.
| Property | Canary | A/B harness |
|---|---|---|
| What it watches | Operational health: error rate, p95 latency, eval pass rate | User-facing outcome: completion rate, conversion, escalation rate |
| Signal latency | Minutes | Hours to days |
| Sample sizes | Hundreds to low thousands | Thousands to tens of thousands per arm |
| What halts ramp | Threshold breach on any guardrail | Pre-registered primary metric regression |
| What it gates | The deploy itself | The post-deploy ramp from 5% → 100% |
| Failure response | Auto-rollback in seconds | Hold or rollback after a human decision |
A clean rollout uses both, in this order: canary clears the deploy (it's safe to expose this version at all), then A/B clears the ramp (users actually like or at least don't hate this version). Teams that try to do both with one gate end up with either a canary that's too slow (it waits for user metrics) or an A/B that's too noisy (it tries to read user signal from a 1% slice).
What the canary should measure
The mistake teams make is measuring everything they can think of. The good canary watches a small, fast, load-bearing list:
- Tool-call success rate. Did the agent successfully call its tools? An invalid argument from a prompt/schema drift fails here within seconds.
- Refusal rate. Did the agent refuse far more (or far fewer) requests than baseline? A safety regression usually shows here first.
- p50 / p95 latency. Did response time move? Catches model-snapshot regressions, retry-loop bugs, slow tool downstream.
- Online eval pass rate. The LLM-as-judge from Module 5 Step 1 judging live traffic. Slower to react than the others, but catches behavior that "succeeded" technically but went sideways semantically.
- Cost per task. Did each task suddenly cost 2× what it cost an hour ago? Either tool retries are out of control or the model is being more verbose.
Each guardrail has a pre-registered threshold (e.g. error rate < 2% absolute, p95 latency ≤ control + 200ms). The thresholds are set in PRs, not Slack threads.
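One way to make that literal is to keep the guardrails as data in the repo and evaluate them with a small function. The metric names and numbers below simply mirror the examples above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guardrail:
    metric: str
    threshold: float
    direction: str   # "max": breach when the value exceeds; "min": breach when it drops below

GUARDRAILS = [
    Guardrail("tool_call_error_rate", 0.02, "max"),    # error rate < 2% absolute
    Guardrail("p95_latency_delta_ms", 200.0, "max"),   # p95 <= control + 200 ms
    Guardrail("online_eval_pass_rate", 0.95, "min"),   # eval pass rate >= 95%
    Guardrail("refusal_rate_delta", 0.05, "max"),      # refusals within 5 points of baseline
]

def breached(g: Guardrail, value: float) -> bool:
    return value > g.threshold if g.direction == "max" else value < g.threshold

def check(metrics: dict) -> list:
    """Names of every guardrail the canary is currently violating; empty list means healthy."""
    return [g.metric for g in GUARDRAILS
            if g.metric in metrics and breached(g, metrics[g.metric])]
```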
The ramp schedule
A canonical canary ramp for an agent:
```
1% ──► 5% ──► 25% ──► 50% ──► 100%
 |      |       |        |       |
5 min  30 min  2 hours  4 hours  full
```
Each step is its own gate. The early steps are smoke tests — does the candidate produce any responses, does it not crash, does it not refuse 100% of inputs? The later steps test under load — does it hold up at 25% of fleet traffic without contention. Exact durations and step counts vary by team and depend on traffic volume; the qualitative shape — small early steps, longer later steps, each its own gate — is the universal part.
A canary that lives at 1% for a week and then jumps to 100% is not actually doing canary; it's doing a delayed big-bang. The intermediate steps exist so the team can spot a regression that only shows up at scale (contention, rate-limit interactions, batching pathologies). Skipping them means trusting that nothing scale-dependent will break, which is the kind of assumption that gets tested by accident.
Auto-rollback is the canary's superpower
The deciding feature of a canary system is what happens when a guardrail trips. Two patterns, in order of operational maturity:
- Manual rollback. A threshold breach pages a human, who reads the dashboard, decides whether to roll back, and clicks the button. Median time-to-rollback: 10–30 minutes during business hours, an hour+ overnight.
- Auto-rollback. A threshold breach drains the canary to 0% automatically and pages a human after the bleed has stopped. Median time-to-rollback: seconds.
The argument against auto-rollback is "what if it's a false positive?" In practice, a well-calibrated threshold (set from the bench replay procedure in Module 5 Step 5) has false-positive rates low enough that the cost of an occasional rollback drill beats the cost of a 30-minute manual response on every real regression. The argument worth taking seriously is "what if rollback itself is broken?" — which is the case for Step 5's monthly drill, not for keeping a human in the loop.
A subtle thing auto-rollback enables: aggressive ramps. A team that knows their rollback will fire in 30 seconds can hold the canary at 5% for 5 minutes, not 24 hours. The whole rollout pipeline tightens up. Teams without auto-rollback compensate by staying at low percentages longer, which is its own cost — every extra hour of low-ramp is an hour during which engineering effort is bound to one rollout instead of free for the next one.
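A sketch of the ramp controller that ties the schedule to those guardrails. The traffic router, metrics reader, guardrail check (the sketch in the previous subsection), and pager are passed in as callables because they belong to your infrastructure, not to this loop:

```python
import time

# (share of traffic, bake time in seconds): mirrors the ramp diagram above
RAMP = [(0.01, 5 * 60), (0.05, 30 * 60), (0.25, 2 * 3600), (0.50, 4 * 3600), (1.00, 0)]

def run_canary(set_traffic_split, read_canary_metrics, check, page_oncall,
               check_interval_s: float = 30.0) -> bool:
    """Walk the ramp; on the first guardrail breach, drain the canary and page afterwards."""
    for share, bake_s in RAMP:
        set_traffic_split(canary=share)
        deadline = time.monotonic() + bake_s
        while time.monotonic() < deadline:
            violations = check(read_canary_metrics())
            if violations:
                set_traffic_split(canary=0.0)   # auto-rollback: stop the bleed first
                page_oncall(f"canary rolled back at {share:.0%}: {violations}")  # page second
                return False
            time.sleep(check_interval_s)
    return True  # cleared; the rolling release (Step 3) walks the remaining fleet
```

Note the ordering inside the breach branch: traffic drains first, the human is paged second. That ordering is the difference between the two maturity levels above.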
When the canary disagrees with offline
A scenario that recurs:
- Offline bench at PR review: clean. All golden cases pass.
- Canary at 5%: error rate is 4% (threshold: 2%). Rollback fires.
What's happening? The bench was right about the cases it had. The bench did not have the cases that broke under canary. This is the same gap Module 5 Step 1 talks about between offline and online: offline only knows what you remembered to write down.
The correct response is to harvest the failing canary cases — pull a representative 20–50 of them — and add them to the bench. The next candidate against the bench will catch this class of regression at PR time. The bench grows; the canary's job shrinks back to its actual purpose, which is catching the new gaps you haven't yet seen.
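A sketch of the harvest step, assuming a JSONL golden set. The field names are illustrative; the property that matters is that real canary failures land in the same file the PR gate replays:

```python
import json
import random
from pathlib import Path

def harvest_canary_failures(failures: list, golden_path: str = "evals/golden-v3.jsonl",
                            sample_size: int = 30, seed: int = 0) -> int:
    """Append a representative sample of failing canary cases to the golden set."""
    sample = random.Random(seed).sample(failures, min(sample_size, len(failures)))
    with Path(golden_path).open("a") as f:
        for case in sample:
            f.write(json.dumps({
                "input": case["input"],            # what the agent saw
                "expected": case.get("expected"),  # filled in during triage if known
                "source": "canary-failure",        # provenance: real failure, not anticipation
                "bundle": case.get("bundle"),      # which version broke
            }) + "\n")
    return len(sample)
```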
A team that ignores canary failures because "the bench said it was fine" has accidentally inverted which gate they trust. The bench is your anticipation of failure. The canary is real failure. Real beats anticipation.
What a clean canary feels like
A few qualitative properties that distinguish a working canary from a decorative one:
- It can halt rollouts you wanted to succeed. A canary that has never blocked a deploy is either watching the wrong metric, set at the wrong threshold, or running on a code path that doesn't actually carry production traffic.
- It is the team's source of truth, not the dashboard. When someone asks "is v1.5 healthy?" the answer is "canary says yes" or "canary says no" — not "let me look at five dashboards."
- It enables small, frequent deploys. With a working canary, shipping a 3-line prompt edit feels safe at noon on a Tuesday. Without one, every change accumulates in a weekly batch deploy where the failure modes compound.
- The rollback is boring. When auto-rollback fires, nothing dramatic happens at the user layer because the bleed was stopped before it spread. The on-call learns about it from a notification, not from a paging storm.
The shape to aim for: small batches, fast feedback, cheap reversal. Canary is the deploy-time mechanism that makes that shape possible.
Try it: Stage 2 of the right pane is a live canary. The traffic source spawns ~3 requests per tick into two lanes — v1.4 control (indigo, 95%) and v1.5 canary (amber, 5%) — and each request bubble flies left-to-right; failed ones turn red. Below each lane: Error rate and Eval pass gauges with thresholds (2% / 95%). Drag the Canary ramp slider through 1% → 5% → 10% → 25% → 50% and watch canary bubble density rise. Click "Inject canary regression": within a few ticks the canary error gauge climbs past 2% — if Auto-rollback: ON, the ramp drops to 0% (red banner: ROLLED BACK) and the canary lane drains; if you flip Auto-rollback to OFF, the gauge stays red but the rollout doesn't self-heal, leaving the decision to a human. Toggle the slider back down, hit Reset, and try the same regression with auto-rollback off to feel the difference.
Rolling Releases
What is a rolling release?
A rolling release replaces old version instances with new ones gradually — region by region, pool by pool, or instance by instance — so the fleet always contains a mix of versions during the cutover, and the blast radius of any regression is bounded by how much of the fleet has flipped so far. The alternative is a big-bang flip, where all instances move to the new version at the same moment. Both ship the same bundle; the difference is in how the bundle reaches users.
Canary (Step 2) answers should we expose this version at all. Rolling release answers how do we move the rest of the fleet onto the version we just cleared. The two compose: a successful canary at, say, 5% exposure unlocks the rolling phase; rolling is the disciplined way to walk the remaining 95% to 100% without re-introducing the risk the canary was designed to eliminate. Vercel's Rolling Releases (GA in 2025) is one example of a first-class deploy primitive; AWS CodeDeploy and Kubernetes Deployments offer rolling strategies at different layers of the stack.
The shape of the rollout
A rolling release has three dials, in roughly this order of operational importance:
- Wave size. How many instances flip in each wave. Smaller waves → finer-grained risk; longer total rollout. Larger waves → faster rollout; bigger jumps in blast radius.
- Bake time. How long the fleet sits between waves. Longer bake → more time for the new version to expose any regressions on real traffic before the next wave commits. Shorter bake → faster rollout but less observation per step.
- Ordering. Region-first (US-East, then US-West, then EU, then APAC) vs pool-first (canary pool, then internal users, then 10% of paid, then everyone). Region-first contains time-zone-correlated failures; pool-first contains audience-correlated failures.
A workable starting set for an agent fleet of ~20 instances: wave size 2, bake time 60–120 seconds for low-risk changes, longer (5–10 minutes) for changes the team is less confident about. Real production rollouts tune this from incident history — every time a regression slips through, the team asks "would a smaller wave or longer bake have caught it?" and adjusts.
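The three dials translate directly into a loop. A sketch, with the platform-specific pieces (`flip_instance`, `region_health`) injected as callables; in practice the deploy platform usually owns this loop for you:

```python
import time

def rolling_release(instances, flip_instance, region_health,
                    wave_size: int = 2, bake_s: float = 90.0) -> bool:
    """Flip the fleet in waves; halt before the next wave if any region's health degrades.

    `instances` is an ordered list of (instance_id, region) pairs, and that ordering is
    the region-first / pool-first / random policy. `flip_instance` moves one instance to
    the new bundle; `region_health` returns True while a region's gauges are in bounds.
    """
    for i in range(0, len(instances), wave_size):
        for instance_id, _region in instances[i:i + wave_size]:
            flip_instance(instance_id)
        time.sleep(bake_s)  # bake: let the wave see real traffic before committing further
        flipped_regions = {region for _id, region in instances[:i + wave_size]}
        if any(not region_health(region) for region in flipped_regions):
            return False    # halt mid-rollout; the on-call resumes, waits, or reverse-rolls
    return True
```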
Why big-bang is so tempting and so wrong
The argument for big-bang: it's simple. One button, all instances flip, you're done. No partial state, no inconsistent fleet, no debugging "v1.4 on US-East and v1.5 on EU." For a small enough fleet on a low-stakes path, big-bang really is fine.
For anything else, the cost shows up in three places:
- Blast radius is 100%, by definition. A regression that survives the canary still affects every user the moment big-bang completes. Rolling release contains it to the fraction already flipped.
- No incremental signal. A regression that only manifests at scale (e.g. contention on a downstream tool, a thundering herd on cache invalidation) does not give you a smaller, easier-to-debug version of itself before the full version hits. Rolling release surfaces the same regression at 25%, where it's recoverable.
- Rollback is also big-bang. When you discover the regression, every user moves back to v1.4 in one synchronized flip — which has its own risk (cache cold-start, connection-pool reset, downstream rate-limit spike). Rolling reverse is the symmetric primitive.
The mental model: rolling release is to deploy what a parking brake is to driving. You don't need it on a flat empty road. You very much want it when the road turns.
Per-region health and the halt
A rolling release without a halt mechanism is just a slow big-bang. The halt is what makes it safe: each region (or pool) has its own health gauge — error rate, latency, eval pass — and if any gauge crosses its threshold, the rollout pauses before the next wave fires.
What happens after the halt depends on the team's operating model. The minimum is the halt itself — the rollout stops mid-wave and the on-call decides whether to resume, abandon (manual rollback), or wait for the gauge to settle. More mature deploy systems extend this with auto-rollback of the already-flipped instances — the same primitive Step 2's canary uses, scaled up to the rolling fleet — so a halted wave triggers a reverse-roll that returns degraded regions to v1.4 without paging anyone first. The sim on the right halts and waits; production-grade systems typically halt-then-rollback.
This is also where region ordering pays off. A US-East-first rollout that halts on a regression has already bought the EU and APAC fleets a full bake-time's worth of warning; they never receive the bad version at all. The fix lands in v1.6 before EU's wave ever starts.
Ordering: regions, pools, or random
Three common orderings, each with a different first-principles justification:
- Region-first. Each region rolls out completely before the next begins. Best for failures that are time-zone-correlated (a downstream service whose load profile differs by region) or jurisdictional (a refund-flow change that needs to land in EU before US).
- Pool-first. A "canary pool" of friendly users (internal staff, beta opt-ins) rolls out first; then a "free tier" pool; then "paid"; then "enterprise." Best for failures that are audience-correlated — a verbose-output regression that internal users tolerate but paying customers escalate.
- Random / hash-based. A uniform random sample of instances flips per wave, regardless of region or pool. Best when no correlation structure is known and the goal is just to keep the blast radius proportional to wave count.
Most teams default to region-first because the team is already organized around regional on-call rotations, and a halt aligns the incident response with the on-call who can react. Pool-first is the right answer when the audience structure of the product is the dominant risk axis (most consumer SaaS). Random is the right answer when neither of those is true.
What changes with platform-native rolling primitives
A few years ago a rolling deploy was a custom orchestrator the team built and maintained — a Python script, a Kubernetes operator, an Argo workflow. Increasingly, the deploy platform offers it as a first-class verb: Vercel Rolling Releases, AWS ECS rolling-update deployments, Kubernetes Deployment rolling updates with `maxUnavailable` / `maxSurge`. The shift matters because:
- The team owns one fewer load-bearing system. Custom rolling orchestrators tend to be the kind of code that's "fine until it isn't" — battle-tested for the last regression but not the next one.
- The primitive supports one-click rollback. Platform-native rolling primitives store the previous deploy as a first-class artifact, so the rollback action in Step 5 is `vercel rollback` or `kubectl rollout undo`, not "re-run the deploy pipeline pointed at the previous commit."
- Observability is built in. The platform's dashboard shows per-instance version, per-wave progression, per-region health — without the team wiring it.
The argument for keeping your own orchestrator is usually "we need behavior X that the platform doesn't support." Audit that argument every six months. The platforms are catching up, and the cost of keeping the in-house version maintained tends to exceed the benefit of behavior X within a year or two.
The fleet view is the operating view
When a rolling release is in flight, the on-call's primary view is the fleet grid: every instance, its current version, its current health, the wave it belongs to. This is the view that answers the on-call's three questions at a glance:
- Where are we? (How many instances on v1.5, how many still on v1.4.)
- Is anything wrong? (Any red instances, any region with degraded health.)
- What's the blast radius right now? (% of fleet flipped × probability a regression affects them = expected user impact.)
A team without this view is flying the rollout from the deploy logs, which means they answer those three questions by running grep against text output. The cost is mostly invisible during normal rollouts and very visible during incidents.
Try it: Stage 3 of the right pane is a 20-instance fleet across 4 regions (5 instances each). Click "Start rollout (v1.4 → v1.5)" in the default Rolling mode with wave-size 2 and bake-time 0.8s — watch waves of 2 instances flip indigo → amber across regions, with the per-region health gauges below tracking 100% green. Adjust Wave size and Bake time sliders and reset to feel the speed/granularity tradeoff. Now switch to Big-bang mode and start the rollout again: all 20 instances flip at once. Use "Inject regression in: US-East" mid-roll to mark the v1.5 instances in that region as degraded (red); in Rolling mode the rollout halts and the EU/APAC fleets never receive the bad version; in Big-bang mode every flipped instance in US-East goes red simultaneously. The "Blast radius" panel on the right side shows the % of users on the new version at any moment — that's the upper bound on damage if a regression hits now.
Version Pinning
Why pin model and prompt versions?
Version pinning means specifying the exact snapshot ID of every load-bearing dependency — model, prompt, tool schema — in your deploy bundle, rather than a moving alias like `latest`. Pin in production so your behavior changes only when a human merges a PR; allow `latest` in dev so engineers can experiment with new models the day they ship. Without pinning, your agent's behavior is a function of when the request happens to fire, which is the property the rest of this module's discipline is trying to eliminate.
The clean analogy is dependency pinning in package management. A team that runs `npm install react` (no version) in production is taking whatever version of React npm happens to serve that morning. Catching a regression in React 18.3.1 is hard enough; catching one in "whichever React was current at 4am" is debugging archaeology. Model snapshots are the same shape of problem with higher stakes — behavior changes can be subtle (slightly more verbose output, a different default refusal threshold) and only show up in production load.
What providers actually ship under an alias
Every major model provider runs the same operational pattern. They publish a family alias (`gpt-5.4`, `claude-opus-4-7`, `gemini-2-pro`) that is updated to point to successive snapshots. The snapshots have date-suffixed IDs:
- `claude-opus-4-7-20251020` — the snapshot released October 20.
- `claude-opus-4-7-20251215` — the snapshot released December 15.
- `claude-opus-4-7-20260108` — the snapshot released January 8.
Each snapshot is a different model. They are typically very similar — the alias is meant to be backward-compatible — but they are not identical. Differences that have actually shipped under the same alias across providers:
- A new snapshot is 15–30% more verbose on the same inputs.
- A new snapshot is more eager to ask clarifying questions.
- A new snapshot's JSON output is now wrapped in markdown fences (or vice versa).
- A new snapshot has different refusal thresholds for borderline content.
- A new snapshot's tool-calling argument order changes.
None of these are "bugs" from the provider's perspective — they are intended improvements. Each one is a behavior change to your agent. If you pin to the alias, you discover the change when your metric moves; if you pin to the snapshot ID, you discover it when you bump the pin, in a PR, before users see it.
Provider release notes name the major behavioral changes, but rarely capture the surface area that matters for any specific agent. Most teams find that one or two snapshots per year ship a change that breaks their agent in a way that the release notes did not flag — because the change only matters in combination with the team's specific prompt or tool schema.
What pinning actually looks like in the bundle
The bundle from Step 1 includes the snapshot ID directly:
```yaml
# bundle: v1.5
prompt: prompts/refund-agent/v5.txt
model:  claude-opus-4-7-20251215    # pinned snapshot ID
tools:  schemas/refund-agent-v3.yaml
```
The deploy reads model from the bundle and passes it verbatim to the provider API as the model parameter. The provider returns predictable behavior because the snapshot is fixed. Your agent's behavior is now a function of the bundle alone, not the date.
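A sketch of that resolution step. `call_model` is a placeholder for your provider SDK wrapper; the load-bearing detail is that the snapshot ID travels from the bundle to the API call untouched, with a startup check that refuses floating aliases:

```python
import yaml

def load_bundle(path: str = "bundle.yaml") -> dict:
    bundle = yaml.safe_load(open(path))
    suffix = bundle["model"].rsplit("-", 1)[-1]
    if not (len(suffix) == 8 and suffix.isdigit()):
        # "claude-opus-4-7" or "latest" fails here; a deploy carrying an alias never starts.
        raise ValueError(f"bundle must pin a date-suffixed snapshot, got {bundle['model']!r}")
    return bundle

def answer(bundle: dict, user_message: str, call_model) -> str:
    """`call_model` is your provider SDK wrapper; the model parameter passes through verbatim."""
    system_prompt = open(bundle["prompt"]).read()
    return call_model(model=bundle["model"], system=system_prompt, user=user_message)
```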
Bumping the pin is a PR like any other:
```diff
- model: claude-opus-4-7-20251215
+ model: claude-opus-4-7-20260108
```
That PR runs through the same CI as a prompt edit — offline evals against the golden set, then the canary, then the rolling release. The team decides when to bump the pin. The decision is informed by provider release notes, but the gate is the eval.
The dev / prod asymmetry
The right pattern is not "pin everything everywhere." It is:
- Production: pin to a specific snapshot ID. Bump through a PR.
- Staging / canary: pin to the same snapshot ID as production. Bump them together.
- Dev / playground: allow `latest` (or the alias). Engineers experimenting with prompts want to feel the new model immediately.
The asymmetry exists because dev and production optimize for different properties. Dev wants to surface the gap between today's model and yesterday's so the team has time to react to a snapshot change before it lands in production. Production wants behavior to be a deliberate human decision, not a side effect of provider releases.
A subtle wrinkle: engineers who only ever work in dev with `latest` can develop intuitions that don't transfer to production. If staging is pinned, the team's "what does the model do" muscle memory comes from the pinned snapshot, not the alias. Bias dev usage toward the pinned snapshot when the work is "what will ship," and toward `latest` only when the work is "what's coming."
Pin the prompt too
The same discipline applies to prompts and tool schemas, with slightly different mechanics. The prompt is content in your repo; pinning means the deployed prompt comes from a specific commit SHA, and updating means a PR. Tool schemas pin the same way. Tool implementations (the actual code that runs when the agent calls a tool) pin via the normal code release flow.
What makes this matter is that the behavior of the bundle depends on the cross-product of all three. A team who pins the model but lets the prompt be hot-edited in a console (per Step 1) has not actually pinned anything from a behavior perspective — the prompt change can still produce a new agent silently.
Pinning is a property of the bundle, not of any single component. The bundle is pinned iff every component inside it is pinned.
When pinning hurts
The honest tradeoffs of strict pinning:
- Security patches in the model. If the provider ships a snapshot specifically to fix a jailbreak, a pinned agent is still running the unfixed version until the team bumps the pin. The mitigation: pin tracking — a daily job that diffs your pinned snapshot against the latest, surfaces release notes, and opens a PR with "consider bumping to YYYYMMDD" attached.
- Cost / latency improvements. Providers regularly ship snapshots that are cheaper or faster at equivalent quality. A pinned agent leaves those wins on the table until the team bumps. Run the eval against the new snapshot quarterly; bump when the cost-or-latency improvement is worth the bump effort.
- Capability deltas. A new snapshot can do something the old one couldn't (vision, longer context, function-calling improvements). A pinned agent is locked out of new capabilities until the bump.
In every case, the answer is "bump deliberately, on the team's schedule, gated by evals" — not "never bump." Pin policies that never update are a different failure mode (slow rot) from pin policies that update reactively (immediate surprise). The team is choosing controlled rot over uncontrolled surprise; controlled rot is the cheaper bug.
Pin tracking as a routine
A workable operational shape:
- Daily: a CI job diffs `pinned_snapshot` against `provider_latest` and notes the lag in days.
- Weekly: an automated PR is opened with "consider bumping to YYYYMMDD," attached release notes, and the eval suite pre-staged. The PR is not auto-merged; a human reads the release notes and the eval result.
- Monthly: the team reviews snapshot-drift across all pinned dependencies (models, embedding models, retrieval indexes) in a 15-minute sync. The output is either "bump now" or "explicit decision to hold."
This is the same shape of discipline as Dependabot for libraries — the routine is what keeps the rot bounded. A team that never bumps is a team whose pin will be 18 months stale the day a provider deprecates the snapshot they're on.
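A sketch of the daily lag computation. How you discover the provider's newest snapshot (release-notes feed, API listing, a manually maintained file) is up to you; the job's only output is the lag number the weekly PR and monthly review consume:

```python
import datetime

def snapshot_date(snapshot_id: str) -> datetime.date:
    """Parse the trailing YYYYMMDD out of a date-suffixed snapshot ID."""
    return datetime.datetime.strptime(snapshot_id.rsplit("-", 1)[-1], "%Y%m%d").date()

def pin_lag_days(pinned: str, provider_latest: str) -> int:
    """How far behind the provider's newest snapshot the production pin currently is."""
    return (snapshot_date(provider_latest) - snapshot_date(pinned)).days

# Example with the pin from the bundle above and a newer provider snapshot:
print(pin_lag_days("claude-opus-4-7-20251215", "claude-opus-4-7-20260108"))  # 24
```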
What about caching prompts and embeddings?
A real wrinkle from Module 4: if your agent uses prompt caching or a vector retrieval index, those are also versioned dependencies that can drift.
- Cached prompts are keyed by exact prompt-prefix content. A bundle bump that changes the system prompt invalidates the cache for that prefix — costs go up temporarily until the cache re-warms. Plan for this in the deploy window.
- Embedding model versions matter too. If you re-embed your retrieval corpus with a new embedding model, queries embedded with the old model retrieve the wrong neighbors. Embedding models pin into the bundle the same way generation models do.
- Retrieval index versions matter when the corpus is re-indexed. The team pins to a specific index version; bumping is a PR.
The general rule: anything whose change can move your agent's behavior is part of the bundle, and is pinned by the bundle's version.
Try it: Stage 4 of the right pane is two parallel agents — Unpinned (left, using claude-opus-4-7 alias) and Pinned (right, using claude-opus-4-7-20251215). Both ride the same baseline. Click "Provider ships a model update": a new snapshot dot appears on the unpinned timeline and the unpinned behavior line takes a visible step-jump; the "Surprises last 30d" counter ticks up. The pinned side is unchanged — same snapshot label, same flat behavior line. Click "Bump pinned version (PR + eval gate)": the pinned snapshot label advances (e.g. 20251215 → 20260108) and the behavior line moves smoothly to a new level — a controlled change, not a surprise. Fire several "Provider ships..." clicks and watch the unpinned behavior line bob around while pinned holds steady. The teaching is in the visible noise: behavior should change when a human merges, not when a provider's release pipeline happens to fire.
Rollback Discipline
What is rollback discipline?
Rollback discipline is the operational practice of being able to revert the whole bundle — prompt, model, tool schemas, code — to a known-good previous version in one command, sub-second, and to know it works because the team drills it. A rollback that takes 20 minutes because someone has to remember which git tag, copy-paste a config, restart the right pool, and pray is not rollback; it's hopeful recovery. The discipline is what lets the rest of the rollout machinery (canary, rolling release, auto-rollback) actually be safe — every guarantee those primitives offer is contingent on the rollback they would trigger being fast and complete.
This module's spine connects here. Step 1 made the bundle atomic so a rollback has something coherent to revert. Step 2's canary halts on regression — but only if rollback is the cheap response, not the expensive one. Step 3's rolling release contains blast radius — but only if reversing the roll is symmetric. Step 4's pinning makes the previous version concrete — but only if "the previous version" is what rollback restores. All four primitives lean on this step.
The bundle is the rollback unit
The single most common failed-rollback story is the one where someone "rolls back" by reverting the code repo while leaving the prompt configured in a separate console at the new version. The deploy returns to v1.4 code; the prompt stays at v1.5. The agent now runs prompts that instruct it to call v1.5 tool schemas (which v1.4 code does not know about) against v1.4 model behavior (which v1.5 prompts were not tuned for). Some tool calls fail validation. Some succeed but return the wrong thing. The team thinks they rolled back; users continue to experience the regression.
The fix is the discipline from Step 1: the prompt, the model snapshot ID, and the tool schemas live in the bundle, and the deploy primitive promotes or reverts the bundle, not "the code." A correctly atomic rollback restores all three components together, by construction. It is not the team's discipline to remember to also revert the prompt; it is the deploy system's responsibility to make that impossible to forget.
The mental model: the deploy bundle is a single immutable artifact identified by a version. The deploy system has a stack of these artifacts. The active version is whichever is at the top. Rollback pops the stack. The previous version becomes active. No "rollback the code, then rollback the prompt, then rollback the tool schemas" — those concepts don't exist at the deploy layer, because they're not separately versioned.
Why the previous version stays warm
A real-time rollback only works if the previous version is already loaded — model weights cached, tool schemas resolved, any process state hydrated. A "rollback" that has to cold-start v1.4 from scratch takes minutes, not seconds. During those minutes, every user request is either failing or being held in a buffer, and the on-call's confidence in the rollback drops as the timer climbs.
The discipline is to keep the previous version warm — running, ready, drawing zero or near-zero traffic. Most production deploy systems do this for you: the previous deployment's containers stay alive for a configurable bake period after the new deployment is fully ramped. Vercel's Rolling Releases, AWS ECS blue/green, and Kubernetes Deployments all expose this property.
The warm window matters because it is the window in which a one-command rollback is meaningful. After the warm window expires and the old version is decommissioned, a "rollback" requires a re-deploy of v1.4 — which is hopefully still in CI, with passing tests, and against an unchanged downstream world. That sequence can take 5–30 minutes depending on your pipeline. A 30-minute "rollback" is just a slow forward deploy with a misleading name.
Practical defaults: warm window ≥ 2× the longest expected detection time for a regression. If your canary detects regressions within 5 minutes but your A/B harness can take 24 hours to spot a user-experience issue, the warm window is ≥ 48 hours, not ≥ 10 minutes. Different teams set this differently; the question to answer is "how long after a deploy might we still want to roll back without a re-deploy?"
The shape of the rollback command
A clean rollback is one command that does one thing:
```
$ deploy rollback        # → v1.5 → v1.4, atomic, sub-second
```
What it does mechanically:
- The deploy system updates the active-version pointer from v1.5 to v1.4.
- The traffic router begins sending requests to v1.4 instances.
- The v1.5 instances drain in the background (in-flight requests complete, no new ones admitted).
- The on-call gets a notification: "v1.5 rolled back to v1.4 at HH:MM:SS, initiated by alex@team, in response to canary error-rate breach."
What it does not do: page anyone first, ask for confirmation, require typing a version number, require knowing the previous version's git tag, require restarting a service, require opening a console.
The on-call's job in a rollback is to recognize the situation, fire the command, and wait the few seconds for the deploy system to confirm. Everything else is automation. If your rollback procedure has six steps, your rollback procedure is broken — fix the procedure, don't get better at executing it.
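A sketch of the deploy-stack mental model behind those four steps. The traffic router and notifier are injected placeholders; what the sketch preserves is that rollback is one call that moves one pointer:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable

@dataclass
class DeployStack:
    """Immutable bundle versions, oldest first; the active version is the top of the stack."""
    versions: list                 # e.g. ["v1.3", "v1.4", "v1.5"]
    route_traffic_to: Callable     # placeholder: points the traffic router at a version
    notify: Callable               # placeholder: posts to the on-call channel

    @property
    def active(self) -> str:
        return self.versions[-1]

    def rollback(self, initiated_by: str, reason: str) -> str:
        """One command, one pointer move: pop the active version, route to the previous one."""
        if len(self.versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        retired = self.versions.pop()
        self.route_traffic_to(self.active)   # the previous version is already warm, so this is fast
        self.notify(f"{retired} rolled back to {self.active} at "
                    f"{datetime.now(timezone.utc):%H:%M:%S}, initiated by {initiated_by}, "
                    f"in response to {reason}")
        return self.active
```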
Auto-rollback vs manual rollback
The two patterns from Step 2's canary apply here too:
- Auto-rollback fires when a guardrail (canary error rate, p95 latency, eval pass rate) breaches threshold. The deploy system executes the rollback command on its own and pages the on-call after the rollback has completed. The on-call's job is investigation, not bleed-stop.
- Manual rollback fires when a human reads a signal and decides to revert. This is the path for regressions the automation didn't catch — a user complaint pattern, a downstream metric the team hadn't wired up, an "oh shit" intuition from someone reading the logs.
A working team has both. Auto-rollback is the default response for the named regressions. Manual rollback is the safety valve for the unnamed ones. A team with only manual rollback pays the bleed-stop cost on every regression. A team with only auto-rollback discovers the unnamed regressions through user complaints, which is the cost they were trying to avoid.
Drill it monthly
The reason most rollback procedures fail in incidents is that they were never exercised. The team built the system, deployed it, and then never used it again because nothing ever needed rolling back. Six months later, when something does need rolling back at 2am, the rollback command's syntax has subtly changed, or the warm-window expired, or the traffic router config drifted, or the team member who set it up has left.
The drill cadence that catches this:
- Monthly: a scheduled rollback drill in a staging environment that mirrors production. The team picks a recent deploy, fires the rollback, and times it. The output is a number; the number goes in a doc.
- Quarterly: a rollback drill in production. Pick a low-risk recent deploy, fire the rollback during business hours, time it, then roll forward again. This sounds aggressive but is much less aggressive than discovering at 2am that the production rollback path is broken.
- Per incident: if a real rollback was fired in the last 30 days, the next monthly drill is skipped. Real rollbacks are drills.
The drill is not just timing. It is also: does the rollback notify the right people, does the on-call dashboard reflect the rollback within seconds, do downstream systems handle the version change gracefully, do any background jobs need restarting. Each of these is a class of subtle failure that only shows up under real rollback, and the drill is your chance to find them on a Tuesday afternoon instead of a Sunday morning.
A useful metric to track over time: median rollback duration across drills. As the team practices, the number trends down — from minutes the first time, to tens of seconds, to single-digit seconds. The trend matters more than the absolute value; a flat line at "we don't know" is the failure mode.
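A tiny sketch of the drill log. The storage format is illustrative; the point is that median rollback duration becomes a number the team reads off a file rather than a feeling:

```python
import json
import statistics
import time
from pathlib import Path

DRILL_LOG = Path("ops/rollback-drills.jsonl")

def record_drill(duration_s: float, environment: str, initiated_by: str) -> None:
    """Append one drill (or a real rollback; real rollbacks count as drills) to the log."""
    DRILL_LOG.parent.mkdir(parents=True, exist_ok=True)
    with DRILL_LOG.open("a") as f:
        f.write(json.dumps({"ts": time.time(), "duration_s": duration_s,
                            "environment": environment, "initiated_by": initiated_by}) + "\n")

def median_rollback_s(last_n: int = 12) -> float:
    """The trend the team watches: median duration across the most recent drills."""
    rows = [json.loads(line) for line in DRILL_LOG.read_text().splitlines()]
    return statistics.median(r["duration_s"] for r in rows[-last_n:])
```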
When rollback is not the right response
A few situations where the on-call's first instinct should not be rollback:
- Data corruption. Rolling back the agent does not undo bad rows written to the database. The right response is to stop further writes (kill switch the agent if needed) and then triage the data separately.
- Tool downstream is the actual problem. Sometimes the symptom looks like an agent regression but the underlying issue is that the refund API started returning malformed responses. Rolling back the agent doesn't help. The investigation has to identify what changed downstream before any rollback decision.
- User-side problem. A change in user behavior (new spam pattern, a viral input that overwhelms the agent's specific shape) is not solved by rolling back to a version that handles the previous user behavior. The right response is a forward fix or a temporary input filter.
In each case the rollback should be on the menu as a possible response, but it isn't automatically the right one. The discipline is to ask, in the first 30 seconds of an incident, "what would rolling back do?" and to only fire if the answer is "stop the bleed." If the answer is "nothing, because the bug isn't in the agent," go fix the actual thing.
Closing the loop
The rollout pipeline this module described is one closed loop:
```
PR + eval gate (M5/S5)
  └► Bundle (S1)
      └► Canary (S2)
          └► Rolling release (S3)
              └► Pinned snapshot (S4)
                  └► Rollback discipline (S5)
                      └► back to PR
```
A regression at any step becomes a new bench case at the PR step. The bench grows. The team's anticipation of failure improves. The canary's job shrinks back to its original purpose (catching the residual). The drill cadence keeps the rollback primitive sharp. The pipeline is not "deploy a thing"; it is the slow accumulation of a team's collective understanding of how their agent fails, encoded as gates and primitives that survive the team's turnover.
Module 7 (Incident Handling) picks up where this module ends — what to do during the regression that this pipeline did not catch. The two modules are duals: deployment-and-rollout is the operating discipline that minimizes the rate of incidents; incident handling is the operating discipline for when one happens anyway.
Try it: Stage 5 of the right pane shows the deploy stack — v1.5 (active, amber), v1.4 (warm, indigo), v1.3 and v1.2 (archived, dim). The big red ⏪ Rollback to v1.4 button is in the controls. Click it with Atomic bundle selected: a sub-second rollback fires, v1.4 lights up as the new active row, and the "last rollback" timer reads ~300–500ms. Now click Reset, switch to Code only, and rollback again: this time the v1.5 row gets a red PROMPT ✗ pill, the v1.4 row gets a red CODE ✗ pill, and the mismatch banner fires — code is now at v1.4 (schemas-v3) but the prompt is still at v1.5 (prompt-v5). The bundle is no longer a single point in the stack. Hit "Run rollback drill" several times in a row; the drill bars trend down as repetition builds muscle memory (faster bars in lighter green). Each drill is a synthetic rollback that doesn't touch the active stack — it's the practice, not the operation.
Further Reading
- Vercel — Rolling Releases — first-class rolling-release primitive (GA June 2025) with one-click rollback.
- Google SRE Workbook — Canarying Releases — the canary chapter that the agent-canary pattern descends from.
- LaunchDarkly — Progressive Delivery patterns — feature flags + canary + rolling release as a unified operating model.
- Charity Majors — Deploys are the foundation of high-performing teams — the case that deploy velocity is itself a quality signal.
- Anthropic — Building Effective Agents — the broader operating model that this module's discipline ladders into.