Production Evals & Shadow Mode for AI Agents
Online vs Offline Evals
What is the difference between online and offline evals?
Offline evals run a fixed set of golden test cases through a candidate version, off the production traffic path, before any real user sees it. Online evals score live production traffic in flight — same model decisions the user already received, judged after the fact. The two are not alternatives. Offline catches the regressions you anticipated; online catches the failures your golden set could not predict. A team that ships only one of these is shipping with one eye closed.
The shape of the problem is the same as test pyramids in classical software — unit tests catch what you wrote tests for; integration and production telemetry catch what you didn't think to write tests for. The difference is that LLM agents are stochastic and the judging criteria are usually qualitative, so neither half can be a single boolean assertion. Both halves return a score distribution, and the gates you build on those distributions are what this module is about.
Offline: the bench you can re-run on demand
An offline bench is a fixed corpus of inputs paired with one of three outcomes per case:
- A reference answer — the exact expected output. Useful for deterministic tools like SQL generation or schema validation.
- A reference rubric — a checklist the response must satisfy ("addresses the user's question," "does not invent an account number," "stays under 200 tokens"). Useful when many surface forms are correct.
- A reference behavior — what tools the agent should call, in what order, with what arguments. Useful for tool-using agents where the answer is mostly the trace, not the final string.
The judging is one of: exact match, structural match, an LLM-as-judge prompt, or a small panel of human raters for the cases that matter most. Most production teams use a blend — exact match for what can be matched exactly, LLM-as-judge for everything else, and human spot-checks for the regressions that look surprising.
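To make the three reference types and the judging blend concrete, here is a minimal sketch of a bench case and its judge dispatch, assuming one case schema and one hypothetical `judge_with_llm` hook; the field names are illustrative, not a prescribed format.

```python
from dataclasses import dataclass, field

@dataclass
class BenchCase:
    id: str
    input: str
    # Exactly one of the three reference types is populated per case.
    reference_answer: str | None = None                # exact expected output
    rubric: list[str] = field(default_factory=list)    # checklist the response must satisfy
    expected_tool_calls: list[dict] | None = None       # expected tool-call behavior

def judge_with_llm(rubric: list[str], user_input: str, response: str) -> bool:
    # Placeholder: in production this calls a cheap judge model with the rubric.
    # Only the contract is sketched here; any real implementation is provider-specific.
    raise NotImplementedError

def judge(case: BenchCase, response: str, tool_calls: list[dict]) -> bool:
    """Dispatch each case to the cheapest judge that can grade it."""
    if case.reference_answer is not None:
        # Exact match for deterministic outputs (SQL, schema validation, ...).
        return response.strip() == case.reference_answer.strip()
    if case.expected_tool_calls is not None:
        # Structural match: same tools, same order, same arguments.
        return tool_calls == case.expected_tool_calls
    # Everything else: rubric-based LLM-as-judge.
    return judge_with_llm(case.rubric, case.input, response)
```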
The strength of offline is reproducibility. The same bench against the same candidate returns the same score; that is the whole point. You can A/B prompt edits at your desk in seconds, not days. The weakness is coverage. The bench encodes only the failures you imagined when you wrote it. The first time a user asks a question your bench did not anticipate, the candidate either handles it or doesn't — and you find out from the support inbox.
Online: the stream you cannot replay
An online eval is a scorer attached to live traffic. Each model decision the user actually received gets judged after the fact, asynchronously, and the verdict lands in your metrics pipeline. The judge can be:
- Heuristic — pattern matches over the response ("did it call a tool? did it refuse? did it produce code?"). Cheap, fast, narrow.
- LLM-as-judge — a separate cheap model rates each response on a small rubric. Anthropic's Building Effective Agents and OpenAI's Evals framework both lean on this pattern; it is the workhorse of online judging in 2026.
- Implicit signal — did the user accept the agent's answer, ask a clarifying follow-up, abandon the session, or escalate to a human? These are noisier than explicit ratings but they are what the business actually pays for.
Online judging is where you discover the failure modes your bench missed: the user phrasings you didn't anticipate, the seasonal drift in topic mix, the subtle regressions that one model version introduced for one user segment. The cost is that you are judging after the user already saw the response. Online is detection; it is not prevention. That is why you want offline at PR time and online running continuously — different points in the timeline, both necessary.
| Property | Offline bench | Online stream |
|---|---|---|
| When it runs | Pre-deploy, on demand | Continuously, on live traffic |
| Inputs | Fixed golden corpus | Real user requests |
| Reproducibility | Deterministic per candidate | One-shot per request, never replayed identically |
| Coverage of your imagination | 100% | 0% |
| Coverage of users' reality | Whatever you remembered | Whatever they sent |
| Cost per eval | Pennies (you control batch size) | Per-request judge cost × live volume |
| What it gates | Whether to ship the candidate at all | Whether to keep ramping the rollout |
The trap of relying on only one half
A team that runs offline alone ships clean candidates that crater on a user phrasing nobody on the team uses. A team that runs online alone discovers regressions only after their users do — and only on traffic patterns that were already common, never on the rare ones that disproportionately matter.
Two failure stories worth keeping in mind:
- Offline-only. A retrieval agent passes its 200-case bench at 94%. Three days after the rollout, support tickets spike: a user demographic that asks questions in fragments ("balance? checking, not savings") gets routed to the "send a refund" tool instead of the "lookup balance" tool. The bench had no fragmented inputs because the team writes English sentences when they author test cases. The problem was structural, not stochastic.
- Online-only. A team without a golden bench ships a prompt edit at noon. By 3pm the LLM-as-judge metric drops 4 points. By 5pm they are still arguing whether the drop is real or a judge regression. With a pre-shipped bench they would have caught the change in 30 seconds at PR time.
The offline-online split is one of the cheapest disciplines in this whole module. The bench is a folder of YAML or JSON; the online judge is a sidecar that scores spans from Module 2's traces. Both can be in production within a sprint, and after that they pay rent forever.
What goes in the golden bench
Every team eventually settles on a bench shaped like a portfolio. The four buckets that recur across production teams:
- Happy path representatives — the most common 5–10 task shapes, each with 3–5 phrasings. These should pass at >95%. When they don't, you have a regression on your largest bucket.
- Edge cases that have already burned you — every postmortem-grade bug becomes a bench case. This is the single highest-value bucket per case added; it grows naturally as the agent matures.
- Adversarial inputs — prompt injections, malformed inputs, jailbreaks, and the lethal-trifecta combinations from the Foundations Module 8 — each paired with the safe behavior. A regression here is shipping a security fix backwards.
- Counterfactuals — same intent, deliberately reworded; same words, different intent. Tests that the agent generalizes intent rather than pattern-matching on phrasing. The hardest bucket to write and the most valuable when it catches something.
In our experience and across published practitioner write-ups, teams commonly start at 50–200 cases. Below ~50 the bench is too noisy to gate on; past ~500 the bench-run latency starts to discourage running it on every PR. Most teams settle in the 100–300 range long-term, with the bench growing one case per shipped postmortem — exact numbers vary by domain and model.
What the online judge actually measures
The trap with online judging is grading the wrong thing. A judge that scores "is this response well-written prose?" will love a beautifully-written wrong answer. The questions that pay off in production:
- Did the agent satisfy the user's request? (Outcome quality.)
- Did it call the right tools, in the right order, with the right arguments? (Behavioral correctness — only checkable when you have the trace.)
- Did it stay inside policy? (Safety / refusal / disclosure rules — most usefully paired with Module 3's output filters.)
- Did the user signal acceptance? (Implicit — the most honest metric and the noisiest.)
Pick two or three of these as your top-line online metrics; resist the temptation to measure twelve things equally. The eval-driven rollout in Step 5 will gate on these — and a gate that is the geometric mean of twelve metrics is a gate that means nothing.
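A sketch of what an online judge sidecar can look like: it samples completed production spans, asks the two or three top-line questions, and emits scores to the metrics pipeline. The span shape, the `rate_with_judge_model` call, and the metric names are assumptions for illustration, not a specific framework's API.

```python
import random

# The small set of top-line questions the rollout will actually gate on.
RUBRIC = {
    "outcome": "Did the response satisfy the user's request?",
    "behavior": "Did it call the right tools, in the right order, with the right arguments?",
    "policy": "Did it stay inside refusal and disclosure policy?",
}

def rate_with_judge_model(question: str, span: dict) -> float:
    """Hypothetical call to a cheap judge model; returns a 0.0-1.0 score."""
    raise NotImplementedError

def score_span(span: dict, sample_rate: float = 0.10) -> dict[str, float] | None:
    """Asynchronously judge one completed production span.

    Judging is sampled: at high volume you rarely need every request scored
    to get a stable daily metric.
    """
    if random.random() > sample_rate:
        return None
    scores = {name: rate_with_judge_model(q, span) for name, q in RUBRIC.items()}
    # In a real pipeline these land in the metrics store keyed by the span's
    # correlation ID so they can be sliced by segment and version later.
    return scores
```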
Where this fits in the rollout pipeline
Offline is the gate at the prepare stage. Online is the gate at the operate stage. The intermediate stages — shadow, A/B — are the bridges between them, and Steps 2 through 5 of this module each cover one. The module's spine: offline catches the easy regressions, shadow catches the diffs you didn't predict, A/B measures the user-visible delta, drift detection notices when the world changes underneath you, and an eval-driven rollout policy ties them all into the deploy pipeline.
Try it: Stage 1 of the right pane shows both eval modes side by side. The top panel is an offline bench of 12 golden cases comparing v1.4 (baseline) and v1.5 (candidate); 2 cases regress (DANGER red) and 1 marginal (WARN amber) — bench score 87.5%. Click "Re-run bench" and notice the result is identical: that deterministic property is what makes offline a useful gate. The bottom panel is an online stream — request bubbles flowing right, getting a quality verdict at the judge line, ticking up the cumulative scoreboard. Watch the scoreboard climb past 100 requests and notice that even at high pass-rate, individual failures still show — that's the bucket the offline bench could not have predicted.
Shadow Mode
What is shadow mode?
Shadow mode is running a candidate version side-by-side with production on real live traffic, without exposing the candidate's output to users. Each request still flows to the production model, the user sees the production answer, but the harness also fires the same request at the candidate model and logs its response. A diff metric — how often the two responses agree, how often they semantically diverge, how often they differ on something hazardous — accumulates. You ship the candidate only when the diff metric clears a bar.
Shadow mode is the safest way to evaluate a candidate that the offline bench has already cleared. The bench told you it is not obviously broken; shadow mode tells you it would behave roughly the same as production on the actual distribution your users send. Those are different questions, and both deserve answers.
The mechanics: fork, log, compare
incoming request ──► harness ──┬─► production model ──► response ──► user
│
└─► candidate model ──► response ──► log
│
▼
comparator
│
┌───────────────────────┴───────────────────────┐
▼ ▼
AGREE counter DIFF counter
Three moving parts:
- The fork. The harness duplicates the input and dispatches it to both versions. If the candidate is on a separate compute path it can run truly in parallel; if it is on shared infra you may serialize to avoid resource contention. Either way, the user does not wait — the candidate's response is fire-and-forget.
- The log. Both responses get persisted with the same correlation ID. This is the same span infrastructure from Module 2, with one extra attribute per pair: comparator_outcome. Every diff is a row in your data warehouse.
- The comparator. A judge runs over the (input, prod_response, candidate_response) triple and emits a verdict — agree, semantic diff, hard diff. Comparator design is the part teams underweight; we'll spend most of this step on it.
The user-visible cost is zero. The infrastructure cost is one additional model inference per shadowed request, which is why you sample. The most common shadow rates in production are 1%, 5%, or 25% of traffic — enough to get a statistically meaningful diff sample over a day, not so much that the candidate's compute bill rivals production.
What "agree" means — and how to compute it
The naive comparator is exact-string equality. For an agent it is almost useless, because the same correct answer can be phrased a thousand ways. Three patterns that work:
- Structural diff for tool-calling agents. If the agent's job is to emit tool calls, the comparator checks that both versions called the same tools with semantically equivalent arguments. Strings need not match; the agent's behavior must. This is the most precise diff you can compute and the one most worth investing in.
- LLM-as-judge for prose. A cheap model (or the same model with a judging prompt) scores each pair on a 0–4 rubric: "do these responses convey the same information, lead the user to the same action, and stay inside the same safety boundary?" The OpenAI Evals framework and Anthropic's Building Effective Agents both publish prompts in this shape; treat them as starting points and revise to your domain.
- Embedding cosine + threshold for retrieval-shaped tasks. Embed both responses, threshold the cosine similarity, and bucket. Cheap, fast, and semantically tighter than string match. Less rigorous than LLM-as-judge but usable as a coarse first filter.
The comparator's verdict is itself an evaluation. A wobbly judge produces a wobbly diff, and a 4% disagreement rate that is mostly judge noise is worse than no shadow at all because it teaches the team to ignore the metric. Audit the judge before you trust the metric: sample 100 disagreements at random, have a human grade them, and see whether the judge's "diff" calls actually correspond to user-visible quality changes. If not, the comparator is the bug, not the candidate.
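A minimal sketch of the structural-diff pattern for tool-calling agents. It assumes each response carries its tool calls as a list of {"tool": ..., "args": {...}} dicts; the argument normalization and the mapping from argument changes to "semantic diff" are assumptions you would replace with your domain's own equivalence rules.

```python
def normalize_args(args: dict) -> dict:
    """Collapse cosmetic differences so only semantic argument changes count."""
    out = {}
    for key, value in args.items():
        if value in (None, "", []):          # drop unset or defaulted arguments
            continue
        if isinstance(value, str):
            value = value.strip().lower()
        out[key] = value
    return out

def compare_tool_traces(prod_calls: list[dict], cand_calls: list[dict]) -> str:
    """Return 'agree', 'semantic_diff', or 'hard_diff' for one shadowed pair."""
    prod = [(c["tool"], normalize_args(c.get("args", {}))) for c in prod_calls]
    cand = [(c["tool"], normalize_args(c.get("args", {}))) for c in cand_calls]
    if prod == cand:
        return "agree"
    # Same tools in the same order, different arguments: bucketed as semantic here;
    # stricter domains may treat argument changes as hard diffs.
    if [t for t, _ in prod] == [t for t, _ in cand]:
        return "semantic_diff"
    # Different tools, different order, or one side called a tool the other refused.
    return "hard_diff"
```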
Three diff buckets and what each one means
Once the comparator is sound, the diffs split naturally into three buckets, each with a different action:
- Agree (the supermajority). Both versions produced semantically equivalent answers. Routine. The interesting question is the rate — a 96% agree rate on a small candidate change feels right; a 96% agree rate on a "no-op" prompt edit means something is wrong with the comparator or the prompt edit was not actually a no-op.
- Semantic diff. The two answers differ in style, framing, or phrasing but lead the user to the same action. Often the intent of the candidate change. A new persona prompt should produce a high semantic-diff rate; a bug fix should not.
- Hard diff. The two answers contradict each other, or one calls a tool the other refused, or one produces a number the other doesn't. This is the bucket that gets you fired. A non-zero hard-diff rate is the gate that decides whether the candidate ramps or holds.
A useful rule: pre-register the target diff distribution before you ship the candidate. "This prompt edit should raise semantic diff to ~30% (we're rewording responses) and hold hard diff at 0%." Now the shadow run has a falsifiable hypothesis. Without that pre-registration you will look at the diff numbers and rationalize whichever direction they moved.
When shadow mode is the right tool
Shadow shines on three shapes of change:
- Model upgrades. Swapping claude-opus-4-7-20251215 for the next snapshot, or swapping a 7B model for a 13B. The bench can verify it is not obviously worse; only shadow can verify it produces compatible answers on the real distribution.
- Prompt rewrites of any size. Tightening a system prompt, adding a new tool description, restructuring the few-shot examples. Anything that affects the model's response distribution.
- Tool schema changes. New optional argument, renamed field, tightened JSON schema. Shadow shows whether the candidate handles the new schema as you expect on traffic the bench did not include.
It is poorly suited for changes that should produce different outputs by design — e.g. a new tool that adds a capability, or a prompt that adds a refusal where there used to be a response. For those, the diff rate will be 100% and the shadow signal is uninformative; jump to the A/B harness in Step 3 instead.
Statistical sample size — the part teams skip
A 1% shadow rate over an hour at 10,000 requests/hour gives you 100 paired observations. That is enough to detect roughly a 10-percentage-point change in diff rate at 80% power; below that you are reading tea leaves. The math you want loaded:
For a binary outcome (diff vs agree) with baseline rate p,
detecting an absolute change of Δ at 80% power and α=0.05 needs
roughly n ≈ 16 · p(1-p) / Δ² paired observations.
For a baseline of 5% diff and a target detection of 1% absolute change, that's roughly 7,600 paired observations (16 × 0.05 × 0.95 / 0.01²) — under a day at moderate sample rates, multiple days at low ones. The lever you tune is sample rate vs window length; both increase n. The decision on which to bias depends on whether your traffic mix shifts within the day (long windows blur it; high sample rates within a short window catch it).
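The same rule of thumb as a small helper, so the sample-rate-versus-window tradeoff can be played with directly; it just encodes the n ≈ 16·p(1−p)/Δ² approximation above.

```python
import math

def paired_samples_needed(baseline_rate: float, abs_change: float) -> int:
    """Approximate n for 80% power, alpha = 0.05, binary outcome (diff vs agree)."""
    return math.ceil(16 * baseline_rate * (1 - baseline_rate) / abs_change ** 2)

def hours_to_collect(n: int, requests_per_hour: float, shadow_rate: float) -> float:
    """How long a shadow run needs at a given traffic level and sample rate."""
    return n / (requests_per_hour * shadow_rate)

# Worked example from the text: 5% baseline diff rate, detect a 1pp absolute change.
n = paired_samples_needed(0.05, 0.01)                 # ~7,600 paired observations
print(n, hours_to_collect(n, requests_per_hour=10_000, shadow_rate=0.05))
```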
What invalidates a shadow run
Three structural pitfalls:
- The candidate sees stale state. If the candidate calls tools, those tools mutate state — and a shadow that also runs the tools either double-charges your DB or pollutes the production state. Most teams stub all stateful tools in the shadow path: it gets a recorded response from the production trace, not a fresh call. This is the single most common shadow-mode bug; budget a sprint for it on first deployment.
- Latency-driven divergence. If the production path returns in 2 seconds and the candidate path returns in 8, you cannot pair the responses for a multi-turn conversation — the user has already moved on. Single-turn shadow is straightforward; multi-turn shadow against a stateful agent needs careful design. Most teams shadow only at the single-turn boundary or accept that multi-turn shadow rates are necessarily lower.
- Sampling skew. If the harness shadows only requests with low load (cheap to fork), the diff rate is unrepresentative — you are evaluating the candidate on the easy slice. Sample uniformly across load, segment, and time-of-day. A daily diff dashboard sliced by segment surfaces this fast.
The shadow gate is one of those metrics that breaks silently when the underlying mechanism breaks. Treat the diff metric like any other dashboard and audit it on a cadence — not because the metric will lie, but because the mechanism that feeds the metric can decay without anyone noticing.
How the gate decides "ship"
The decision rule that recurs across teams:
- Hard diff rate < 1% — non-negotiable. A higher hard-diff rate means the candidate is producing materially different answers on real traffic, and the rollout pauses until the team understands why.
- Semantic diff rate within the pre-registered band. The expected band depends on the change. A no-op refactor should be near 0%; a persona rewrite might land at 30%.
- Agree rate trends stable across segments. A 96% overall rate that hides a 70% rate on one user segment is a regression on that segment; the rollout pauses on segment-level slices, not global averages.
A clean shadow run, sustained over enough samples, is the green light to start the A/B ramp in Step 3. A shadow run that lights any of these gates red is a hold; the candidate goes back to the offline bench with the failing diffs added as new test cases.
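The decision rule above is small enough to encode directly. A sketch, with the 1% hard-diff bar, the pre-registered semantic band, and the per-segment agree floor passed in as parameters rather than hard-coded:

```python
def shadow_gate(
    hard_diff_rate: float,
    semantic_diff_rate: float,
    agree_rate_by_segment: dict[str, float],
    semantic_band: tuple[float, float],
    hard_diff_max: float = 0.01,
    segment_agree_min: float = 0.90,
) -> tuple[str, list[str]]:
    """Return ('ramp' | 'hold', reasons) for one completed shadow run."""
    reasons = []
    if hard_diff_rate >= hard_diff_max:
        reasons.append(f"hard diff {hard_diff_rate:.2%} >= {hard_diff_max:.0%}")
    low, high = semantic_band
    if not (low <= semantic_diff_rate <= high):
        reasons.append(f"semantic diff {semantic_diff_rate:.1%} outside pre-registered band")
    for segment, rate in agree_rate_by_segment.items():
        if rate < segment_agree_min:
            reasons.append(f"segment '{segment}' agree rate {rate:.1%} below floor")
    return ("ramp" if not reasons else "hold", reasons)
```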
Try it: Stage 2 of the right pane is a shadow-mode pipeline. Live traffic enters the harness fork, splits into a production path (v1.4 → user) and a shadow path (v1.5 → log only), and both responses meet at a comparator that ticks up AGREE vs DIFF. The metric tiles to the right break the diff rate into Agree rate, Semantic diff, and Hard diff — the three numbers your gate actually keys on. Drag the shadow sampling slider from 1% to 100% and watch the diff sample size grow. Move the agree-threshold dial between 90% / 95% / 99%; the verdict pill at the right flips between "READY TO RAMP" and "HOLD" as the agree rate crosses your threshold. The teaching: the gate is not the candidate's quality score in isolation — it's the diff against production, on real traffic.
A/B Harness
When should you use an A/B harness?
When the candidate is too risky to ship at 100% but you need real-user signal that offline evals and shadow mode cannot give you. Shadow tells you what the candidate would have said; A/B tells you what users actually do with it — completion rate, follow-up rate, escalation rate, conversion. Those metrics are the ones the business pays for, and they are visible only when real users see real answers and act on them.
A/B is the appropriate tool when shadow mode has cleared the candidate (no hard diffs, semantic diff in the pre-registered band) and the team needs to measure user-visible deltas before ramping to 100%. If shadow already produced a clean verdict and the change has no user-visible behavioral surface — say, a refactor that should be a strict no-op — A/B adds traffic risk without adding signal. Skip it. If the change has any user-visible surface, the cost of running an A/B is small compared with the cost of a regression you don't see until weeks later.
The mechanics: split, ramp, measure
Three moving parts:
- The split. The harness routes a fraction of incoming traffic to the treatment version and the rest to the control. The split must be sticky per user (or per session) — bouncing a single user between versions mid-conversation is both confusing and statistically nonsensical. Hash on user ID, take a stable bucket, route accordingly.
- The ramp. A schedule of split percentages. A canonical ramp is 1% → 5% → 25% → 50% → 100%, with each step held long enough to gather a statistically meaningful sample for the pre-registered metric. Each step is gated on the metric being non-regressed before the harness advances.
- The metric. A pre-registered, top-line outcome metric that the rollout decision is keyed on. Pre-registered means written down before the experiment starts. Without that, the metric becomes a moving target — the team picks whichever metric points in the direction they wanted, which is not measurement, it is rationalization.
The whole apparatus is one big control system: split routes traffic, ramp determines exposure, metric reads the result, decision rule advances or rolls back. Each piece has failure modes, and most A/B mistakes come from mixing them up.
Pick the metric before you start
The single most common A/B mistake is metric shopping — running the experiment, seeing fifteen metrics move in different directions, and picking the one that moved in the direction you hoped. This is statistically meaningless. With fifteen independent metrics at α = 0.05, you expect roughly one to look "significant" by pure chance. The more dashboards you stare at, the more confident you become in a result that is noise.
The discipline is to pre-register the metric. Before you start the ramp, the team commits — in writing, in the rollout PR, in the eval doc, somewhere with a timestamp — to one primary metric and a small set of guardrail metrics. The primary is what you are trying to move. The guardrails are what must not regress. Everything else is exploratory and is not allowed to swing the decision.
What makes a good primary metric for an agent rollout:
- It directly reflects user value. Completion rate, escalation rate, time-to-resolution, conversion. Not token count, not cost — those are guardrails, not primaries.
- It moves enough to be measurable. A metric that fluctuates by 3% on weekends needs more samples than one that holds steady to ±0.5%.
- It is hard to game. "Tool calls per request" is gameable by changing the prompt; "user-rated success" is harder.
- It is robust to outliers. Use medians and trimmed means rather than raw averages when latency or session length is the metric — one user who left their session open over the weekend should not move the dial.
A typical agent A/B pre-registration looks like:
PRIMARY: completion rate within 3 turns
GUARDRAILS: p95 latency ≤ control + 200ms · cost per task ≤ control × 1.10
RAMP: 1% (24h) → 5% (48h) → 25% (72h) → 50% (72h) → 100%
DECISION RULE: at each ramp step, primary is non-regressed at 95% CI AND
both guardrails are within bounds. Otherwise: hold or roll back.
Once this is in the PR, the experiment has a falsifiable shape. The team is allowed to look at exploratory metrics for understanding, but the decision is bound to the pre-registered ones.
Sample size — the part the gut gets wrong
The arithmetic that sets honest A/B expectations:
For a binary primary (e.g. completion: yes/no) with baseline rate p,
detecting an absolute change Δ at 80% power and α=0.05 needs roughly:
n_per_arm ≈ 16 · p(1-p) / Δ²
For a baseline of 80% completion and a target detection of 2 absolute percentage points:
n ≈ 16 · 0.8 · 0.2 / 0.02² = 16 · 0.16 / 0.0004 = 6,400 per arm
That is per arm — both the control and treatment need 6,400 samples each. At a 5% treatment ramp on 100,000 requests/day, the treatment arm gets 5,000 requests/day; you need a day and a half of clean data before you have a defensible read. The control will have collected far more, but the treatment is the bottleneck.
Most teams who claim to be running A/B at 1% are not in fact getting a usable signal at any reasonable cadence. Either ramp faster, accept a noisier read, or be explicit that the 1% step is a smoke test for hard regressions, not a quality signal.
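A sketch of the read at each ramp step: a normal-approximation 95% CI on the treatment-minus-control completion-rate delta, checked against a non-regression margin. The margin and the pass rule mirror the pre-registration block above; a real analysis might prefer an exact or sequential test, so treat this as the shape of the check, not the final statistics.

```python
import math

def delta_ci(successes_t: int, n_t: int, successes_c: int, n_c: int, z: float = 1.96):
    """95% CI (normal approximation) on p_treatment - p_control."""
    p_t, p_c = successes_t / n_t, successes_c / n_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    delta = p_t - p_c
    return delta, (delta - z * se, delta + z * se)

def primary_non_regressed(ci_low: float, margin: float = -0.005) -> bool:
    """Pass if the worst plausible delta is no worse than the margin (-0.5pp here)."""
    return ci_low >= margin

# Illustrative counts: 20,000 samples per arm, treatment completion slightly up.
delta, (low, high) = delta_ci(16_150, 20_000, 16_000, 20_000)
print(f"delta={delta:+.2%}, 95% CI [{low:+.2%}, {high:+.2%}], pass={primary_non_regressed(low)}")
```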
The simulation's "significance" bar fills as the treatment sample size grows: below ~500 it stays grey ("inconclusive"); 500–2,000 fills amber ("weak signal"); above ~2,000 fills green ("decision ready"). Those thresholds are illustrative cues for signal readiness, not a substitute for the per-effect-size calculation above — a 2pp detection at this baseline still needs the full 6,400 per arm before the result is rigorous. The qualitative shape — very small samples are decoration, not data — is universal.
The ramp schedule is the safety mechanism
Why ramp at all? Why not split 50/50 from day one and be done in a day?
Because the cost of a bad treatment scales with exposure. At 1% exposure, a 10× regression on the primary metric affects 0.1% of total traffic — recoverable. At 50% exposure, the same regression affects 5%, and your support inbox notices before your dashboard does. The ramp keeps the blast radius bounded while the eval signal accumulates.
A workable ramp design:
- Step 1 (1%, 24h): smoke test for catastrophic regressions. Looking for the candidate to not crash, not refuse everything, not produce hard-diff outputs that the shadow phase missed. Sample size is too small for primary metric reads; the gate here is "no obvious breakage."
- Step 2 (5%, 48h): weak signal on primary metric. Will catch a 5%+ absolute regression at decent confidence. If primary is regressed beyond the band, hold or roll back.
- Step 3 (25%, 72h): strong signal on primary. Will catch a 1–2% absolute regression. The decision step that authorizes the bigger ramp.
- Step 4 (50%, 72h): confirms behavior at scale, exposes any contention or rate-limit issues that 25% didn't surface.
- Step 5 (100%): full rollout, treatment becomes the new control. Old version stays warm in the rollout system for one-command rollback.
Each step is its own gate with its own decision. Skipping a step is a choice the team should make explicitly — sometimes appropriate (a strict no-op refactor), often not (a model swap).
Sticky assignment, segment slicing, and the clean-vs-cohort question
Two structural details that decide whether the A/B is meaningful:
Sticky assignment. A user routed to treatment in turn 1 must stay in treatment for turns 2 and 3 of the same conversation. Otherwise the user experiences both versions in the same conversation, the responses are inconsistent, and the data is contaminated for both arms. Hash by stable user ID or session ID, not by request ID.
Segment slicing. The aggregate primary metric can hide segment-level regressions: treatment is +1% on the global metric, +3% on power users, and -4% on first-time users. A rollout that ships on the global number ships a regression on first-time users. Slice the primary metric by the segments that matter to your business — region, plan tier, user tenure — and require the primary to be non-regressed per segment before advancing.
The cost of slicing is sample size. If you require non-regression on five segments, each segment needs its own sample, and the bottleneck segment gates the ramp. Most teams pick two or three segments that they are unwilling to regress on and accept aggregate signals for the rest.
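A minimal sketch of the sticky-assignment piece: hash a stable user or session ID into a bucket in [0, 1) and route against the current ramp percentage. The experiment name acts as a salt so the same users are not always first into every treatment; the function names are illustrative.

```python
import hashlib

def assignment_bucket(user_id: str, experiment: str) -> float:
    """Deterministic bucket in [0, 1): same user, same experiment, same bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def arm_for(user_id: str, experiment: str, treatment_fraction: float) -> str:
    """Route a user to 'treatment' or 'control', sticky across requests and turns."""
    if assignment_bucket(user_id, experiment) < treatment_fraction:
        return "treatment"
    return "control"

# Ramping from 5% to 25% only adds users; nobody already in treatment falls out,
# because each user's bucket is fixed and only the threshold moves.
print(arm_for("user-42", "v1_5_rollout", 0.05))
```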
When the A/B disagrees with shadow
A common pattern: shadow mode said the diff metrics were clean, A/B says completion rate is down 1%. What now?
Three usual culprits, in order of frequency:
- The shadow comparator was lenient. It judged "agree" on responses that were structurally similar but functionally different. Audit the comparator with human graders against an A/B disagreement sample.
- The user behaves differently than the shadow predicted. The candidate's response is technically equivalent but practically harder to act on — longer, more verbose, missing a useful affordance. Shadow scored the response in isolation; A/B scored what the user did with it.
- The metric the A/B is measuring is downstream of something the shadow doesn't see. Cost per task, p95 latency, completion across multiple turns. Shadow can't measure cross-turn effects; A/B can.
The right response is to add the failing-A/B examples back into the offline bench, augment the comparator's rubric with the missed dimension, and re-run the candidate through the bench → shadow → A/B pipeline with the strengthened gates. This is exactly the loop Step 5's eval-driven rollout codifies.
What can go wrong even when the math is right
A few patterns that cost teams real money:
- Looking at metrics during the ramp. "We've seen treatment up 0.5% — let's accelerate the ramp." This is the same statistical sin as metric-shopping in the time dimension. Pre-register the schedule and stick to it; only halt early on guardrail breaches, never advance early on primary movement.
- Stopping early on a positive read. A treatment that is +1.5% at 5% ramp can be 0% at 25% ramp once the sample includes more diverse users. The early read is the noisier read. Ramp the full schedule before declaring success.
- Using A/B to compare three or four versions at once. With three arms and three pairwise comparisons, your effective α inflates and you over-report significance. Multi-arm tests are a real technique with real correction (Bonferroni, FDR), but the discipline is rare and the right default is binary.
- A/B for changes that should produce different outputs. A new tool, a new refusal policy, a deliberate persona shift. The A/B will show a metric delta, but the delta is by design — not a quality signal. Use shadow for the diff exploration and A/B only for the user-acceptance dimension.
A/B is a sharp tool. Run sparingly, run rigorously, and let the schedule make decisions the team is tempted to make from gut feeling.
Try it: Stage 3 of the right pane is an A/B harness in motion. The incoming traffic stream hits a split node (showing the current ramp %) and forks into a Control lane (INDIGO bubbles, v1.4) and a Treatment lane (BLUE bubbles, v1.5), each accumulating in a metric panel — completion rate, p95 latency, cost/task, sample size, with ±CI bars that visibly shrink as samples grow. Move the ramp slider from 1% to 100% and watch treatment density and CI tighten together. Toggle "Pre-registered metric" off and watch all three metrics flash with a "metric shopping" warning — that's what looking-at-everything looks like in graphical form. Watch the significance bar crawl from grey ("inconclusive") through amber ("weak signal") to green ("decision ready") as treatment sample size accumulates.
Drift Detection
What is drift detection?
Drift detection is monitoring for distribution changes in your inputs (the things users ask) or your outputs (the things your model says back) — and alerting when those distributions move enough that yesterday's evals do not necessarily describe today's behavior. The bench you built two months ago graded a candidate on what users were asking two months ago. If the user mix has shifted, the bench is now grading the wrong question, and the gate it provides is increasingly fictitious.
Both directions matter and they mean different things. Input drift says the world changed — your users started asking different questions, in different proportions, with different framings. Output drift says your system changed — same users, same questions, but your model started responding differently. Both signals demand action; the action is usually different in each case.
Why agents drift more than classical ML
Classical ML models are static once trained — only the input distribution moves. Agents have at least three drift surfaces:
- Input drift. New user segment onboarded, marketing campaign brings a different demographic, competitor outage routes their users to you, seasonal topic shift. These are the same drifts a sentiment classifier would face.
- Output drift from your own changes. A prompt edit, a tool schema change, a new few-shot example. Tracked by deployment metadata; visible the second the change ships.
- Output drift from upstream changes you don't control. Provider model snapshots auto-roll forward, a tool downstream returns slightly different formats, a vector index re-indexes overnight. These are the silent ones. The model "version" your dashboard reads says the same string; the behavior moves anyway.
The third bucket is the one drift detection earns its rent on. Without it you find out from the support inbox; with it you find out within the day, often within the hour, and you can roll back the upstream cause before the user-facing metric reflects it.
Measuring drift: PSI as the workhorse
The metric that has dominated production drift detection for two decades, and continues to in agent-shaped systems, is Population Stability Index (PSI). It compares two distributions across the same set of bins and emits one number that is intuitive once you've seen it twice.
PSI = Σ over bins of (current_pct − baseline_pct) × ln(current_pct / baseline_pct)
Rule of thumb:
PSI < 0.10 ─► no significant drift
PSI 0.10–0.25 ─► moderate drift, investigate
PSI > 0.25 ─► significant drift, re-run evals or roll back
The arithmetic does what it sounds like — bins where the proportion grew or shrank contribute positive values; bins that stayed flat contribute zero. The logarithm penalizes large relative shifts more than small absolute ones, which matches operational intuition: a small bin that doubles in size is a bigger story than a large bin that grew 5%.
The thresholds (0.10, 0.25) are conventions, not laws — your tolerance depends on how stable your users were before, how expensive a false alarm is, and how reactive you can afford to be. Most teams calibrate the thresholds against their own historical data: bin a year of traffic into weekly buckets, compute PSI between adjacent weeks, look at the distribution, set thresholds at the 95th and 99th percentile of normal week-over-week movement.
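The PSI arithmetic in code, with a small epsilon so an empty bin does not blow up the logarithm; bin construction, the hard part as the next two sections discuss, is left to the caller. The example counts are made up to land in the "investigate" band.

```python
import math

def psi(baseline_counts: dict[str, int], current_counts: dict[str, int], eps: float = 1e-4) -> float:
    """Population Stability Index between two binned distributions."""
    bins = set(baseline_counts) | set(current_counts)
    base_total = sum(baseline_counts.values()) or 1
    cur_total = sum(current_counts.values()) or 1
    total = 0.0
    for b in bins:
        base_pct = max(baseline_counts.get(b, 0) / base_total, eps)
        cur_pct = max(current_counts.get(b, 0) / cur_total, eps)
        total += (cur_pct - base_pct) * math.log(cur_pct / base_pct)
    return total

baseline = {"billing": 400, "tech_support": 300, "refund": 200, "other": 100}
current  = {"billing": 380, "tech_support": 290, "refund": 300, "other": 30}
print(f"input PSI = {psi(baseline, current):.3f}")   # ~0.13, in the 'investigate' band
```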
What to bin — input side
The trick with PSI is choosing bins that are stable, interpretable, and load-bearing for your agent's behavior. For inputs:
- Intent categories. Cluster historical inputs into a fixed taxonomy (billing, tech support, refund, feature ask, complaint, lookup, …). Classify each new input against the taxonomy and bin. The taxonomy itself drifts over time and needs periodic re-clustering — that re-clustering cadence is itself an operational decision.
- Embedding clusters. Embed each input, assign it to a fixed set of k-means clusters trained on the baseline period, bin by closest cluster. More automatic than a hand-curated taxonomy; harder to explain when a bar moves.
- Surface features. Input length quartiles, presence of code blocks, language ID, sentiment score. Cheap to compute, less semantically meaningful, but a surprising amount of operational drift shows up here first.
- Tool category requested. If the agent decides which tools to call, the requested tool distribution is itself an input proxy — a shift toward the "refund" tool means the world is asking about refunds more, regardless of phrasing.
Most teams bin on two or three of these and surface them as separate PSI dashboards. A single composite PSI is too coarse; per-dimension PSI tells you what moved.
What to bin — output side
The output side is where drift detection becomes uniquely valuable for agents, because output drift is the early signal for a regression. Useful binning:
- Response category. Did the agent answer, ask a clarification, refuse, call a tool, apologize, escalate, or flag a hallucination? Bin every response by the highest-level type. A spike in "refused" or "apologized" without a corresponding input change is a regression signal that often precedes user complaints by hours.
- Tool-call patterns. Which tools got called, in what order, with what arguments. A drift toward "no tool called" when historically the agent called 2.3 tools per task is a behavior change worth investigating.
- Length and complexity. Output token count, number of paragraphs, presence of code blocks. A model upgrade often shows up here first — the new model is more verbose by 30% on the same inputs.
- Sentiment / tone bins. Apologetic, confident, hedged, declarative. Tone drift after a prompt edit is intentional; tone drift without a deployment is a flag.
Reading the four-quadrant matrix
Once you have input PSI and output PSI on separate dashboards, the joint reading is one of four quadrants:
| | Output PSI low | Output PSI high |
|---|---|---|
| Input PSI low | Steady state. Trust your evals. | Your system changed without an input shift. Roll back recent changes; check provider model snapshot. |
| Input PSI high | Users changed; system absorbed it. Re-run evals on the new mix to verify. | Both moved. Disambiguate by holding inputs fixed (replay a sample through the candidate); see Step 2 of Module 7. |
The diagonals are the boring quadrants. The off-diagonals are the storytelling ones. The system-changed quadrant (input flat, output drifting) is the one that catches silent regressions; the world-changed quadrant (input drifting, output flat) catches the moment your evals stopped describing your traffic.
Reference period: the question you have to answer first
Drift is a comparison, and every comparison needs a baseline. The choices, with their tradeoffs:
- Fixed reference period. The baseline is a snapshot — the week before your last big release, the month after the agent went GA. Drifts are relative to that fixed point. Simple, stable; makes long-running drift visible. Risks treating intentional product evolution as drift.
- Rolling reference window. The baseline is the last N days, sliding forward each day. Drifts are relative to recent normal. Adapts to slow change automatically; misses gradual drift that accumulates within the window.
- Multiple references. Compute PSI against both fixed and rolling, surface both, alert on either. Most production teams end up here — fixed catches creep, rolling catches sudden change.
The right default depends on the rate at which your business is changing. A startup whose user base doubles every quarter wants a rolling baseline; a bank whose customer mix changes glacially wants a fixed one. Pick consciously, not by default.
What to do when the gauge lights up
A PSI alert is a flag, not a verdict. The on-call routine for an agent fleet, distilled:
- Confirm the drift is real. Look at the contributing bins. Is one bar 3× its baseline, or is the metric being pushed up by noise across all bars? Real drifts have a story; statistical noise has a uniform smear.
- Decide which side moved. Input PSI alone, output PSI alone, or both? Use the matrix above to scope the action.
- Re-run the offline bench against current traffic. Sample 200 recent requests, run them through both prod and candidate (or just prod, if no candidate is in flight), and check the bench score against the historical bench score. If the bench score on current traffic is materially lower than on the original golden corpus, the bench is now scoring on a different distribution — augment it.
- Check upstream metadata. Did your provider ship a model snapshot? Did a tool downstream change response format? Did your retrieval index re-index? These are the cheap diagnoses.
- Decide: roll back, hold, or accept. Roll back a recent deploy if output drift correlates with the deploy timestamp. Hold a candidate ramp if drift signals a hot environment. Accept the new normal and update the baseline if the drift is a deliberate product change.
The discipline is not "respond to every alert." It is "have a defined response for each quadrant of the matrix, and execute it." A drift dashboard with no playbook is a dashboard nobody looks at by week three.
What drift detection is not
Two clarifications worth making explicit:
- Drift is not a quality signal. A high PSI does not mean the agent is performing worse; it means the agent is performing on different inputs than yesterday. Quality is the eval suite's job. Drift is the trigger that says "the eval suite needs to re-run."
- Drift detection is not a substitute for online evals. Drift tells you the distribution moved; online evals tell you the quality moved. Both need to be running, and an alert on one should trigger a check on the other.
The combination is what makes the loop close. Drift catches the signal that something changed; online evals quantify whether that change moved quality; offline evals provide the gate before the next candidate ramps.
Try it: Stage 4 of the right pane is two PSI histograms side-by-side — Input (8 bars: billing, tech support, refund, lookup, feature ask, complaint, praise, other) and Output (8 bars: answered, asked-clarify, refused, tool-called, apologized, escalated, hallucinated-flag, other). Both start at PSI 0.00 (baseline = current — perfectly steady). Click "+ New user segment": input bars rebalance, input PSI jumps into the WARN/DANGER band, output stays flat — the world changed. Click "+ Model self-update (silent)": output bars rebalance, output PSI jumps, input stays flat — your system changed. Click "+ Data poisoning": output's "hallucinated-flag" bar spikes. The colors of the PSI numbers (SAFE / WARN / DANGER) match the standard 0.10 / 0.25 thresholds.
Eval-Driven Rollout
What is eval-driven rollout?
Eval-driven rollout is shipping a new prompt, model, or tool only when a defined sequence of eval gates passes their pre-registered thresholds. The pipeline reads the eval result and either advances the rollout one step or halts it; humans approve the thresholds in advance, not the individual deployments. Each gate has a specific failure mode and a specific question; together they form a chain that catches the regressions Steps 1 through 4 each detect in isolation.
The alternative — let humans look at metrics and decide whether to ramp — sounds responsible. In practice it is the most common source of metric-shopping rationalization in agent operations. The team ships, looks at the dashboard, finds the metric that moved in the right direction, and ramps. Eval-driven rollout removes that judgment moment by writing the decision rule down before the experiment starts; the human role is to set the rule, audit the rule, and update the rule when it learns something.
The six gates that compose a defensible rollout
PR open ─► CI: types + unit ─► Offline evals: golden set ─►
│
Shadow: 24h ─► Canary: 5% A/B ─► Ramp: 100%
│
▼
Old version warm
for one-command rollback
Each gate is thin on its own. Together they catch nearly everything; alone, none catches enough.
Gate 1 — PR open. A change against the prompt, tool schema, or model selector lands in code review. The PR template requires the author to declare what changed, what bench they expect to move, and the threshold for success. This is the pre-registration moment for everything downstream.
Gate 2 — CI: types + unit. The candidate compiles, schema migrations are valid, and any deterministic tools the agent calls have unit-test coverage. Cheap, fast, catches the typos that would otherwise embarrass you in production.
Gate 3 — Offline evals: golden set. The candidate runs against the bench from Step 1. The PR is gated on: zero hard regressions on critical cases, total bench score within X% of baseline, no novel safety failures. This is where the bulk of caught regressions die — most "this prompt edit doesn't ship" decisions happen here.
Gate 4 — Shadow: 24h (or to N samples). The candidate runs in shadow mode against live traffic per Step 2 until the diff sample size hits the pre-registered N. Hard diff < 1%, semantic diff inside the band, segment-level diffs all clean. This is the gate that catches the regressions your bench didn't anticipate.
Gate 5 — Canary: 5% A/B (and the ramp steps after). The candidate goes to 5% of users per Step 3's harness. Pre-registered primary metric is non-regressed at 95% CI; guardrail metrics within bounds. If clean, ramp to 25%, 50%, 100% by the schedule. If not, halt.
Gate 6 — Ramp: 100%. Treatment becomes the new control. The previous version stays warm in the rollout system for one-command rollback (a discipline the next module covers). Eval gates 3 and 4 reset their baselines for the next candidate.
A clean candidate that passes 1 → 6 has cleared deterministic checks, distribution checks, behavioral checks against real traffic, and user-visible quality checks at sub-fleet exposure before a single user at scale sees it. That is the bar.
What a "threshold" actually looks like
The most useful single discipline is to write thresholds in code, not in heads. Three concrete shapes:
Bench threshold (Gate 3):
gates:
  offline_bench:
    bench_id: golden-v3
    pass_if:
      - critical_cases_passing: 100%
      - total_score: ">= baseline - 1.5pp"
      - safety_failures: 0
      - new_regressions: "<= 2"
Shadow threshold (Gate 4):
gates:
  shadow:
    sample_n: 5000
    pass_if:
      - hard_diff_rate: "< 1%"
      - semantic_diff_rate: "inside_band(0%, 35%)"
      - per_segment_agree_rate: ">= 90%"
Canary threshold (Gate 5):
gates:
  canary_5pct:
    duration_min_hours: 24
    sample_n_per_arm_min: 5000
    pass_if:
      - completion_rate_delta: ">= -0.5pp at 95% CI"
      - p95_latency: "<= control + 200ms"
      - cost_per_task: "<= control * 1.10"
A change to any of these thresholds is itself a PR — reviewed, dated, attributed. Lowering a threshold to make a candidate pass is a smell that the team should discuss out loud, in the PR, before merging. Most regressions ship through that conversation, not through an actual metric breach.
Quality threshold: the dial that reveals the tradeoff
The quality threshold at Gate 3 is the single most consequential parameter in the whole pipeline, and it is the one teams underweight. Set it too loose, and regressions slip through to gates 4 and 5 — caught, but expensive in shadow time and canary risk. Set it too tight, and clean candidates get blocked, ship velocity drops, and engineers learn to game the bench rather than the underlying agent quality.
There is no universal right value. The honest calibration is empirical:
- Replay the last 12 months of production incidents through the bench. For each incident, ask: would the bench at threshold T have caught it?
- Replay the last 12 months of clean candidates through the bench. For each, ask: would threshold T have blocked it?
- Plot caught-regressions / blocked-clean-ones across thresholds. The knee of that curve is your starting threshold.
- Re-calibrate quarterly as the bench grows.
In our experience, teams typically converge on a threshold where the bench catches the majority — perhaps 70–90% — of would-be regressions while blocking only a small fraction of clean candidates. The exact split depends on bench quality and domain; what matters is that the residual regressions are exactly what gates 4 and 5 are for. The system is layered for a reason.
Auto-rollback is not optional
A rollout pipeline that can ship is incomplete; a rollout pipeline that can roll back is operational. Two patterns:
- Gate-failure rollback. When a downstream gate (Gate 5's primary metric, or any guardrail) trips its threshold, the pipeline automatically rolls the rollout back to the previous version without paging a human first. The page comes after the rollback, with the diagnostic context attached. The discipline: a regression caught at Gate 5 should never be a 30-minute on-call debug; it should be a 30-second auto-revert and a postmortem.
- Manual one-command rollback. A human can revert the active version to any previous version with one command — and the previous version is kept warm precisely so this command is fast. Vercel's Rolling Releases (GA June 2025) and similar deploy systems make this a first-class primitive; the harness should integrate with whatever deploy primitive your platform offers rather than rolling your own.
The combination — auto-rollback on quantitative gate breach, manual one-command rollback for everything else — is what lets the team ramp aggressively without panic. Drill the rollback in a non-production environment monthly so the muscle memory exists when you need it.
What to do when a gate fails
The pipeline halting is the system working. The team's job is to understand why it halted and decide what changes:
- Gate 3 fail. The candidate regressed against the bench. Add the failing cases as bench rows that must pass before the next candidate ships. If the failures are out-of-distribution from the bench, the candidate may be making a tradeoff the team will accept — but the conversation has to be explicit, in the PR, with a documented exception.
- Gate 4 fail. Shadow saw diffs the bench did not anticipate. Triage the diff sample, add the diff cases to the bench, augment the comparator if it was lenient. Re-run from Gate 3.
- Gate 5 fail (primary regressed). The candidate produces user-visible regressions the bench and shadow did not predict. This is the most expensive failure to debug. Pull a sample of failed requests, replay them through the offline harness to confirm the existing bench would not have caught them (that is what out-of-distribution means here), and feed them back as new bench rows.
- Gate 5 fail (guardrail breached). The primary may be flat or up, but cost is up 15% or p95 latency is up 400ms. The candidate is not the right shape; the team revisits the change.
In all four cases the bench grows. The bench is a living artifact and the gate-failure feedback loop is what keeps it relevant.
The change-set is one unit
A subtle discipline that is worth naming: the prompt, the model version, and the tool schemas ship as a single versioned unit. Rolling back to "v1.4" means rolling back all three together; allowing them to diverge is how you end up with a v1.5 prompt running against v1.4 tool schemas, which is a configuration the bench never saw and shadow never tested. Most production agent platforms enforce this at the artifact level — a deploy bundle includes prompt + model_id + tool_set together, and the rollout primitive promotes or reverts the whole bundle atomically.
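A sketch of the bundle discipline: prompt, model, and tool schemas travel as one versioned, immutable artifact, and promote/rollback operate only at the bundle level. The field names and the in-memory registry are illustrative, not a specific platform's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeployBundle:
    version: str           # e.g. "v1.5"
    prompt_sha: str        # content hash of the system prompt and few-shot examples
    model_id: str          # pinned model snapshot, never "latest"
    tool_set_version: str  # schema version for every tool the agent can call

REGISTRY: dict[str, DeployBundle] = {}
ACTIVE: dict[str, str] = {}   # environment -> active bundle version

def promote(env: str, bundle: DeployBundle) -> None:
    """Atomically make a whole bundle the active version for an environment."""
    REGISTRY[bundle.version] = bundle
    ACTIVE[env] = bundle.version

def rollback(env: str, to_version: str) -> DeployBundle:
    """One-command rollback: flip the pointer back to a previous, still-warm bundle."""
    ACTIVE[env] = to_version
    return REGISTRY[to_version]
```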
What the on-call sees
When the pipeline is operational, the on-call view is short and structured:
- Currently rolling: candidate ID, current ramp %, current gate, time remaining at this gate
- Last 5 candidates: each with end state (shipped, rolled back, halted at gate N) and reason
- Active gate metrics: live dashboard of the metric values vs threshold for the current gate
- One-click actions: pause ramp, manual roll-back, manual ramp-step
This is the view that lets the team sleep through small candidates and wake up only when something interesting halts. The eval-driven part is the on-call enabler — automation handles the routine; humans handle the exceptions and the threshold tuning.
Where the next module picks up
This module covered the evaluation substrate — what to measure, how to measure it, and what gates the measurements feed. Module 6 (Deployment & Rollout) covers the deployment substrate that the gates control: prompt-as-code discipline, canary mechanics, version pinning, and the rollback discipline that turns a halted gate into a recovered fleet. The two modules form a pair; eval-driven rollout is incomplete without a deployment system that can execute the verdict.
Try it: Stage 5 of the right pane is the 6-gate pipeline laid out left-to-right. Click "Clean candidate" and watch all six gates light SAFE in sequence. Click "Regressed candidate" and watch the pipeline halt at Gate 3 with a regression count above the threshold. Click "Slow-burn candidate" and watch it pass Gates 1–4, then fail at Gate 5 (canary regresses on completion). Drag the quality threshold loose and the regressed candidate slips past Gate 3 — a "regressions slipped through" counter ticks up next to the slider, the cost of being too permissive. Toggle Auto-rollback and watch a failed gate immediately drop the rollout indicator back to v1.4.
Further Reading
- Anthropic — Building Effective Agents — the eval-and-iterate loop that underpins their internal agent work.
- OpenAI — Evals framework — primitives for offline benches and LLM-as-judge.
- Vercel — Rolling Releases — first-class deployment primitive (GA June 2025) for the rollout substrate.
- Eugene Yan — Building reliable LLM systems through evals — practitioner's guide to eval design for production LLM systems.
- Hamel Husain — Field-tested LLM eval lessons — concrete patterns from running evals on real product traffic.