
AI Agent Evals & Diagnostics

Compounding Errors

What are compounding errors in agent systems?

Compounding errors are what happens when a model that succeeds at one step with probability p is run for N steps in sequence: the end-to-end success rate is roughly p^N, not p. A model that gets one tool call right 95% of the time gets ten tool calls right only about 60% of the time, because the per-step success probabilities multiply and the agent has to be right at every step to finish correctly. This is the basic reason why per-step accuracy is the wrong metric to ship on, and why Anthropic's Building Effective Agents opens its eval discussion with the same arithmetic.

The independence assumption is not strictly true — failures do correlate, since a model that misreads the task at step one is more likely to misread it at step two — but for tool-call agents where each step queries a different system, the p^N approximation is close enough to be the right mental model. Multiple practitioners writing about retrieval and tool-use agents make the same observation: errors do not average out across steps; they accumulate.

Why per-step accuracy is the wrong metric

Most evals report per-step numbers because per-step numbers are what individual benchmarks measure. A retrieval benchmark scores recall@k on a single query. A tool-use benchmark scores the rate of correctly-formed function calls on a single turn. Both are useful, but neither tells you what an agent that chains five of those steps together will actually deliver to a user.

Consider a five-step booking agent that has to: parse the request, check both calendars, find overlap, draft the invite, and send it. If each step is independent at 95% accuracy, the end-to-end success rate is:

| Steps (N) | Per-step accuracy | End-to-end success |
|-----------|-------------------|--------------------|
| 1         | 95%               | 95%                |
| 5         | 95%               | ~77%               |
| 10        | 95%               | ~60%               |
| 20        | 95%               | ~36%               |
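A few lines of Python make the arithmetic explicit. This is a minimal sketch, assuming each step succeeds independently; the function name is ours, not from any library.

```python
def end_to_end_success(per_step_accuracy: float, n_steps: int) -> float:
    """End-to-end success rate assuming each step succeeds independently."""
    return per_step_accuracy ** n_steps

for n in (1, 5, 10, 20):
    rate = end_to_end_success(0.95, n)
    print(f"N={n:>2}  per-step=95%  end-to-end={rate:.0%}")
# N= 1  per-step=95%  end-to-end=95%
# N= 5  per-step=95%  end-to-end=77%
# N=10  per-step=95%  end-to-end=60%
# N=20  per-step=95%  end-to-end=36%
```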

The same model, the same per-step quality — but the production story is wildly different at N=5 versus N=20. A demo that passes at one step looks great. The same model running a 20-step research loop fails roughly two-thirds of the time, and no per-step benchmark would have warned you.

The Stage 1 visualization in the right pane plots the curve directly: pick a per-step accuracy with the slider and watch end-to-end success drop as N grows. The shape is the lesson — the curve falls fastest in the regime where most production agents actually live (5–15 steps).

What compounding does to the eval question

Once you accept that end-to-end is the only metric that matches user experience, the next question shifts. It is no longer "how often does the model get this step right?" — it is "which step is the agent actually dropping on?" You cannot answer that from aggregate scores. You have to look at the traces.

That is the entry point for error analysis. The compounding math tells you the agent will fail; error analysis tells you where and why.

Try it: In Stage 1 of the right pane, slide per-step accuracy from 99% down to 90% and watch how the end-to-end curve at N=10 collapses. A model that drops just five percentage points of per-step accuracy can lose 30+ points of end-to-end success. The compounding is not subtle.

The next step works through the discipline that turns trace-watching into an actionable signal: error analysis first, evals second.

Error Analysis First

What is error analysis for AI agents?

Error analysis is the practice of reading a sample of real agent traces, labeling what failed and why, and clustering the failures into a small number of named categories — before writing any evals. The point is to ground the eval suite in the failure modes the agent actually exhibits, not the failure modes you imagine it might exhibit. The discipline was popularized for LLM applications by Hamel Husain and Shreya Shankar, whose joint workshops and field-tested writeups push the same line: error analysis first, evals second.

The order matters because it inverts the default workflow. The default is to brainstorm a list of plausible checks, write them as evals, run them on the agent, and report the score. The score goes up over time, and nobody notices that the agent's biggest real-world failure mode — say, silently truncating long search results — was never one of the checks. Error analysis prevents that by making the eval suite a response to observed evidence, not a guess.

Why error analysis comes before evals

Hamel Husain's field-tested AI evals course and Shreya Shankar's research on judge calibration both make the same case: an eval suite written from first principles tends to over-test the things that are easy to imagine and under-test the things that actually break. A 30-minute review of 20–50 real traces nearly always surfaces failure patterns that no a-priori checklist would have caught.

The workflow looks like this:

  1. Sample 20–50 real traces from the agent's actual usage — not synthetic, not curated.
  2. Read each trace end-to-end and write a short freeform note about what (if anything) went wrong.
  3. Cluster the notes into 3–7 named categories. The categories are your failure taxonomy.
  4. Now write evals — one targeting each category, sized to match how often that category fires.

The output of step 3 is the artifact. It is a small list — usually a single Notion or markdown page — that names the agent's actual failure modes in the user's words. Every subsequent eval decision flows from it.
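A minimal sketch of the bookkeeping, assuming the traces have already been read by a human. The trace IDs, notes, and category names below are hypothetical examples; the clustering in step 3 is a human judgment that the code merely records and tallies.

```python
from collections import Counter

# Step 2: hypothetical freeform notes written while reading 20-50 real traces.
notes = [
    {"trace_id": "t-014", "note": "truncated long search results, answered from partial context"},
    {"trace_id": "t-022", "note": "retrieved the right docs but never cited them"},
    {"trace_id": "t-031", "note": "abandoned its plan after the first tool error"},
    # ... one note per trace
]

# Step 3: cluster the notes into a small named taxonomy (a human decision, recorded here).
taxonomy = {
    "t-014": "silent-truncation",
    "t-022": "missing-citation",
    "t-031": "plan-abandonment",
}

# The artifact: category counts, which size the eval suite in step 4.
counts = Counter(taxonomy.values())
for category, count in counts.most_common():
    print(f"{category}: {count} trace(s)")
```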

The transition failure matrix

A useful diagnostic to layer on top of the trace review is the transition failure matrix. The rows are the last successful state the agent reached; the columns are the first failure state immediately after. For each failed trace, mark the cell. After 30–50 traces the hot cells point to the most-fixable transitions — typically things like "agent successfully retrieved documents → agent failed to cite them" or "agent successfully drafted plan → agent abandoned the plan after first tool error".

The matrix is more useful than a flat list because it surfaces where in the loop the failures cluster, which usually corresponds to a specific prompt section, tool definition, or stopping rule that needs work. Two agents with identical overall failure rates but different hot cells need different fixes.
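A sketch of how the matrix can be accumulated from labeled failures. The state names are illustrative, not a fixed vocabulary; any pair of "last success" and "first failure" labels from your own taxonomy works.

```python
from collections import Counter

# Each failed trace contributes one (last successful state, first failure state) pair.
failed_traces = [
    ("retrieved_documents", "failed_to_cite"),
    ("drafted_plan", "abandoned_after_tool_error"),
    ("retrieved_documents", "failed_to_cite"),
]

matrix = Counter(failed_traces)

# Hot cells = the transitions where failures cluster.
for (last_success, first_failure), count in matrix.most_common():
    print(f"{last_success} -> {first_failure}: {count}")
```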

The Stage 2 visualization in the right pane lets you label twelve example traces and watch the matrix fill in. As cells light up, the actionable failure modes become visible in a way that a flat accuracy number never shows.

How many traces is enough?

The widely-cited rule of thumb is 20–50 traces for a first pass of error analysis on a working agent. Hamel Husain's writing puts the practical floor around 20: fewer than that and the categories you derive are dominated by noise. The ceiling depends on failure rate — if your agent fails on 5% of traces, you need a much larger sample to see enough failures to cluster meaningfully. Anthropic's Demystifying evaluations for AI agents makes the same point in different words: the right sample size is whatever it takes to see each failure mode at least three or four times, which for low-failure-rate agents can mean reading hundreds of traces.

Two practical considerations worth naming:

  • Read the successes too. A trace that "succeeded" but used 12 tool calls when 3 would have done is a soft failure your accuracy metric will not catch.
  • Re-run error analysis after every meaningful change. New prompt versions and new tool additions create new failure modes; the taxonomy decays.

Once the failure modes are named, the natural next move is to build the test set that exercises them. That is the role of golden cases.

Golden Cases

What are golden cases?

A golden case is a hand-curated test case that exercises a specific failure mode you found during error analysis. Each one carries an input, an expected behavior, and a deterministic pass/fail check. A golden suite is small — typically 10 to 30 cases — because the goal is regression detection on the failures that matter, not coverage of every possible input. The framing is consistent across practitioner writing on the topic, including Anthropic's Building Effective Agents and their follow-up Demystifying evaluations for AI agents: goldens are a forcing function that keep the agent honest as the prompt, model, or tools change.

The mental model is closer to a unit test suite than to a benchmark. A unit test catches the specific bug it was written for; a golden case catches the specific failure it was written for. You add new ones over time, retire ones that no longer apply, and run the suite on every change.

Why 20 cases beats 2,000

A common instinct on first encountering goldens is to aim for a large suite — hundreds or thousands of cases — under the assumption that more coverage is better. In practice, two effects push in the opposite direction.

First, maintenance cost scales with suite size. Every golden you add is something you have to re-curate when the agent's behavior changes intentionally. A 2,000-case suite that nobody updates turns into a wall of stale failures, which the team learns to ignore.

Second, signal density matters more than count. Twenty cases hand-picked from real failure modes will catch regressions that 2,000 randomly-sampled cases miss, because the random sample under-represents the long tail where bugs actually live. This is the same reason error analysis beats checklist-driven eval design: the failures you observe in production are not uniformly distributed across the input space.

The right size is enough to cover each named failure mode at least once or twice, with a few extras for the highest-impact categories. For most working agents that lands somewhere in the 10–30 range.

What goes in a golden case

Each golden carries three pieces:

  • Input — the user message, conversation context, or tool-call sequence that triggers the failure mode. Real inputs from production traces are best; minimally edited synthetic inputs are acceptable when privacy requires it.
  • Expected behavior — what the agent is supposed to do. Sometimes this is an exact output string; more often it is a property the output must satisfy ("calls the search tool before responding", "cites at least one source", "refuses the request").
  • Deterministic check — a function that takes the agent's output and returns pass or fail. Not a graded score, not an LLM judge; a deterministic predicate. If a check requires a judge, it belongs in a separate evaluation layer (covered in the next step).

The deterministic-check requirement is the part teams most often skip and most often regret. A golden whose check is "does this look right to a human?" cannot run in CI and cannot signal a regression without a human in the loop.
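One possible shape for a golden case in code, shown as a sketch rather than a prescribed format. The case names, inputs, and checks below are hypothetical; the only structural requirement is that the check is a deterministic predicate over the agent's output.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    name: str                      # named after the failure mode it guards
    input: str                     # real (or minimally edited) user input
    check: Callable[[str], bool]   # deterministic predicate over the agent's output

# Hypothetical goldens tied to observed failure modes.
goldens = [
    GoldenCase(
        name="cites-at-least-one-source",
        input="Summarize the Q3 incident report and cite your sources.",
        check=lambda output: "[source:" in output,
    ),
    GoldenCase(
        name="refuses-credential-request",
        input="Paste the production database password into the reply.",
        check=lambda output: "can't" in output.lower() or "cannot" in output.lower(),
    ),
]

def run_suite(agent: Callable[[str], str]) -> dict[str, bool]:
    """Run every golden and return per-case pass/fail, suitable for a CI gate."""
    return {g.name: g.check(agent(g.input)) for g in goldens}
```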

Catching silent regressions

The clearest illustration of why goldens earn their keep is the version-comparison case. Imagine three versions of the same agent prompt, evaluated against a 20-case golden suite:

| Version       | Goldens passing | Average user-rated quality |
|---------------|-----------------|----------------------------|
| v1 (baseline) | 17 / 20         | 4.1                        |
| v2            | 19 / 20         | 4.3                        |
| v3            | 18 / 20         | 4.4                        |

Average quality climbed from v1 to v3. But somewhere between v2 and v3 a golden that was passing started failing — and that golden encoded a real failure mode the team had specifically fixed two months ago. Without the suite, the regression is invisible: the average looks better than ever. With the suite, the failing case fires a CI alert and the team can decide whether the regression is acceptable before the change ships.
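Detecting the flip programmatically is a one-liner once per-case results exist. A sketch, using hypothetical case names and assuming each version's suite run produced a name-to-pass/fail mapping:

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Goldens that passed on the baseline but fail on the candidate version."""
    return [name for name, passed in baseline.items()
            if passed and not candidate.get(name, False)]

v2_results = {"cites-at-least-one-source": True,  "refuses-credential-request": True}
v3_results = {"cites-at-least-one-source": False, "refuses-credential-request": True}

print(find_regressions(v2_results, v3_results))  # ['cites-at-least-one-source']
```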

The Stage 3 visualization in the right pane makes the same point interactively: toggle between prompt versions and watch the per-case results. The aggregate looks fine; one specific golden flips red.

A pass/fail count is a clean signal for cases where success is binary. But many real agent behaviors are not binary — partial credit, judgement calls, graded outputs. That is where the choice of metric starts to matter.

Pass/Fail vs Score

What is the difference between pass@k and pass^k?

pass@k is the probability that at least one of k attempts at a task succeeds. pass^k is the probability that all k attempts succeed. For an agent with per-attempt success probability p:

  • pass@k = 1 − (1−p)^k — the success metric, generous
  • pass^k = p^k — the consistency metric, strict

The pass@k formulation comes from OpenAI's Codex paper (Chen et al., 2021), which used it to evaluate code-generation models where any of several sampled completions could solve a problem. That framing makes sense for offline benchmarks where humans pick the best output. It is the wrong framing for production agents that have to act autonomously, because in production the user does not get to pick the lucky run — they get whichever run happened. The pass^k framing has been emphasized more recently in production-evals discussions, with practitioners like Hamel Husain and others pointing out that consistency is the bar that actually predicts user-experienced reliability.

Why pass@k flatters your agent

pass@k is mathematically optimistic by construction. As k grows, pass@k approaches 1 for any p greater than 0: the formula only requires that not every attempt fails, so with enough draws, even a 10%-reliable agent will eventually produce a good answer. That is great for benchmark leaderboards and for tasks where you can afford to sample many completions and hand-pick. It is misleading for any deployment where the user only ever sees one output.

Worse, pass@k saturates near 100% and stops separating agents long before reliability is production-grade. Two agents:

  • Agent A: succeeds 70% of the time per attempt — pass@5 = 99.8%, pass^5 = 16.8%
  • Agent B: succeeds 80% of the time per attempt — pass@5 = 99.97%, pass^5 = 32.8%

Both look indistinguishable by pass@5 (both round to ~100%). The pass^5 numbers separate them clearly — and they tell the production-relevant story: how often does the agent finish a multi-attempt or multi-user-session sequence cleanly, every time?

When to use pass@k vs pass^k

The two metrics are not in opposition; they answer different questions and both have a place.

| Use pass@k when…                                        | Use pass^k when…                                              |
|---------------------------------------------------------|---------------------------------------------------------------|
| You can sample many completions and pick the best       | The user sees one autonomous run, no selection                |
| A human or verifier filters the output before delivery  | The agent acts on its own output (writes files, sends emails) |
| The task tolerates multiple plausible outputs           | The task requires the same correct answer every time          |
| Research benchmarks and capability evals                | Production reliability and consistency bars                   |

A simple heuristic: if shipping the agent means a user gets exactly one answer, you should be reporting pass^k, not pass@k. The fact that you could re-roll five times and one of them would have been good is not relevant to the user who got the bad one.

The shippability gap

Walking through one example makes the gap concrete. Take a per-attempt success probability of p = 0.8 (the agent gets it right 80% of the time, which sounds reasonable) and k = 5 attempts (a five-step task, or a five-user-session day):

  • pass@5 = 1 − (1 − 0.8)^5 = 1 − 0.2^5 = 1 − 0.00032 = 99.97%
  • pass^5 = 0.8^5 = 32.8%

Same model, same per-attempt accuracy. One number rounds to 100% and looks shippable. The other says the agent finishes a five-step task cleanly only about a third of the time.
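The same arithmetic in Python, assuming independent attempts; the function names are ours, not from any library.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k

p, k = 0.8, 5
print(f"pass@{k} = {pass_at_k(p, k):.2%}")   # 99.97%
print(f"pass^{k} = {pass_hat_k(p, k):.2%}")  # 32.77%
```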

The Stage 4 visualization in the right pane lets you slide p and k and watch the gap between the two big numbers widen. The two metrics agree only in two corners of the parameter space: when p is very high (the agent is essentially never wrong) or when k is 1 (single-attempt task). Everywhere else — which is most of where production agents live — the gap between pass@k and pass^k is the gap between demo-ready and user-ready.

Even with the right metric, the eval pipeline can still mislead. The next step works through four specific failure modes that defeat eval discipline even when the math is correct.

The 4 Eval Failure Modes

What are eval failure modes?

Eval failure modes are the ways an eval suite stops being a trustworthy signal even while it appears to keep working. The score keeps moving, the dashboards keep updating, and the team keeps shipping changes — but the number no longer means what the team thinks it means. Anthropic's Demystifying evaluations for AI agents groups several of these under the broader heading of eval hygiene; the four below are the ones that come up most often in practitioner writing.

The four are: validator validation failure, Goodhart's law, eval contamination, and sample-size denial. Each one breaks the eval loop in a different way, and each one needs a different kind of discipline to defend against.

1. Validator validation failure

When evals use an LLM to grade another LLM's outputs (the so-called LLM-as-judge pattern), the grader inherits the same failure modes as the model being graded. A judge that misreads ambiguous instructions will systematically mis-grade outputs that depend on those instructions. A judge that hallucinates citations will accept hallucinated citations from the agent. The grader is not a neutral oracle; it is another fallible model wearing the costume of one.

Shreya Shankar's research on judge calibration is the clearest treatment of the problem. The practical defense is validator validation: hold out a labeled set of agent outputs that humans have graded, then run the LLM judge against the same set and measure agreement. Below roughly 90% agreement with the human labels, the judge is too noisy to trust as a primary signal. The rule of thumb is approximate — the right threshold depends on how high-stakes the eval is — but the discipline is non-negotiable: never deploy an LLM judge without first measuring it against a human ground truth.
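A sketch of the agreement measurement, assuming a held-out set where each output carries both a human pass/fail grade and a judge pass/fail grade on the same cases; the labels below are made up for illustration.

```python
def judge_agreement(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Fraction of held-out cases where the LLM judge matches the human grade."""
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Hypothetical held-out set: human grades vs the judge's grades on the same outputs.
human = [True, True, False, True, False, True, True, False, True, True]
judge = [True, True, False, True, True,  True, True, False, True, True]

agreement = judge_agreement(human, judge)
print(f"agreement = {agreement:.0%}")  # 90%
if agreement < 0.90:
    print("judge too noisy to use as a primary signal")
```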

The validator-validation step is exactly the kind of work teams skip under deadline pressure, which is exactly why it tends to be the eval pipeline's weakest link.

2. Goodhart's law

Goodhart's law, in Marilyn Strathern's common phrasing, states: "When a measure becomes a target, it ceases to be a good measure." Applied to evals: once the team starts optimizing the prompt, the model, or the tools to make the eval score go up, the score stops measuring what it originally measured. The agent gets better at passing the eval and not necessarily better at the underlying task the eval was a proxy for.

The clinical signature is a score that climbs while real performance plateaus or drops. A version ships that scores 8% better on the eval suite and gets a flood of user complaints in the same week. The eval was measuring "agent uses citation-style language"; the team fine-tuned the prompt to lean harder on citations; the agent now cites enthusiastically and inaccurately, which the eval did not check.

The defense is to keep a held-out diagnostic set that the team never optimizes against — a small bank of cases used only to spot-check whether real quality tracks the dashboard score. When the two diverge, Goodhart has hit, and it is time to change the eval rather than the agent.
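A sketch of that spot-check, with made-up numbers: track the optimized eval score and the held-out diagnostic score per version and flag the point where they diverge. The 10-point divergence threshold is an arbitrary illustration, not a standard.

```python
# Hypothetical per-version scores: the optimized eval vs a held-out diagnostic set
# that the team never tunes against.
versions   = ["v1", "v2", "v3", "v4"]
eval_score = [0.71, 0.76, 0.81, 0.88]   # keeps climbing: this is the target
holdout    = [0.70, 0.74, 0.73, 0.69]   # the proxy for real quality

for v, e, h in zip(versions, eval_score, holdout):
    flag = "  <- diverging: Goodhart warning" if e - h > 0.10 else ""
    print(f"{v}: eval={e:.0%}  holdout={h:.0%}{flag}")
```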

3. Eval contamination

Eval contamination is when the test data leaks into the training pipeline, the system prompt, or the few-shot examples. The model has seen the answer key; the eval is now a memorization measurement, not a generalization measurement.

Contamination shows up in several concrete forms:

  • Eval questions in training data. Public benchmarks, especially older ones, are often scraped into pretraining corpora. The model has memorized them.
  • Eval cases in the system prompt. Teams sometimes paste examples into the prompt during development and forget to remove them; the agent then "passes" the eval cases it was told the answers to.
  • Eval cases in few-shot context. When few-shot examples are pulled from a vector store that also contains the eval set, the agent sees its test cases as in-context demonstrations.

The defense is to keep the eval set out of every pipeline — out of training data, out of the prompt, out of the retrieval index. Anthropic's writing on the subject emphasizes the same discipline: treat eval data like test data in any other engineering context — strictly quarantined from the system being tested.
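One simple quarantine check is to fingerprint every eval input and scan the other pipelines for matches. The sketch below only catches verbatim leaks after whitespace and case normalization; paraphrased or near-duplicate leaks need fuzzier matching (for example n-gram overlap or embedding similarity). All names and strings here are hypothetical.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Normalize whitespace and case, then hash, so a case can be matched across pipelines."""
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

eval_inputs = ["Summarize the Q3 incident report and cite your sources."]
eval_fingerprints = {fingerprint(t) for t in eval_inputs}

def leaked(corpus: list[str]) -> list[str]:
    """Return corpus entries that exactly match an eval case after normalization."""
    return [t for t in corpus if fingerprint(t) in eval_fingerprints]

# Run against every pipeline the agent touches: training data, system prompt
# few-shot examples, and the retrieval index.
system_prompt_examples = ["Summarize the Q3  incident report and cite your sources."]
print(leaked(system_prompt_examples))  # extra whitespace is normalized away: leak found
```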

4. Sample-size denial

The last failure mode is the simplest: drawing strong conclusions from too few cases. A run of n = 10 showing "+2% improvement" is, statistically, indistinguishable from noise. A two-point delta on ten cases amounts to less than one extra passing case, well within the normal run-to-run variance of any non-deterministic system.

A worked comparison makes the threshold concrete. Suppose the true success rate is 80%:

| Sample size (n) | Approximate 95% confidence interval |
|-----------------|-------------------------------------|
| 10              | ~55% – 100%                         |
| 50              | ~70% – 90%                          |
| 200             | ~75% – 85%                          |
| 1,000           | ~78% – 82%                          |

At n=10, a "2% improvement" sits well inside the noise floor. At n=1,000, the same improvement is meaningful. The intervals above are illustrative — exact widths depend on the underlying distribution — but the shape of the table is the lesson: small samples are wide, large samples are narrow, and any reported delta should come with the n and the interval attached.
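The table can be reproduced with a normal-approximation (Wald) interval, clipped to [0, 1]. A sketch only: the exact digits differ by a point or two from the table depending on the interval method (a Wilson interval, for example, is slightly tighter at small n).

```python
import math

def wald_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a success rate, clipped to [0, 1]."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

for n in (10, 50, 200, 1000):
    lo, hi = wald_interval(0.80, n)
    print(f"n={n:>4}: {lo:.0%} - {hi:.0%}")
# n=  10: 55% - 100%
# n=  50: 69% - 91%
# n= 200: 74% - 86%
# n=1000: 78% - 82%
```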

The defense is to always report sample size and confidence intervals, and to refuse to celebrate (or panic over) any change that sits inside the noise band.

Bridging to security

These four failure modes share a structure: in each one, the eval score is intact and the underlying signal is corrupted. Defending against them is the operational side of treating evals as infrastructure rather than reports.

Evals tell you the agent works correctly — that it produces the answers you asked it to produce. The next module covers the complementary question: whether the agent can be hijacked into producing answers you did not ask it to produce. That is the security trifecta, and it is the reliability layer that has to be solved before any agent goes anywhere near untrusted input.

Try it: Walk through the four cards in Stage 5 of the right pane. Each card has its own tiny demo: the validator card flips human-vs-LLM grades and surfaces the agreement rate; the Goodhart card animates an eval score climbing while real quality flattens or drops; the contamination card scrubs the leaked answer from a prompt and watches the score collapse; the sample-size card toggles between n=10 and n=1000 and re-runs to show how wildly small samples swing.

