MSR delegation study — Cascading fidelity loss over 20 iterations
The news. On May 15, 2026, Microsoft Research's Philippe Laban, Tobias Schnabel, and Jen Neville posted a follow-up to their delegation paper clarifying what their headline number does and doesn't say. The headline: state-of-the-art frontier LLMs degrade artifact fidelity by roughly 19–34% over 20 successive delegated edits of multi-step documents. The follow-up's main message is calibration: this is a stress-test setup with constrained in-loop human verification, and real production systems include verification layers, oversight, and human-in-the-loop checkpoints that mitigate cumulative drift.
Picture a children's game of telephone, but with 20 children in the line instead of 5. The first child whispers a sentence cleanly. The second child mishears one word and replaces it with a near-rhyme. The third child smooths the awkward phrase into something more natural. By the tenth child, the sentence is grammatical but no longer about the same thing; by the twentieth, the meaning has drifted enough that the first child can't quite recognize their own message. No single hand-off introduced a large error. The error compounded because no one was allowed to compare back to the original in the loop.
The Microsoft Research setup is that game, formalized. Each "child" is a strong frontier LLM given the previous iteration's document plus an editing instruction; each iteration produces a new document; the chain is run with constrained in-loop verification against the iteration-0 artifact. Fidelity is scored at the end by comparing iteration 20 against iteration 0 along structural and semantic axes. The authors run this across 20 delegated iterations to isolate the compounding-drift signal — the question isn't "can frontier models edit a document?" (they can, fine), it's "how fast does the chain of edits diverge from the starting point when in-loop human verification is constrained?"
A worked example makes the compounding concrete. The numbers are illustrative: the paper publishes the aggregate 19–34% drop on multi-step documents, and the per-step decomposition here uses the rounded headline figures, not paper-internal per-iteration tables. Suppose a model has a ~1.5% per-step "drift contribution" to artifact fidelity on a multi-step document, and assume for simplicity that every round loses the same fraction independently of the others. After one round, fidelity sits near 98.5%, easily mistaken for noise. After 20 rounds, naive multiplication gives 0.985^20 ≈ 0.74, about −26% fidelity, comfortably inside the paper's 19–34% range. Push the per-step contribution to 2% and you land at 0.98^20 ≈ 0.67 (about −33%), near the worst-case end of the band. Push it down to 1% and you land at 0.99^20 ≈ 0.82 (about −18%), right around the strong-model end. The headline number isn't a model-quality claim about iteration 1; it's a claim about how a small per-step contribution compounds over 20 rounds when in-loop verification is constrained. It rhymes with the compounding-error analysis any production agent harness has to reckon with.
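The arithmetic above can be reproduced in a few lines. This is a minimal sketch under the same simplifying assumption as the worked example (a constant, independent per-step drift); the function names `fidelity_after` and `implied_per_step_drift` are hypothetical helpers for illustration, not anything from the paper.

```python
def fidelity_after(per_step_drift: float, rounds: int = 20) -> float:
    """Fidelity remaining after `rounds` edits, assuming each edit
    independently loses the same fraction `per_step_drift`."""
    return (1.0 - per_step_drift) ** rounds


def implied_per_step_drift(total_drop: float, rounds: int = 20) -> float:
    """Invert the formula: the constant per-step drift that would
    produce `total_drop` after `rounds` edits."""
    return 1.0 - (1.0 - total_drop) ** (1.0 / rounds)


if __name__ == "__main__":
    # Forward direction: per-step drift -> fidelity at round 20.
    for drift in (0.01, 0.015, 0.02):
        print(f"{drift:.1%} per step -> {fidelity_after(drift):.2f} after 20 rounds")

    # Reverse direction: headline drop -> implied per-step drift.
    for drop in (0.19, 0.34):
        print(f"{drop:.0%} total drop -> {implied_per_step_drift(drop):.2%} per step")
```

Running the inversion on the two ends of the headline band gives per-step figures of roughly 1% and 2%, which is why the illustrative 1–2% range lines up with the published 19–34% aggregate. Real drift is unlikely to be constant or independent across rounds, so this is a way to build intuition, not a model of the paper's data.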
Where this sits next to other long-horizon agent evals
| Eval style | Chain depth | In-loop verification | What it measures |
|---|---|---|---|
| Single-turn benchmarks (MMLU Hendrycks et al. 2020, HumanEval Chen et al. 2021) | 1 | not applicable | First-shot answer quality on a single prompt |
| Bounded agent benchmarks (e.g. AgentBench-style harness evals; bounded turn counts vary by benchmark) | typically ~3–10, varies by task | optional — varies | Multi-turn success at a fixed horizon |
| MSR delegated-edit stress test (this paper) | 20 | constrained — deliberately limited in-loop human verification | Cumulative fidelity drift when in-loop checks against iter 0 are constrained |
| Production agent (as the authors note) | varies by workload | layered — pre/post-call checks, retrieval-grounded validation, human review | End-to-end task success with engineered drift mitigation |
The clarifying post pushes back on a misreading of the headline. Reading "−34% by iteration 20" as "frontier models can't edit documents reliably" overstates what the experiment shows. The setup intentionally constrains the parts of a real agent stack that exist specifically because compounding drift is a known problem: output validators, schema checks, retrieval grounding, and human-in-the-loop review. The paper's contribution is to put a number on the failure mode that those layers are paid to mitigate, so that designers of production agent harnesses can argue from the size of the drift signal rather than the existence of it. A 20-round chain with constrained in-loop verification is the worst case the verification stack is built to prevent, not the typical conditions a production system runs under.
The implication for postmortems on long-horizon agent failures is to look for drift first and capability second. If an agent produces a clearly wrong artifact on a long task, the failure mode is more often "the chain of edits drifted and no checkpoint caught it" than "the first call got the answer wrong." This reframes the engineering question from "can we make the model better at the task?" to "can we keep the in-loop artifact anchored to the original spec for long enough?" It also points toward concrete designs: per-round retrieval against the starting artifact, mid-chain validators that compare to a stored reference, and bounded-horizon agent designs that reset state before drift accumulates.
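One of those designs, a mid-chain validator that compares each round against a stored reference, can be sketched as follows. Everything here is an assumption for illustration: `similarity` uses Python's `difflib.SequenceMatcher` as a crude lexical stand-in for the paper's structural and semantic fidelity scoring, `run_chain` and `drop_last_word` are hypothetical names, and the toy editor is a deterministic substitute for an LLM call.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional


def similarity(reference: str, candidate: str) -> float:
    """Crude word-level similarity in [0, 1]; a stand-in for a real
    fidelity metric, not the scoring used in the paper."""
    return SequenceMatcher(None, reference.split(), candidate.split()).ratio()


def run_chain(
    reference: str,
    edit: Callable[[str], str],
    rounds: int = 20,
    min_similarity: Optional[float] = None,
) -> List[float]:
    """Apply `edit` repeatedly and record similarity to the iteration-0
    reference. If `min_similarity` is set, an edit that falls below it
    is rejected and the last accepted document is kept."""
    current = reference
    scores: List[float] = []
    for _ in range(rounds):
        candidate = edit(current)
        score = similarity(reference, candidate)
        if min_similarity is not None and score < min_similarity:
            # Checkpoint failed: discard this round's output.
            candidate = current
            score = similarity(reference, current)
        current = candidate
        scores.append(score)
    return scores


def drop_last_word(text: str) -> str:
    """Toy 'editor' that loses a little content on every pass."""
    return " ".join(text.split()[:-1])


if __name__ == "__main__":
    doc = " ".join(f"w{i}" for i in range(40))
    unanchored = run_chain(doc, drop_last_word)
    anchored = run_chain(doc, drop_last_word, min_similarity=0.9)
    print(f"no checkpoint:   {unanchored[-1]:.3f}")
    print(f"with checkpoint: {anchored[-1]:.3f}")
```

The design choice worth noting is that the checkpoint always compares against the iteration-0 reference, not against the previous round. Comparing only to the previous round would pass every small step and miss the cumulative drift, which is exactly the failure mode in the telephone game. A production version would presumably swap in a semantic or schema-aware metric and retry or escalate to human review instead of silently keeping the last accepted document.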
Goes deeper in: AI Agents → Evals & Diagnostics → Compounding errors and Agent Engineering → Incident Handling → Postmortem discipline
Related explainers
- FutureSim — Harness-level agent eval vs single-shot QA — a different angle on the same gap: single-shot QA dramatically overstates agent capability versus realistic harnessed evaluation
- Grep vs vector — Agentic retrieval study — the harness-and-tool-call story dominates the model story, complementary to "the chain dominates the per-step model"
- AsyncFC — Symbolic futures in the decode stream — a serving-side mechanism that keeps the agent loop ticking; orthogonal to fidelity drift but lives in the same long-running-loop regime