MSR delegation study — Cascading fidelity loss over 20 iterations

[Figure: "20 delegated LLM edits compound: 19–34% fidelity loss by iteration 20." Panel title: Artifact fidelity over delegated iterations. x-axis: delegated iteration (0 to 20); y-axis: artifact fidelity (60% to 100%). Series: strong models, −19% by iteration 20; same models, worst case, −34%. Source: learnaivisually.com/ai-explained/msr-delegation-fidelity-drift]

The news. On May 15, 2026, Microsoft Research's Philippe Laban, Tobias Schnabel, and Jen Neville posted a follow-up to their delegation paper clarifying what their headline number does and doesn't say. The headline: state-of-the-art frontier LLMs degrade artifact fidelity by roughly 19–34% over 20 successive delegated edits of multi-step documents. The follow-up's main message is calibration — this is a stress-test setup with constrained in-loop human verification, and real production systems include verification layers, oversight, and human-in-the-loop checkpoints that mitigate cumulative drift.

Picture a children's game of telephone, but with 20 children in the line instead of 5. The first child whispers a sentence cleanly. The second child mishears one word and replaces it with a near-rhyme. The third child smooths the awkward phrase into something more natural. By the tenth child, the sentence is grammatical but no longer about the same thing; by the twentieth, the meaning has drifted enough that the first child can't quite recognize their own message. No single hand-off introduced a large error. The error compounded because no one was allowed to compare back to the original in the loop.

The Microsoft Research setup is that game, formalized. Each "child" is a strong frontier LLM given the previous iteration's document plus an editing instruction; each iteration produces a new document; the chain is run with constrained in-loop verification against the iteration-0 artifact. Fidelity is scored at the end by comparing iteration 20 against iteration 0 along structural and semantic axes. The authors run this across 20 delegated iterations to isolate the compounding-drift signal — the question isn't "can frontier models edit a document?" (they can, fine), it's "how fast does the chain of edits diverge from the starting point when in-loop human verification is constrained?"
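The shape of that loop can be sketched in a few lines of Python. This is a toy illustration, not the authors' harness: `toy_edit` is a hypothetical stand-in for an LLM edit call, and `fidelity` uses a character-level similarity ratio from the standard library as a crude proxy for the paper's structural and semantic scoring, which is not reproduced here. The sketch also records a score every round for visibility, whereas the study scores the final iteration against iteration 0.

```python
from difflib import SequenceMatcher
from typing import Callable


def fidelity(original: str, current: str) -> float:
    """Toy fidelity proxy in [0, 1]: character-level similarity ratio.

    Assumption: a stand-in for the paper's scoring, not a reproduction of it.
    """
    return SequenceMatcher(None, original, current).ratio()


def run_chain(original: str, edit: Callable[[str], str], rounds: int = 20) -> list[float]:
    """Apply `edit` repeatedly; each round sees only the previous round's output."""
    doc = original
    scores = [fidelity(original, doc)]  # iteration 0 scores 1.0 against itself
    for _ in range(rounds):
        doc = edit(doc)  # the editor never sees `original` inside the loop
        scores.append(fidelity(original, doc))
    return scores


def toy_edit(doc: str) -> str:
    """Hypothetical stand-in for a delegated LLM edit: silently drops the last word."""
    words = doc.split()
    return " ".join(words[:-1]) if len(words) > 1 else doc


if __name__ == "__main__":
    text = "alpha beta gamma delta epsilon zeta eta theta"
    for i, score in enumerate(run_chain(text, toy_edit, rounds=5)):
        print(f"iter {i}: fidelity {score:.2f}")
```

The point of the design is the missing argument: `edit` receives only the previous document, so nothing inside the loop can compare back to the starting artifact.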

A worked example with named numbers makes the compounding concrete (illustrative: the paper publishes the aggregate 19–34% drop on multi-step documents; the per-step decomposition here uses the rounded headline figures, not paper-internal per-iteration tables). Suppose a model has a ~1.5% per-step "drift contribution" to artifact fidelity on a multi-step document, and that each round's loss multiplies with the last. After one round, fidelity sits near 98.5%, easily mistaken for noise. After 20 rounds, naive multiplication gives 0.985^20 ≈ 0.74, about −26% fidelity, comfortably inside the paper's 19–34% range. Push the per-step contribution to 2% and you land at 0.98^20 ≈ 0.67, near the worst-case end of the band. Push it down to 1% and you land at 0.99^20 ≈ 0.82, near the strong-model end. The headline number isn't a model-quality claim about iteration 1; it's a claim about how a small per-step contribution compounds over 20 rounds when in-loop verification is constrained. It rhymes with the compounding-error analysis any production agent harness has to reckon with.
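The arithmetic above is easy to check and to invert. The sketch below assumes the same simplification as the worked example: a constant per-step drift that compounds multiplicatively. The function names are illustrative, and nothing here comes from the paper's per-iteration data.

```python
def fidelity_after(per_step_drift: float, rounds: int = 20) -> float:
    """Fidelity remaining after `rounds` edits, assuming constant multiplicative drift."""
    return (1.0 - per_step_drift) ** rounds


def implied_per_step_drift(total_loss: float, rounds: int = 20) -> float:
    """Invert the model: the constant per-step drift that yields `total_loss` overall."""
    return 1.0 - (1.0 - total_loss) ** (1.0 / rounds)


if __name__ == "__main__":
    # Forward direction: the three per-step rates used in the worked example.
    for drift in (0.01, 0.015, 0.02):
        print(f"{drift:.1%} per step -> fidelity {fidelity_after(drift):.2f} after 20 rounds")

    # Reverse direction: what the headline band implies per step under this model.
    for loss in (0.19, 0.34):
        print(f"{loss:.0%} total loss -> {implied_per_step_drift(loss):.2%} per step")
```

Run in reverse, the headline band of 19–34% corresponds to roughly 1% to 2% of drift per round under this model, which is the range where a single hand-off still looks like noise.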

Where this sits next to other long-horizon agent evals

| Eval style | Chain depth | In-loop verification | What it measures |
| --- | --- | --- | --- |
| Single-turn benchmarks (MMLU, Hendrycks et al. 2020; HumanEval, Chen et al. 2021) | 1 | not applicable | First-shot answer quality on a single prompt |
| Bounded agent benchmarks (e.g. AgentBench-style harness evals; bounded turn counts vary by benchmark) | typically ~3–10, varies by task | optional, varies | Multi-turn success at a fixed horizon |
| MSR delegated-edit stress test (this paper) | 20 | constrained: deliberately limited in-loop human verification | Cumulative fidelity drift when in-loop checks against iter 0 are constrained |
| Production agent (as the authors note) | varies by workload | layered: pre/post-call checks, retrieval-grounded validation, human review | End-to-end task success with engineered drift mitigation |

The clarifying post pushes back on a misreading of the headline. Reading "−34% by iteration 20" as "frontier models can't edit documents reliably" overstates what the experiment shows. The setup intentionally constrains the parts of a real agent stack that exist specifically because compounding drift is a known problem — output validators, schema checks, retrieval grounding, and human-in-the-loop review. The paper's contribution is to put a number on the failure mode that those layers are paid to mitigate, so that designers of production agent harnesses can argue from the size of the drift signal rather than the existence of it. A 20-round chain with constrained in-loop verification is the worst case the verification stack is designed against, not the typical case the verification stack is designed for.

The implication for postmortems on long-horizon agent failures is to look for drift first and capability second. If an agent produces a clearly wrong artifact on a long task, the failure mode is more often "the chain of edits drifted and no checkpoint caught it" than "the first call got the answer wrong." This reframes the engineering question from "can we make the model better at the task?" to "can we keep the in-loop artifact anchored to the original spec for long enough?" It also points toward concrete designs: per-round retrieval against the starting artifact, mid-chain validators that compare to a stored reference, and bounded-horizon agent designs that reset state before drift accumulates.
