TELBench / DRIFT — Span-level error localization

[Figure: Same wrong answer: final-answer eval vs. span-level localization. A deep-research agent's 7-step trajectory (1 parse task, 2 web search, 3 open source, 4 read table, 5 cross-check, 6 synthesize, 7 final answer) ends in a wrong answer; the DRIFT scan marks the first error span at step 4. Inset chart: first-error localization accuracy, final-answer eval (baseline) vs. DRIFT, +30 pp. Caption: Don't just grade the answer, find the step. TELBench: 1,000 instances distilled from 2,790 real agent trajectories.]
learnaivisually.com/ai-explained/telbench-span-level-error-localization

The news. On June 1, 2026, researchers posted TELBench and DRIFT to arXiv. TELBench is a 1,000-instance benchmark distilled from 2,790 real agent trajectories; DRIFT is a method that localizes the offending span of a deep-research run, predicting where the first error occurs, rather than only judging the final answer. The paper reports that DRIFT improves both span-level error localization and first-error accuracy by up to 30 percentage points over baselines.

Picture the GPS replaying your drive. You set out for the right town and ended up in the wrong one — that is all a final-answer grader knows: wrong destination, trip failed. It cannot hand you the fix, because the fix is not at the destination. Somewhere back along the route you took one wrong turn, and every mile after it — however carefully driven — only carried you deeper into the mistake. What you actually want is the replay that rewinds the whole drive and circles the exact turn where you first left the route. That circled turn is a span, and naming it is what span-level error localization does for an agent.

A deep-research agent fails the same way. It runs a long trajectory — parse the task, search, open a source, read a table, cross-check, synthesize, answer — and one early slip (say, a figure misread from a table) quietly poisons everything downstream. Grade only the final answer and you get one bit: ✗. You know the run failed; you have no idea whether the bug is in the search query, the retrieval, the table parser, or the synthesis prompt. That is the error-analysis gap the Evals & Diagnostics module keeps hammering: you cannot fix what you cannot locate. The reason it bites here is that agent failures compound — a small early mistake snowballs across the remaining steps, so the answer is wrong for a reason that lives far from the answer.

DRIFT is the replay engine. Instead of re-scoring the output, it walks the trajectory and predicts the first-error span — the earliest step whose mistake the rest of the run inherits. In the GPS picture it does not just confirm you are in the wrong town; it rewinds to the intersection and circles it. That turns a verdict into an action: the localized step is a span you can replay, a root cause you can attach to the right tool or prompt, and a regression you can guard against next time.
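To make "predict the first-error span" concrete, here is a minimal sketch of what a first-error localizer returns. It is not DRIFT's algorithm, which this explainer does not describe. The `Step` type, the `localize_first_error` function, and the `tainted` set standing in for a per-step fault checker are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Step:
    index: int  # 1-based position in the trajectory
    name: str


def localize_first_error(
    steps: Sequence[Step],
    is_faulty: Callable[[Step], bool],
) -> Optional[Step]:
    """Return the earliest step flagged as faulty, or None if none is."""
    for step in steps:  # walk the run in order, like the replay
        if is_faulty(step):
            return step
    return None


# The seven-step trajectory from the example above.
NAMES = ["parse task", "web search", "open source", "read table",
         "cross-check", "synthesize", "final answer"]
trajectory = [Step(i, name) for i, name in enumerate(NAMES, start=1)]

# Hypothetical checker: the table misread at step 4 taints every later step.
tainted = {4, 5, 6, 7}
first_error = localize_first_error(trajectory, lambda s: s.index in tainted)

final_answer_verdict = 7 not in tainted  # all a final-answer grader reports
print(final_answer_verdict)                 # False
print(first_error.index, first_error.name)  # 4 read table
```

The grader and the localizer see the same failed run. The grader returns one boolean, while the localizer returns a step you can replay. In practice the hard part is the checker itself, which this sketch assumes away.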

| Evaluation granularity | What it returns | What it can't tell you |
|---|---|---|
| Final-answer eval | pass / fail on the output | which step broke the run |
| LLM-judge rubric | a score + a rationale | still answer-level; no step pointer |
| Span-level localization (DRIFT) | the offending span + first-error step | (up to +30 pp first-error accuracy vs baselines, paper) |

Here is why the localization signal is so much richer than the grade. TELBench's real scale is 1,000 instances distilled from 2,790 trajectories. Now picture one illustrative failing run of a dozen steps. A final-answer grader emits exactly one bit for it — failed — and says nothing about which of those dozen steps went wrong. A localizer has to do something far harder: point at the one right step out of the twelve, where blind guessing would land the first-error step only about one time in twelve (≈ 8%). Against that difficulty, DRIFT's reported +30 percentage points in first-error accuracy is the whole story — enough to lift a localizer from, say, a 45% baseline to roughly 75% (only the +30 pp and the 1,000 / 2,790 counts come from the paper; the dozen-step run and the 45% → 75% endpoints are illustrative). The grade was one bit; the localized span is a repair ticket.
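The arithmetic in that paragraph can be checked with a short sketch. The function names are hypothetical, and the exact-match definition of first-error accuracy used here is an assumption; the paper may score it differently. Only the +30 pp figure comes from the paper. The 12-step run, the 45% baseline, and the sample predictions are illustrative.

```python
from typing import Sequence


def random_guess_accuracy(num_steps: int) -> float:
    """Chance of hitting the first-error step by guessing uniformly."""
    return 1.0 / num_steps


def first_error_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of runs whose predicted first-error step matches the label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty, equal-length sequences")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


def add_percentage_points(accuracy: float, pp: float) -> float:
    """Shift an accuracy in [0, 1] by an absolute number of percentage points."""
    return accuracy + pp / 100.0


print(f"{random_guess_accuracy(12):.1%}")                # 8.3%
print(f"{add_percentage_points(0.45, 30):.0%}")          # 75%
print(first_error_accuracy([4, 2, 7, 4], [4, 3, 7, 1]))  # 0.5
```

Note that percentage points are absolute: +30 pp on a 45% baseline gives 75%, whereas a 30% relative improvement would give only 58.5%.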

