What is span-level error localization?

Span-level error localization attributes an agent's failure to a specific span (one step or a contiguous segment of its trajectory) instead of grading only the final answer. For a deep-research agent that searches, reads, and synthesizes across many steps, it points at the step where the answer first became unreliable, so the bug can be tied to a concrete tool, retrieval, or prompt rather than to the run as a whole. TELBench and DRIFT were introduced together in June 2026: TELBench benchmarks this localization, and DRIFT performs it.

How is DRIFT different from a normal pass/fail eval?

A pass/fail eval collapses an entire multi-step run into one bit — it tells you the agent failed but not where. DRIFT instead inspects the trajectory and predicts the first-error span: the earliest step whose mistake every later step inherits. The paper reports it improving span-level and first-error localization accuracy by up to 30 percentage points over baselines, turning a verdict into an actionable pointer at the offending step.

Why does localizing the first error matter more than the final answer?

Agent failures compound: a single early mistake — a misread figure, a bad search result — poisons every downstream step, so the wrong answer is wrong for a reason that lives far from the answer. Localizing the first error names the root cause, which is what you actually need to fix the harness or tool and to guard against the same regression. It maps directly onto the Evals & Diagnostics principle that you cannot fix what you cannot locate.

TELBench localizes where deep-research agents go wrong — Span-level error localization

TL;DR

What is it: A new benchmark paper introduces TELBench and DRIFT — a 1,000-instance benchmark plus a method that performs span-level error localization: instead of grading only an agent's final answer, it points at the specific step in the agent's trajectory where the answer first went wrong.
Why it’s needed: Deep-research agents fail through long trajectories of search, tool calls, and synthesis, and a pass/fail on the final answer tells you that it failed but never where — so you can't fix the harness, the tool, or the prompt that actually broke.
vs previous: It moves past final-answer-only evaluation (one ✗ for the whole run) to naming the first-error span, improving span-level and first-error localization accuracy by up to 30 percentage points over the baselines the paper tests.

Jargon

Deep-research agent: An agent that answers a hard question by running a long loop — searching the web, opening sources, inspecting evidence, calling tools, and synthesizing — rather than replying in one shot. The full sequence of those moves is its trajectory. See AI Agents → Agent Loop & State.
Trajectory: The ordered record of every step an agent took — each search, tool call, observation, and intermediate thought. It is the thing you replay when debugging, and the unit observability tools chop into spans.
Final-answer evaluation: Scoring only the agent's last output — pass or fail against a reference. Cheap and unambiguous, but it collapses a whole multi-step run into one bit and hides which step caused the failure. This is the pass/fail limitation the paper attacks.
Span-level error localization: Attributing an agent's failure to a specific span — one step or a contiguous segment of the trajectory — instead of to the answer as a whole. The output is a pointer, not a grade.
First error: The earliest step that introduced an unrecoverable mistake. It matters most because every later step inherits it, so first-error accuracy — did the localizer name the right earliest step — is the headline metric.
DRIFT: The paper's localization method. It inspects the trajectory and predicts the offending span and the first-error step, rather than re-scoring the final answer. TELBench is the labeled benchmark used to train and evaluate it.
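
The vocabulary above maps naturally onto a few small data types. What follows is a minimal sketch, not the paper's schema: the names `Step`, `Span`, `Trajectory`, `Localization`, and every field on them are hypothetical, chosen only to show how a span is a pointer into a trajectory rather than a grade.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Step:
    index: int    # position in the trajectory, starting at 0
    kind: str     # hypothetical label, e.g. "search", "open", "read_table"
    content: str  # what the agent did or observed at this step


@dataclass(frozen=True)
class Span:
    start: int  # first step index in the span (inclusive)
    end: int    # last step index in the span (inclusive)

    def contains(self, step_index: int) -> bool:
        return self.start <= step_index <= self.end


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

    def steps_in(self, span: Span) -> List[Step]:
        # Return the steps a span points at, ready to replay or inspect.
        return self.steps[span.start : span.end + 1]


@dataclass(frozen=True)
class Localization:
    error_span: Span       # the offending segment of the trajectory
    first_error_step: int  # the earliest step that introduced the mistake
```

Under these assumptions, a localizer's output is just a `Localization`: two indices into the trajectory that tell you which steps to replay.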

The news. On June 1, 2026, researchers posted TELBench and DRIFT to arXiv. TELBench is a 1,000-instance benchmark distilled from 2,790 real agent trajectories; DRIFT is a method that localizes the offending span of a deep-research run, predicting where the first error occurs, rather than only judging the final answer. The paper reports DRIFT improving both span-level error localization and first-error accuracy by up to 30 percentage points over baselines.

Picture the GPS replaying your drive. You set out for the right town and ended up in the wrong one — that is all a final-answer grader knows: wrong destination, trip failed. It cannot hand you the fix, because the fix is not at the destination. Somewhere back along the route you took one wrong turn, and every mile after it — however carefully driven — only carried you deeper into the mistake. What you actually want is the replay that rewinds the whole drive and circles the exact turn where you first left the route. That circled turn is a span, and naming it is what span-level error localization does for an agent.

A deep-research agent fails the same way. It runs a long trajectory — parse the task, search, open a source, read a table, cross-check, synthesize, answer — and one early slip (say, a figure misread from a table) quietly poisons everything downstream. Grade only the final answer and you get one bit: ✗. You know the run failed; you have no idea whether the bug is in the search query, the retrieval, the table parser, or the synthesis prompt. That is the error-analysis gap the Evals & Diagnostics module keeps hammering: you cannot fix what you cannot locate. The reason it bites here is that agent failures compound — a small early mistake snowballs across the remaining steps, so the answer is wrong for a reason that lives far from the answer.
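
A toy run makes the compounding concrete. The sketch below is purely illustrative: the four-step pipeline, the function name `run_toy_trajectory`, and the figures 48 and 84 are all invented for this example. One misread at step 1 makes every later step wrong, even though those later steps compute correctly on what they were given.

```python
from typing import List


def run_toy_trajectory(figure_as_read: int) -> List[int]:
    """Toy 4-step run; returns the value produced at each step."""
    row = 3                  # step 0: pick the table row (same in both runs)
    figure = figure_as_read  # step 1: read the figure from that row
    doubled = figure * 2     # step 2: a correct computation on whatever was read
    answer = doubled + 10    # step 3: final answer
    return [row, figure, doubled, answer]


correct = run_toy_trajectory(48)  # the table really says 48
faulty = run_toy_trajectory(84)   # the agent misread it as 84

# What a final-answer grader sees: a single bit.
final_answer_passes = faulty[-1] == correct[-1]

# What a step-level comparison sees: every step from the misread onward differs.
wrong_steps = [i for i, (got, want) in enumerate(zip(faulty, correct)) if got != want]
first_error_step = wrong_steps[0]
```

Here `final_answer_passes` is `False` and `wrong_steps` is `[1, 2, 3]`: three steps carry a wrong value, but only step 1 actually made a mistake. A real run has no reference trajectory to diff against, which is why localization is hard.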

DRIFT is the replay engine. Instead of re-scoring the output, it walks the trajectory and predicts the first-error span — the earliest step whose mistake the rest of the run inherits. In the GPS picture it does not just confirm you are in the wrong town; it rewinds to the intersection and circles it. That turns a verdict into an action: the localized step is a span you can replay, a root cause you can attach to the right tool or prompt, and a regression you can guard against next time.
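
The source does not describe DRIFT's internals, so the following is not DRIFT's algorithm. It is only a sketch of the general shape of first-error localization: walk the steps in order and return the earliest one that a checker rejects. The function `locate_first_error`, the `toy_checker`, and the example steps are all hypothetical; in practice the checker is the hard part.

```python
from typing import Callable, Optional, Sequence

# A checker receives the whole trajectory plus the index under inspection,
# so it may use earlier steps as context. This signature is an assumption.
StepChecker = Callable[[Sequence[str], int], bool]


def locate_first_error(steps: Sequence[str], is_step_sound: StepChecker) -> Optional[int]:
    """Return the index of the earliest step the checker rejects, or None."""
    for index in range(len(steps)):
        if not is_step_sound(steps, index):
            return index
    return None


def toy_checker(steps: Sequence[str], index: int) -> bool:
    # Stand-in for a real verifier: flags steps carrying a planted marker.
    return "MISREAD" not in steps[index]


steps = [
    "search: company revenue 2023",
    "open: annual report",
    "read table: revenue = 84 (MISREAD, table says 48)",
    "synthesize: revenue doubled is 168",
    "answer: 168",
]
first_error = locate_first_error(steps, toy_checker)
```

With the planted marker, `first_error` is `2`, the table read. The later steps are left unflagged because they are faithful to their (wrong) input, which is the distinction between the first error and the steps that merely inherit it.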

Evaluation granularity	What it returns	What it can't tell you
Final-answer eval	pass / fail on the output	which step broke the run
LLM-judge rubric	a score + a rationale	still answer-level — no step pointer
Span-level localization (DRIFT)	the offending span + first-error step	— (up to +30 pp first-error accuracy vs baselines, paper)

Here is why the localization signal is so much richer than the grade. TELBench's real scale is 1,000 instances distilled from 2,790 trajectories. Now picture one illustrative failing run of a dozen steps. A final-answer grader emits exactly one bit for it (failed) and says nothing about which of those dozen steps went wrong. A localizer has to do something far harder: point at the one right step out of twelve, where uniform random guessing would hit the first-error step only about one time in twelve (≈ 8%). Against that difficulty, DRIFT's reported gain of up to 30 percentage points in first-error accuracy is substantial: at the top of that range it would lift a localizer from, say, a 45% baseline to roughly 75%. Only the up-to-30 pp figure and the 1,000 / 2,790 counts come from the paper; the dozen-step run and the 45% → 75% endpoints are illustrative. The grade was one bit; the localized span is a repair ticket.
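
The arithmetic in that paragraph can be checked in a few lines. This is a sketch under stated assumptions: the source does not give the paper's exact metric definition, so first-error accuracy is taken here to be exact match between the predicted and labeled first-error step. The function names and sample predictions are hypothetical.

```python
from typing import Sequence


def chance_accuracy(num_steps: int) -> float:
    """Accuracy of guessing the first-error step uniformly at random."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return 1.0 / num_steps


def first_error_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of runs where the predicted first-error step equals the label.

    Exact-match scoring is an assumption, not the paper's stated definition.
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal length")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


# The illustrative dozen-step run: blind guessing lands about 8% of the time.
blind_guess_pct = round(chance_accuracy(12) * 100, 1)

# The illustrative endpoints: a 45% baseline plus the top-of-range 30 pp gain.
baseline_pct = 45
improved_pct = baseline_pct + 30

# Made-up predictions for four runs, three of which match the labels.
accuracy = first_error_accuracy(predicted=[2, 5, 0, 7], gold=[2, 4, 0, 7])
```

Here `blind_guess_pct` is `8.3`, `improved_pct` is `75`, and `accuracy` is `0.75`. Note that the gain is in percentage points (45% to 75%), not a 30% relative improvement.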

Goes deeper in: AI Agents → Evals & Diagnostics → Error analysis first

Related explainers

Agent-harness scaling law — Effective Feedback Compute (EFC) — the flip side: feedback that localizes the error is what actually predicts agent success
FutureSim — harness-level agent eval — why the trend is evaluating the agent's process, not just its final answer
PushBench — Quantitative Goal Persistence (QGP) — another move from one final number to a richer, step-aware reliability signal
