ACC — tool-output unmasking turns agent trajectories into long-context QA

LLM
L
Standard SFTmulti-turn trajectory(tool outputs maskedfrom the loss)USEREV sales Q1?ASSIST<think>loss ✓CALLsearch(...)OUTPUT ~800maskedno lossASSIST<think>loss ✓CALLsearch(...)OUTPUT ~600maskedno lossANSWER≈ 3.2Mloss ✓same content, different SFT formattool outputs go from masked → unmasked contextACC SFTlong-context QA pair(tool outputs becomeinput context)QUERYEV sales Q1?CONTEXT (~1400 tok — unmasked tool outputs concatenated)[1400 tokens — both tool outputs concatenated, fully visible]ANSWER≈ 3.2Mloss targetLong-context signalstandard SFT18% coverageACC SFT15% coverage
learnaivisually.com/ai-explained/acc-tool-output-unmasking

The news. On May 21, 2026, a USTC team posted a paper showing that the same agent trajectory data can be turned into long-context training pairs by unmasking the tool outputs. On the recipe — they call it ACC — Qwen3-30B-A3B gains +18.1 points on MRCR (long-context retrieval) and +7.6 on GraphWalks, matching Qwen3-235B-A22B at roughly 8× fewer parameters, while general capability on GPQA, MMLU-Pro, AIME, and IFEval is preserved. Read the paper →

Picture a math textbook with a strange layout. The question is on the first page — "How many electric vehicle units sold in Q1 2026?" — and the final answer is on the last page — "approximately 3.2 million units." In between, where the worked solution should be, every line is blacked out. A student studying this textbook sees that a question maps to an answer, but never learns the chain of reasoning that connects them. That is exactly the shape of a standard agent SFT trajectory: the assistant turns and the final answer carry training signal, but the tool outputs in between — search results, code outputs, file contents — are masked from the loss because they were emitted by the environment, not the model.

The ACC paper's move is to un-redact the textbook. It takes each trajectory — query, every intermediate assistant turn, every tool call and every tool output — strips out the multi-turn role structure, and emits a single supervised pair: (query + assembled context → final answer). The tool outputs are no longer masked; they are now part of the input the model conditions on, with the answer-loss attending back over the full evidence chain. The model is forced to integrate distant pieces of evidence end-to-end, exactly the skill that long-context retrieval demands. Crucially, no new annotation is needed — the data already exists in any team's agent-trajectory log.

Where the long-context signal actually comes from

Suppose an agent answers "How did EV sales shift in Q1 2026?" with two search calls. The first tool output returns ~800 tokens of news context; the second returns ~600 tokens of regional breakdown. In standard SFT, those 1,400 tokens of tool output are excluded from the loss — the model sees them while generating but never learns to draw evidence from them. ACC strips the role structure and emits a single training pair: [user query | 1,400 tokens of assembled context → final answer]. The loss now propagates back through the full 1,400-token attention chain, teaching the model to integrate distant evidence end-to-end. (Illustrative — the paper does not publish per-trajectory token budgets, but the numbers above are within the range typical of real agent traces.)

The benchmark deltas put the recipe's value in concrete terms. On MRCR, Qwen3-30B-A3B climbs from its base score by +18.1 points to 68.3 after ACC SFT. On GraphWalks, it climbs +7.6 points to 77.5. The most striking comparison is parameter efficiency: Qwen3-30B-A3B after ACC matches Qwen3-235B-A22B on the same retrieval benchmark — roughly 8× fewer parameters for the same long-context behavior. General capability is preserved on GPQA, MMLU-Pro, AIME, and IFEval, which matters because aggressive training-data reshaping often costs accuracy on out-of-distribution evals.

How ACC compares to other ways to teach long-context

ApproachData sourceWhat the loss attends toCost to acquire
Standard agent SFTMulti-turn agent trajectoriesAssistant tokens only; tool outputs maskedAlready in your trajectory log
Synthetic long-context QA (e.g., needle-in-haystack)Hand-constructed or LLM-synthesizedAnswer tokens attending over a synthetic long inputNew annotation pipeline + quality risk
ACC (this paper)Same agent trajectories, unmasked and flattenedAnswer tokens attending over the full unmasked evidence chainZero new annotation — re-uses existing logs

The structural takeaway lands in two places at once. For an LLM training team, ACC is a near-free upgrade to long-context behavior: the data is already in the log, only the format changes. For an agent-engineering team, it reframes what an agent trajectory is for — not just a record of how the agent solved one task, but a long-context QA exemplar for the next round of base-model training. The same trace that drove a single agent run becomes the training data that teaches the next model to handle the kind of long-context retrieval the agent was doing the slow way with tool calls.

Goes deeper in: AI Agents → Context Engineering → Context as Scarce Resource — the lens that explains why long-context retrieval is the throughput bottleneck ACC is trying to lift in the base model.

Related explainers

Frequently Asked Questions