ACC — tool-output unmasking turns agent trajectories into long-context QA
The news. On May 21, 2026, a USTC team posted a paper showing that existing agent trajectory data can be turned into long-context training pairs by unmasking the tool outputs. With this recipe, which they call ACC, Qwen3-30B-A3B gains +18.1 points on MRCR (long-context retrieval) and +7.6 on GraphWalks, matching Qwen3-235B-A22B at roughly 8× fewer parameters, while general capability on GPQA, MMLU-Pro, AIME, and IFEval is preserved.
Picture a math textbook with a strange layout. The question is on the first page ("How many electric vehicle units were sold in Q1 2026?") and the final answer is on the last page ("approximately 3.2 million units"). In between, where the worked solution should be, every line is blacked out. A student studying this textbook sees that a question maps to an answer, but never learns the chain of reasoning that connects them. That is the shape of a standard agent SFT trajectory: the assistant turns and the final answer carry training signal, but the tool outputs in between (search results, code outputs, file contents) are masked from the loss because they were emitted by the environment, not the model.
The ACC paper's move is to un-redact the textbook. It takes each trajectory (the query, every intermediate assistant turn, every tool call, and every tool output), strips out the multi-turn role structure, and emits a single supervised pair: (query + assembled context → final answer). The tool outputs are no longer masked; they are part of the input the model conditions on, and the loss on the answer tokens is computed with attention over the full evidence chain. The model is forced to integrate distant pieces of evidence end-to-end, which is exactly the skill that long-context retrieval demands. Crucially, no new annotation is needed: the data already exists in any team's agent-trajectory log.
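The flattening step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Turn` schema, the `flatten_trajectory` helper, the plain-newline joining of context, and the placeholder turn contents are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool" (hypothetical schema)
    content: str


def flatten_trajectory(turns: List[Turn]) -> Tuple[str, str]:
    """Collapse a multi-turn trajectory into one (prompt, answer) pair.

    Assumes the first turn is the user query and the last turn is the
    assistant's final answer; everything in between becomes context.
    """
    if len(turns) < 2 or turns[0].role != "user" or turns[-1].role != "assistant":
        raise ValueError("expected a user query first and an assistant answer last")

    query = turns[0].content
    answer = turns[-1].content
    # Intermediate assistant turns, tool calls, and tool outputs stay in
    # their original order, but their role markers are dropped.
    context = "\n\n".join(t.content for t in turns[1:-1])
    prompt = f"{query}\n\n{context}" if context else query
    return prompt, answer


# Placeholder trajectory; the bracketed strings stand in for real tool output.
trajectory = [
    Turn("user", "How many electric vehicle units were sold in Q1 2026?"),
    Turn("assistant", "search('EV sales Q1 2026')"),
    Turn("tool", "<news context>"),
    Turn("assistant", "search('EV sales Q1 2026 by region')"),
    Turn("tool", "<regional breakdown>"),
    Turn("assistant", "approximately 3.2 million units"),
]
prompt, answer = flatten_trajectory(trajectory)
```

Here `prompt` holds the query followed by the whole evidence chain, and `answer` is the only supervised target. How the paper actually orders, delimits, or filters the assembled context is not specified in this article, so the newline join is only a stand-in.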
Where the long-context signal actually comes from
Suppose an agent answers "How did EV sales shift in Q1 2026?" with two search calls. The first tool output returns ~800 tokens of news context; the second returns ~600 tokens of regional breakdown. In standard SFT, those 1,400 tokens of tool output are excluded from the loss — the model sees them while generating but never learns to draw evidence from them. ACC strips the role structure and emits a single training pair: [user query | 1,400 tokens of assembled context → final answer]. The loss now propagates back through the full 1,400-token attention chain, teaching the model to integrate distant evidence end-to-end. (Illustrative — the paper does not publish per-trajectory token budgets, but the numbers above are within the range typical of real agent traces.)
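To make the standard-SFT side of this example concrete, the sketch below builds a per-token loss mask for the two-search trajectory. Only the 800 and 600 token counts come from the example above; the other segment lengths and the `standard_sft_loss_mask` helper are made-up placeholders for illustration.

```python
from typing import List, Tuple

# (role, token_count) segments. Only 800 and 600 come from the example in the
# text; every other count is a hypothetical placeholder.
segments: List[Tuple[str, int]] = [
    ("user", 15),
    ("assistant", 20),   # first search call
    ("tool", 800),       # news context
    ("assistant", 20),   # second search call
    ("tool", 600),       # regional breakdown
    ("assistant", 50),   # final answer
]


def standard_sft_loss_mask(segs: List[Tuple[str, int]]) -> List[int]:
    """Return 1 for tokens that contribute to the loss, 0 for masked tokens.

    Standard agent SFT supervises assistant tokens only.
    """
    mask: List[int] = []
    for role, n_tokens in segs:
        mask.extend([1 if role == "assistant" else 0] * n_tokens)
    return mask


mask = standard_sft_loss_mask(segments)
tool_tokens = sum(n for role, n in segments if role == "tool")
supervised_tokens = sum(mask)
total_tokens = len(mask)
```

With these placeholder lengths, 1,400 of the 1,505 tokens are tool output that never contributes to the loss, and only 90 tokens are supervised.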
The benchmark deltas put the recipe's value in concrete terms. On MRCR, Qwen3-30B-A3B climbs 18.1 points from its base score to 68.3 after ACC SFT. On GraphWalks, it climbs 7.6 points to 77.5. The most striking comparison is parameter efficiency: Qwen3-30B-A3B after ACC matches Qwen3-235B-A22B on the same retrieval benchmark, with roughly 8× fewer parameters for the same long-context behavior. General capability is preserved on GPQA, MMLU-Pro, AIME, and IFEval, which matters because aggressive training-data reshaping often costs accuracy on out-of-distribution evals.
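The arithmetic behind these figures can be checked directly. The sketch below derives the implied base scores from the reported post-ACC scores and gains, and a parameter ratio that assumes the nominal sizes in the model names (30B and 235B total parameters); the variable names are illustrative.

```python
# Reported post-ACC scores and gains, as quoted in the text.
acc_scores = {"MRCR": 68.3, "GraphWalks": 77.5}
gains = {"MRCR": 18.1, "GraphWalks": 7.6}

# Implied base score = post-ACC score minus the reported gain.
implied_base = {
    name: round(acc_scores[name] - gains[name], 1) for name in acc_scores
}

# Assumption: total parameter counts are taken at face value from the model
# names (Qwen3-30B-A3B vs. Qwen3-235B-A22B).
total_param_ratio = 235 / 30

print(implied_base)
print(f"{total_param_ratio:.1f}x")
```

The implied base scores are 50.2 on MRCR and 69.9 on GraphWalks, and the total-parameter ratio comes out near 7.8, which is consistent with the "roughly 8×" figure.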
How ACC compares to other ways to teach long-context
| Approach | Data source | What the loss attends to | Cost to acquire |
|---|---|---|---|
| Standard agent SFT | Multi-turn agent trajectories | Assistant tokens only; tool outputs masked | Already in your trajectory log |
| Synthetic long-context QA (e.g., needle-in-haystack) | Hand-constructed or LLM-synthesized | Answer tokens attending over a synthetic long input | New annotation pipeline + quality risk |
| ACC (this paper) | Same agent trajectories, unmasked and flattened | Answer tokens attending over the full unmasked evidence chain | Zero new annotation — re-uses existing logs |
The structural takeaway lands in two places at once. For an LLM training team, ACC is a near-free upgrade to long-context behavior: the data is already in the log, only the format changes. For an agent-engineering team, it reframes what an agent trajectory is for — not just a record of how the agent solved one task, but a long-context QA exemplar for the next round of base-model training. The same trace that drove a single agent run becomes the training data that teaches the next model to handle the kind of long-context retrieval the agent was doing the slow way with tool calls.
Goes deeper in: AI Agents → Context Engineering → Context as Scarce Resource — the lens that explains why long-context retrieval is the throughput bottleneck ACC is trying to lift in the base model.
Related explainers
- EnvFactory — synthetic envs for tool-use agent training — the other end of the agent-SFT data pipeline: generating the trajectories ACC then reformats
- FutureSim — harness-level agent eval vs single-shot QA — the evaluation surface that shows whether ACC-style training data actually transfers to multi-turn behavior
- TIM — Training-Inference Mismatch in RL — a different SFT-format failure mode: training data that does not match the distribution the model sees at inference