What is ACC tool-output unmasking?

ACC is the recipe from a USTC paper for reformatting multi-turn agent trajectories into long-context QA pairs by unmasking tool outputs. In standard agent SFT, tool outputs are excluded from the loss — the model reads them in context but never learns to generate from them. ACC strips the multi-turn role structure and emits a single (query + assembled context → final answer) supervised pair, with the answer-loss propagating back across the full evidence chain. The result is that the same trajectory data, only reformatted, teaches the model to integrate distant evidence end-to-end. The paper reports +18.1 points on MRCR for Qwen3-30B-A3B, matching Qwen3-235B-A22B at roughly 8x fewer parameters.

Why does ACC SFT outperform standard agent SFT on long-context tasks?

Because the two formats teach different skills. Standard agent SFT trains the model to produce the next assistant turn given the running conversation — useful for tool-call planning, but it never asks the model to integrate distant tool-output evidence into a single answer. ACC flips the training task to 'produce the final answer given a long input that includes the unmasked tool outputs.' That task forces the answer-token loss to attend back over the full evidence chain, which is exactly the behavior long-context retrieval benchmarks like MRCR measure. The paper measures the gap directly: +18.1 MRCR points on Qwen3-30B-A3B after ACC SFT, with general capability preserved on GPQA, MMLU-Pro, AIME, and IFEval.

Does ACC require new annotation or new agent runs?

No. ACC re-uses agent trajectories that any team running an agent harness already collects. The conversion is purely a format change — strip the multi-turn role tags, concatenate the user query with all tool outputs in order, treat the final answer as the supervised target. The paper makes this a selling point: a single existing log of agent trajectories can produce both standard agent-SFT data (multi-turn format) and ACC long-context QA data (flat format) at the cost of a re-serialization pass. The same trace teaches the model two different skills.

ACC paper — Tool-output unmasking

ACC — tool-output unmasking turns agent trajectories into long-context QA

LLM

learnaivisually.com/ai-explained/acc-tool-output-unmasking

Jargon

SFT: Supervised Fine-Tuning — the post-training phase where a base LLM is taught a target behavior on (input, output) pairs. The model's loss is computed only on positions inside the output portion of each pair; everything else is part of the input context.
Loss mask: A per-token flag that tells the trainer which positions contribute to the gradient. In agent SFT, tool outputs are typically masked out (loss flag = 0) because they were "produced" by the environment, not by the model. The model still reads them in context — it just doesn't learn to generate them.
MRCR: Multi-task Retrieval Capability Reasoning — a long-context retrieval benchmark from the Qwen team that scores how well a model can locate and stitch together distant evidence inside a single very long input. Higher is better.
GraphWalks: A long-context benchmark that probes whether a model can follow multi-hop reasoning chains across a graph encoded as text. Co-reported with MRCR in the ACC paper.
Agent trajectory: The full sequence of (user query → assistant reasoning → tool call → tool output → assistant reasoning → … → final answer) that an agent produces while solving a task. Each trajectory is normally used once as agent-SFT data; ACC re-uses the same trajectory in a second format.
Qwen3-235B-A22B: A mixture-of-experts member of the Qwen3 family with ~235B total parameters but only ~22B active per token. Used in the paper as the long-context retrieval ceiling that the much smaller 30B model — after ACC SFT — matches.

The news. On May 21, 2026, a USTC team posted a paper showing that the same agent trajectory data can be turned into long-context training pairs by unmasking the tool outputs. On the recipe — they call it ACC — Qwen3-30B-A3B gains +18.1 points on MRCR (long-context retrieval) and +7.6 on GraphWalks, matching Qwen3-235B-A22B at roughly 8× fewer parameters, while general capability on GPQA, MMLU-Pro, AIME, and IFEval is preserved. Read the paper →

Picture a math textbook with a strange layout. The question is on the first page — "How many electric vehicle units sold in Q1 2026?" — and the final answer is on the last page — "approximately 3.2 million units." In between, where the worked solution should be, every line is blacked out. A student studying this textbook sees that a question maps to an answer, but never learns the chain of reasoning that connects them. That is exactly the shape of a standard agent SFT trajectory: the assistant turns and the final answer carry training signal, but the tool outputs in between — search results, code outputs, file contents — are masked from the loss because they were emitted by the environment, not the model.

The ACC paper's move is to un-redact the textbook. It takes each trajectory — query, every intermediate assistant turn, every tool call and every tool output — strips out the multi-turn role structure, and emits a single supervised pair: (query + assembled context → final answer). The tool outputs are no longer masked; they are now part of the input the model conditions on, with the answer-loss attending back over the full evidence chain. The model is forced to integrate distant pieces of evidence end-to-end, exactly the skill that long-context retrieval demands. Crucially, no new annotation is needed — the data already exists in any team's agent-trajectory log.

Where the long-context signal actually comes from

Suppose an agent answers "How did EV sales shift in Q1 2026?" with two search calls. The first tool output returns ~800 tokens of news context; the second returns ~600 tokens of regional breakdown. In standard SFT, those 1,400 tokens of tool output are excluded from the loss — the model sees them while generating but never learns to draw evidence from them. ACC strips the role structure and emits a single training pair: [user query | 1,400 tokens of assembled context → final answer]. The loss now propagates back through the full 1,400-token attention chain, teaching the model to integrate distant evidence end-to-end. (Illustrative — the paper does not publish per-trajectory token budgets, but the numbers above are within the range typical of real agent traces.)

The benchmark deltas put the recipe's value in concrete terms. On MRCR, Qwen3-30B-A3B climbs from its base score by +18.1 points to 68.3 after ACC SFT. On GraphWalks, it climbs +7.6 points to 77.5. The most striking comparison is parameter efficiency: Qwen3-30B-A3B after ACC matches Qwen3-235B-A22B on the same retrieval benchmark — roughly 8× fewer parameters for the same long-context behavior. General capability is preserved on GPQA, MMLU-Pro, AIME, and IFEval, which matters because aggressive training-data reshaping often costs accuracy on out-of-distribution evals.

How ACC compares to other ways to teach long-context

Approach	Data source	What the loss attends to	Cost to acquire
Standard agent SFT	Multi-turn agent trajectories	Assistant tokens only; tool outputs masked	Already in your trajectory log
Synthetic long-context QA (e.g., needle-in-haystack)	Hand-constructed or LLM-synthesized	Answer tokens attending over a synthetic long input	New annotation pipeline + quality risk
ACC (this paper)	Same agent trajectories, unmasked and flattened	Answer tokens attending over the full unmasked evidence chain	Zero new annotation — re-uses existing logs

The structural takeaway lands in two places at once. For an LLM training team, ACC is a near-free upgrade to long-context behavior: the data is already in the log, only the format changes. For an agent-engineering team, it reframes what an agent trajectory is for — not just a record of how the agent solved one task, but a long-context QA exemplar for the next round of base-model training. The same trace that drove a single agent run becomes the training data that teaches the next model to handle the kind of long-context retrieval the agent was doing the slow way with tool calls.

Goes deeper in: AI Agents → Context Engineering → Context as Scarce Resource — the lens that explains why long-context retrieval is the throughput bottleneck ACC is trying to lift in the base model.

Related explainers

EnvFactory — synthetic envs for tool-use agent training — the other end of the agent-SFT data pipeline: generating the trajectories ACC then reformats
FutureSim — harness-level agent eval vs single-shot QA — the evaluation surface that shows whether ACC-style training data actually transfers to multi-turn behavior
TIM — Training-Inference Mismatch in RL — a different SFT-format failure mode: training data that does not match the distribution the model sees at inference

Continue in trackContext Engineering: Context as scarce resource

Frequently Asked Questions

Check what you knowMap your AI & GPU knowledge across every track — free, role-based