The news. On June 15, 2026, Microsoft released FastContext, a system that attacks the most expensive thing a coding agent does: finding the right code. Analyzing GPT-5.4 trajectories, the authors found reading and searching accounted for 56.2% of tool-use turns and 46.5% of the main agent's total tokens. FastContext trains dedicated 4B-30B exploration models that the main agent queries in natural language; the explorer fires read-only Read/Glob/Grep calls in parallel and returns focused file-line citations. Plugged into Mini-SWE-Agent, it reports up to +5.5% resolution rate and up to 60% fewer tokens. Weights are open on Hugging Face.

Picture yourself at a small desk in a vast library, trying to answer one question. The naive way is to walk the stacks yourself, haul every promising book back, and stack them on the desk — and within a dozen volumes the desk is buried, the early books slide onto the floor, and you can't even see the question anymore. The desk is the bottleneck, and you filled it with raw material you mostly didn't need. A coding agent does exactly this when it greps and reads files itself: every file it opens lands in its own context window, and long before it starts writing the fix, the window is full of source it skimmed once and will never look at again.

That's not a small inefficiency; it's the inefficiency. When FastContext's authors traced real GPT-5.4 coding runs, reading and searching the repository accounted for 56.2% of tool-use turns and 46.5% of the main agent's tokens. Roughly half the agent's entire budget goes to finding code, not changing it. And exploration is the most context-poisoning kind of work there is: it pulls in big, low-signal blobs of text whose only useful output is usually a single line number.

So FastContext stops doing the exploring on the main desk. It sends a librarian into the stacks. The main agent delegates a natural-language query — "where is the retry budget enforced?" — to a separate explorer subagent, a 4B-30B model trained for exactly this. The explorer reads, globs, and greps its way through the repo in parallel read-only calls, then hands back not an armful of files but an index card: scheduler/retry.go:88-104, the exact evidence. The main agent's desk stays clear, holding citations instead of haystacks — the reading happened, but the bulk never touched the context that has to reason. Because the explorer only ever uses read-only tools, running a swarm of those searches at once is safe by construction.
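To make the index-card idea concrete, here is a minimal sketch of the read-only, fan-out half of that loop. It is not FastContext's implementation: the real explorer is a trained model that decides which Read/Glob/Grep calls to make, while this sketch substitutes a fixed regex, and the names `Citation`, `grep_file`, and `explore` are hypothetical. What it does show is the shape of the hand-off: files are scanned in parallel, and only `path:start-end` strings come back.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional
import re


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record: a file plus a 1-indexed line span."""
    path: str
    start: int
    end: int

    def __str__(self) -> str:
        return f"{self.path}:{self.start}-{self.end}"


def grep_file(path: Path, pattern: "re.Pattern[str]") -> Optional[Citation]:
    """Read-only scan of one file; returns the span from first to last match."""
    try:
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    except OSError:
        return None  # unreadable file: skip it, never modify anything
    hits = [n for n, line in enumerate(lines, start=1) if pattern.search(line)]
    if not hits:
        return None
    return Citation(str(path), hits[0], hits[-1])


def explore(root: str, pattern: str, glob: str = "**/*", workers: int = 8) -> List[Citation]:
    """Fan a read-only search out over the repo and return citations only."""
    regex = re.compile(pattern)
    files = sorted(p for p in Path(root).glob(glob) if p.is_file())
    # Every worker only reads, so running them concurrently cannot corrupt state.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: grep_file(p, regex), files))
    return [c for c in results if c is not None]
```

A caller standing in for the main agent would run something like `explore("path/to/repo", r"retry_budget")` and keep only the short strings produced by `str(citation)`; the file contents themselves stay inside the explorer.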

The explorer earns its accuracy in two training stages. First supervised fine-tuning teaches it to imitate good exploration traces; then task-grounded RL rewards it not for searches that merely look thorough but for evidence that actually lets the main agent solve the downstream task. A scout that brings back the wrong shelf is worse than useless, so the reward is tied to the outcome, not the search.
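The difference between rewarding the search and rewarding the outcome can be shown with a toy example. The reward functions below are assumptions made for illustration, not the paper's actual reward; `Episode`, `thoroughness_reward`, and `outcome_reward` are hypothetical names, and the 0.1 weights are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical summary of one explorer rollout."""
    tool_calls: int       # Read/Glob/Grep calls the explorer made
    citations: int        # file-line citations it returned
    task_resolved: bool   # did the main agent solve the task with them?


def thoroughness_reward(ep: Episode) -> float:
    """Proxy reward (illustrative): pays for activity whether or not it helped."""
    return 0.1 * ep.tool_calls + 0.1 * ep.citations


def outcome_reward(ep: Episode) -> float:
    """Task-grounded reward (illustrative): pays only if the task was solved."""
    return 1.0 if ep.task_resolved else 0.0


busy_but_wrong = Episode(tool_calls=40, citations=15, task_resolved=False)
lean_and_right = Episode(tool_calls=6, citations=2, task_resolved=True)

# The proxy prefers the busy rollout; the outcome reward prefers the useful one.
print(thoroughness_reward(busy_but_wrong) > thoroughness_reward(lean_and_right))
print(outcome_reward(lean_and_right) > outcome_reward(busy_but_wrong))
```

Under the proxy, the rollout that searched the most wins even though it brought back the wrong shelf; under the outcome-tied reward, only the rollout that let the main agent finish the job scores at all.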

| Who reads the repo | What lands in the main context | Cost |
| --- | --- | --- |
| Main agent itself (baseline) | every file it opens (raw source) | ~46.5% of tokens spent exploring (paper) |
| A prompted, untrained sub-call | often the whole transcript dumped back | re-floods context; little net saving (illustrative) |
| FastContext explorer subagent | compact file-line citations only | up to 60% fewer tokens, +5.5% resolution (paper) |

Where does a 60% cut actually come from? Walk one task (token counts here are illustrative — the paper reports the percentages, not these absolute numbers). Say solving a bug needs evidence from 12 files averaging 1,500 tokens each. A baseline agent that reads them all carries 18,000 tokens of raw source in its working context — and that's before it writes a line. FastContext's explorer reads the same 12 files in its own scratch context, then returns 12 citations at ~40 tokens each = ~480 tokens. The main agent now reasons over ~480 tokens instead of 18,000 — a ~37× lighter exploration footprint on the desk that matters. Multiply that across a long task where exploration was already 46.5% of the budget, and a headline 60% token reduction stops looking surprising — it's just the haystack never landing on the desk.
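The arithmetic in that walk-through is easy to check. The sketch below uses the same illustrative inputs (12 files, 1,500 tokens per file, ~40 tokens per citation); none of these are figures from the paper, and `exploration_footprint` is a hypothetical helper.

```python
def exploration_footprint(files: int, tokens_per_file: int, tokens_per_citation: int):
    """Compare raw-source tokens against citation tokens in the main context."""
    raw = files * tokens_per_file        # baseline: every file read lands on the desk
    cited = files * tokens_per_citation  # explorer: one short citation per file
    return raw, cited, raw / cited


raw, cited, ratio = exploration_footprint(files=12, tokens_per_file=1500, tokens_per_citation=40)
print(raw, cited, ratio)  # 18000 480 37.5
```

Changing the inputs shows how sensitive the ratio is to citation size: the saving comes almost entirely from returning a pointer instead of the text it points to.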


Frequently Asked Questions