What does 'distilling delegation into the weights' mean?

It means turning good delegation behavior into training data and fine-tuning a base model on it. SearchSwarm runs a strong model inside a harness that pushes it toward high-quality task decomposition and tidy subagent results, collects those trajectories, and uses supervised fine-tuning so the base model learns when and how to split and delegate by default, instead of relying on a prompt at inference time to make it act like a manager.

Why does SearchSwarm delegate instead of using one agent?

Long-horizon research touches far more evidence than fits coherently in one context window. A single agent that reads everything itself loses track of early findings. Delegating slices to subagents that each return a short, clean summary keeps the orchestrator's working context small, so a 3B-active model can stay coherent across a long task — which is how SearchSwarm-30B-A3B reaches 68.1 on BrowseComp.

How is this different from an RL orchestrator like Maestro?

Maestro learns a routing policy with reinforcement learning and dispatches work to frozen expert models. SearchSwarm instead bakes the decomposition-and-delegation skill into one base model's weights via supervised fine-tuning on harness-generated trajectories. One trains a router over fixed experts; the other trains a single model to be a good delegator.

SearchSwarm hits SOTA on BrowseComp with a 30B agent — Distilling delegation into the weights

Jargon

Orchestrator vs subagent: The two roles in a delegating agent. The orchestrator splits a goal into pieces and hands each out; a subagent runs one piece and reports back. The orchestrator coordinates; it doesn't do the legwork itself.
Task decomposition: Breaking one big goal ("answer this research question") into smaller, independently-runnable sub-tasks ("find the launch date", "check the spec"). Good decomposition is the hard part of delegation.
SFT (supervised fine-tuning): Training a model on example (input → desired output) pairs. Here the "desired outputs" are good delegation trajectories, so the model learns to imitate them.
Trajectory: The full step-by-step record of an agent solving one task — every decomposition, dispatch, and sub-result. SearchSwarm collects high-quality trajectories and turns them into SFT training data.
Distillation (here): In this paper, distillation means cloning behavior: capture how a strong, well-scaffolded model delegates, then SFT a base model on those traces. It is not the logit-matching (teacher-softmax) distillation used to shrink models.
30B-A3B: A Mixture-of-Experts sizing: 30 billion total parameters, of which only ~3 billion are active per token. Each token pays the compute cost of a ~3B model while the full 30B parameters hold the knowledge.
BrowseComp: A hard benchmark for web-browsing research agents — questions whose answers require chasing evidence across many pages. BrowseComp-ZH is the Chinese-language counterpart.
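
To make the 30B-A3B entry concrete, here is a toy parameter count for a Mixture-of-Experts model. The expert count, expert size, shared-parameter size, and top-k value are all invented so the totals come out round; they are assumptions for illustration, not SearchSwarm's actual architecture, and the sketch ignores per-layer structure.

```python
# Toy parameter accounting for a Mixture-of-Experts model.
# Every number below is invented so the totals come out round;
# none of it is SearchSwarm's actual architecture.
SHARED_PARAMS = 1_000_000_000      # attention, embeddings, etc., used by every token
NUM_EXPERTS = 116                  # hypothetical expert count
PARAMS_PER_EXPERT = 250_000_000    # hypothetical size of one expert
TOP_K = 8                          # hypothetical number of experts routed per token


def total_params() -> int:
    """Everything stored in the model: shared weights plus all experts."""
    return SHARED_PARAMS + NUM_EXPERTS * PARAMS_PER_EXPERT


def active_params() -> int:
    """What one token actually runs through: shared weights plus its top-k experts."""
    return SHARED_PARAMS + TOP_K * PARAMS_PER_EXPERT


print(f"total:  {total_params() / 1e9:.0f}B")
print(f"active: {active_params() / 1e9:.0f}B")
```

With these made-up settings the model stores 30B parameters but each token touches only 3B, which is the shape the "30B-A3B" label describes.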

The news. On June 8, 2026, researchers released SearchSwarm, a deep-research agent that learns to delegate. Instead of prompting a generic model to act like a manager, the authors build a harness that pushes a strong model toward high-quality task decomposition, force its subagents to return tidy results, and then use those runs as supervised fine-tuning data, baking delegation into the base model's weights. The resulting SearchSwarm-30B-A3B reports 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best among comparable-scale models.

Picture a veteran general contractor on a job site. They never pick up a hammer. Their entire skill is delegation: look at "build a house," instantly split it into framing, plumbing, and wiring, hand each to the right subcontractor with a tight, one-line scope, and accept only a clean report back — "framing's up, inspected" — not a truckload of sawdust dumped on the office floor. A rookie GC with the same blueprint flails: they forget to split a phase, micromanage one sub while another stalls, and end up buried in detail they should have delegated away. Same blueprint, wildly different outcome — and the difference is a learned skill, not a checklist.

That gap is exactly the one SearchSwarm closes for long-horizon web research. A single agent that tries to answer a hard BrowseComp question by reading every page itself drowns in its own context — by the hundredth search result, the first lead has scrolled out of attention's reach. The known remedy is the orchestrator–workers shape: the orchestrator dispatches subagents that each chase one slice and return a short summary, so the manager's working context stays small and coherent. The catch is that prompting a general model to play this manager is brittle — it was never trained to decompose, so it skips the split, or it lets a subagent dump its whole transcript back and re-floods the very context delegation was supposed to protect.
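
The orchestrator–workers shape described above can be sketched in a few lines. This is a minimal stand-in, not SearchSwarm's harness: `run_subagent`, `orchestrate`, `SubResult`, and `SUMMARY_WORD_LIMIT` are hypothetical names, the "subagent" just truncates text instead of calling a model, and words stand in for tokens.

```python
from dataclasses import dataclass

SUMMARY_WORD_LIMIT = 30  # hypothetical cap on what a subagent may hand back


@dataclass
class SubResult:
    task: str
    summary: str


def run_subagent(task: str, pages: list[str]) -> SubResult:
    """Stand-in worker: reads its pages in its own context, returns a short summary."""
    transcript = " ".join(pages)   # the full text never leaves this function
    words = transcript.split()
    summary = " ".join(words[:SUMMARY_WORD_LIMIT])  # placeholder for a real summary
    return SubResult(task=task, summary=summary)


def orchestrate(slices: dict[str, list[str]]) -> list[SubResult]:
    """Dispatch one subagent per slice and keep only the capped summaries."""
    return [run_subagent(task, pages) for task, pages in slices.items()]


if __name__ == "__main__":
    slices = {
        "find the launch date": ["page text " * 500],
        "check the spec": ["spec text " * 500],
    }
    results = orchestrate(slices)
    held = sum(len(r.summary.split()) for r in results)
    print(f"orchestrator holds {held} words across {len(results)} summaries")
```

The design point is the return type: the orchestrator only ever sees a `SubResult`, so a subagent cannot dump its whole transcript back into the manager's context.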

SearchSwarm's move is to stop hoping and start training. The authors wrap a strong model in a delegation harness that nudges it toward good decompositions and constrains every subagent to return a clean, formatted result. The high-quality runs that fall out — full trajectories of "split here, dispatch this, accept that summary" — become supervised fine-tuning data. Fine-tune the base model on them and delegation stops being a fragile prompt and becomes a reflex baked into the weights: the veteran contractor, not the rookie reading the manual. Because the model now keeps each subagent's context isolated by default, a tidy orchestrator context is the trained behavior, not a lucky one.
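
One way to picture the step from harness runs to training data is the sketch below. The trajectory schema, the `keep` filter (correct final answer plus at least one dispatch), and the prompt/completion layout are all assumptions made for illustration; the source does not specify how the authors select or serialize trajectories.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str   # hypothetical labels: "decompose", "dispatch", "accept"
    content: str


@dataclass
class Trajectory:
    question: str
    steps: list[Step] = field(default_factory=list)
    answer_correct: bool = False


def keep(traj: Trajectory) -> bool:
    """Assumed quality filter: the run got the answer right and actually delegated."""
    return traj.answer_correct and any(s.action == "dispatch" for s in traj.steps)


def to_sft_example(traj: Trajectory) -> dict[str, str]:
    """Flatten one trajectory into an (input -> desired output) pair."""
    completion = "\n".join(f"[{s.action}] {s.content}" for s in traj.steps)
    return {"prompt": traj.question, "completion": completion}


def build_dataset(trajs: list[Trajectory]) -> list[str]:
    """Return JSON lines for the trajectories that pass the filter."""
    return [json.dumps(to_sft_example(t)) for t in trajs if keep(t)]


if __name__ == "__main__":
    good = Trajectory(
        question="When did the product launch?",
        steps=[
            Step("decompose", "find the launch date; check the spec"),
            Step("dispatch", "subagent 1: find the launch date"),
            Step("accept", "launch date confirmed by two sources"),
        ],
        answer_correct=True,
    )
    bad = Trajectory(question="Same question", steps=[], answer_correct=False)
    for line in build_dataset([good, bad]):
        print(line)
```

Only the run that both delegated and answered correctly survives, which mirrors the idea that the model should imitate good delegation rather than every run the harness produces.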

Where delegation lives	How you get it	Failure mode	Example
In the prompt / harness, at inference	scaffold a generic model with an orchestrator–workers prompt	brittle — the model was never trained to split work	generic orchestrator–workers agent
In a learned routing policy	RL trains an orchestrator to route to frozen expert models	needs reward design; experts stay fixed	Maestro
In the base model's weights	SFT on harness-generated delegation trajectories	needs a good teacher harness to generate the traces	SearchSwarm-30B-A3B

Why does a clean context matter enough to train for it? Walk the budget (token counts here are illustrative; the paper reports the benchmark scores, not these figures). Say a BrowseComp question needs evidence from 40 web pages, and each raw page is ~2,000 tokens. A single agent that reads them all carries 40 × 2,000 = 80,000 tokens of raw page text in its working context, and long before the end, the early evidence has fallen out of reach. SearchSwarm's orchestrator instead splits the hunt into, say, 5 sub-searches, hands each to a subagent, and gets back a 200-token verified summary from each. The orchestrator's context now holds just 5 × 200 = 1,000 tokens of clean findings, an 80× smaller working set, so it can still reason over the first lead when it reaches the last. That preserved coherence is what lets a 3B-active model stay on track across a long horizon and post 68.1 on BrowseComp, the best among comparable-scale models.
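
The budget walk above is simple enough to check in code. The constants are the same illustrative numbers from the paragraph, not figures from the paper; the per-subagent load additionally assumes the 40 pages split evenly across the 5 sub-searches.

```python
# Illustrative numbers from the budget walk, not figures from the paper.
PAGES = 40
TOKENS_PER_PAGE = 2_000
SUB_SEARCHES = 5
TOKENS_PER_SUMMARY = 200


def single_agent_context(pages: int, tokens_per_page: int) -> int:
    """Raw page text a lone agent carries if it reads every page itself."""
    return pages * tokens_per_page


def orchestrator_context(sub_searches: int, tokens_per_summary: int) -> int:
    """Summary text the orchestrator holds when subagents do the reading."""
    return sub_searches * tokens_per_summary


solo = single_agent_context(PAGES, TOKENS_PER_PAGE)
delegated = orchestrator_context(SUB_SEARCHES, TOKENS_PER_SUMMARY)
reduction = solo / delegated

# Assumption: pages are divided evenly, so each subagent reads PAGES / SUB_SEARCHES pages.
per_subagent = (PAGES // SUB_SEARCHES) * TOKENS_PER_PAGE

print(f"single agent:  {solo} tokens")
print(f"orchestrator:  {delegated} tokens ({reduction:.0f}x smaller)")
print(f"each subagent: {per_subagent} tokens of raw pages")
```

Note that the raw page text does not disappear: under the even-split assumption each subagent still reads 16,000 tokens. Delegation moves that load into isolated contexts so the orchestrator never has to carry it.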


Related explainers

Maestro — RL orchestrator over frozen experts — a different way to get delegation: learn a routing policy instead of baking decomposition into one model's weights
MSR delegation study — fidelity drift over iterations — what goes wrong when delegation is sloppy: detail leaks and degrades down the chain
GrepSeek — GRPO-trained shell-command search — another search agent trained (with RL, not SFT) to do its job well


