The news. On June 29, 2026, the Agents-A1 paper (arXiv 2606.30616) reported that a 35B-parameter Mixture-of-Experts agent matches or beats trillion-parameter systems such as Kimi-K2.6 and DeepSeek-V4-pro on several agentic benchmarks (SEAL-0 56.4, IFBench 80.6). The headline claim: the gains come from extending the agent's reasoning horizon, by training on long-horizon trajectories averaging ~45K tokens, rather than from growing the parameter count.
A medical resident does not become a good doctor by memorizing more textbooks. They become good by working full shifts — start to finish, patient after patient, decisions stacked on decisions, staying oriented when the night drags on. Agents-A1 makes the same bet for AI agents: you build a better long-task agent not by stuffing more knowledge into its weights, but by training it through long, complete task runs.
The usual way to make an agent stronger is to add parameters — the trillion-parameter route taken by Kimi-K2.6 and DeepSeek-V4-pro. More weights mean more knowledge baked in, like a resident who has read more textbooks. But an agentic task is not a single lookup; it is a long chain of decisions where the model has to stay coherent and not lose the thread. Raw textbook recall does not teach that by itself; the skill mostly shows up on a real shift.
Agents-A1 is a 35B Mixture-of-Experts model, small by frontier standards. Its edge comes from where the training compute goes: ~45K-token long-horizon trajectories, each a complete task run from the first action to the final answer. Trained on whole shifts instead of quick consults, the model puts its budget into more reasoning steps rather than more weights, and it learns to keep judging when to push on and when to stop over a long task.
How do you train a small model to absorb that much skill? The recipe is three stages: a broad domain-wide SFT pass, then a specialized teacher model per domain, then multi-teacher on-policy distillation with vocabulary alignment. On-policy is the load-bearing word: the teachers correct the student's calls on the student's own rollouts — the attendings critiquing your decisions on your own patients — not by handing over finished charts. Vocabulary alignment is the plumbing that lets the student learn from teachers with different tokenizers at all.
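The on-policy step and the vocabulary alignment described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the reverse-KL loss, the exact-string token matching, the `web_research` domain name, and every function and variable name here are assumptions made for the example. The point it shows is the control flow: the student samples its own next token, and the domain's teacher only scores the states the student actually reaches.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def align_to_student_vocab(teacher_probs, teacher_vocab, student_vocab, eps=1e-8):
    """Project teacher probabilities onto the student's vocabulary.

    Toy assumption: tokens are matched by exact string; teacher mass on
    tokens the student lacks is dropped, then the result is renormalized.
    """
    by_token = dict(zip(teacher_vocab, teacher_probs))
    aligned = [by_token.get(tok, 0.0) + eps for tok in student_vocab]
    total = sum(aligned)
    return [p / total for p in aligned]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher); an assumed loss, common in on-policy distillation."""
    return sum(s * math.log(s / t) for s, t in zip(student_probs, teacher_probs) if s > 0)


def on_policy_distill_loss(student, student_vocab, teachers, domain, horizon, rng):
    """Average per-step loss along one rollout sampled from the student itself."""
    teacher, teacher_vocab = teachers[domain]  # one specialized teacher per domain
    prefix, total = [], 0.0
    for _ in range(horizon):
        s_probs = softmax(student(prefix))
        t_probs = align_to_student_vocab(softmax(teacher(prefix)), teacher_vocab, student_vocab)
        total += reverse_kl(s_probs, t_probs)
        # On-policy: the next token comes from the student, not the teacher,
        # so the teacher grades states the student actually reaches.
        prefix.append(rng.choices(student_vocab, weights=s_probs, k=1)[0])
    return total / horizon


# Hypothetical toy models: each maps a token prefix to logits over its own vocabulary.
STUDENT_VOCAB = ["search", "read", "answer", "stop"]
TEACHER_VOCAB = ["answer", "search", "stop", "read", "browse"]  # different order, one extra token


def toy_student(prefix):
    return [1.0, 0.5, 0.2, 0.1 * len(prefix)]


def toy_teacher(prefix):
    return [0.3, 0.8, 0.2 * len(prefix), 0.6, 0.1]


teachers = {"web_research": (toy_teacher, TEACHER_VOCAB)}
loss = on_policy_distill_loss(toy_student, STUDENT_VOCAB, teachers, "web_research",
                              horizon=8, rng=random.Random(0))
print(f"mean reverse KL over the student's own rollout: {loss:.4f}")
```

In a real system the loss would be backpropagated into the student's weights, and alignment across tokenizers is harder than string matching because the same text can split into different token sequences. The sketch only fixes the shape of the loop.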
Put it in numbers. SEAL-0 is an agentic benchmark on which Kimi-K2.6 and DeepSeek-V4-pro post their scores at trillion-parameter scale. Agents-A1 posts SEAL-0 56.4 and IFBench 80.6 at just 35B parameters (arXiv 2606.30616). Matching that tier at roughly 1T versus 35B is about a 29× cut in parameters (1000 / 35 ≈ 28.6; illustrative, since not every trillion-class parameter count is public). The bet is that the capacity you do not spend on weights, you spend on horizon: trajectories that average ~45K tokens, long enough that the model practices holding a plan across dozens of steps instead of memorizing dozens more facts.
| Strategy | What you grow | Example | What it buys |
|---|---|---|---|
| Scale the parameters | model weights — knowledge baked in | Kimi-K2.6, DeepSeek-V4-pro (trillion-class) | broad recall; serving cost and memory grow with size |
| Scale the horizon | training-trajectory length | Agents-A1 (35B MoE, ~45K-token runs, arXiv 2606.30616) | practice at multi-step, coherent task execution |
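To make the parameter gap and the table's "serving cost and memory grow with size" note concrete, here is a back-of-the-envelope sketch. The 1T figure is the article's illustrative round number, not a published count, and the 2 bytes per parameter assumes 16-bit weights; the function names are made up for this example. It counts weight storage only and ignores the KV cache and activations.

```python
def param_ratio(large_params, small_params):
    """How many times more parameters the larger model has."""
    return large_params / small_params


def weight_memory_gb(params, bytes_per_param=2):
    """Weight storage in GB (1 GB = 1e9 bytes), assuming 16-bit weights by default."""
    return params * bytes_per_param / 1e9


TRILLION_CLASS = 1.0e12  # assumption: illustrative round figure for the trillion-class models
AGENTS_A1 = 35e9         # 35B total parameters, as reported for Agents-A1

print(f"parameter ratio: {param_ratio(TRILLION_CLASS, AGENTS_A1):.1f}x")  # -> parameter ratio: 28.6x
print(f"16-bit weights: {weight_memory_gb(AGENTS_A1):.0f} GB vs "
      f"{weight_memory_gb(TRILLION_CLASS):.0f} GB")  # -> 16-bit weights: 70 GB vs 2000 GB
```

Because both systems may be Mixture-of-Experts models, per-token compute depends on the active parameters rather than the total, so this comparison speaks to memory footprint more than to latency.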
What makes this more than a training trick is the claim that the two axes are partly interchangeable: past a point, a longer training horizon can stand in for raw parameters on agentic work. If it holds beyond these benchmarks, it reframes "make the agent smarter" from "buy a bigger model" to "let a smaller one practice on longer, complete tasks", which is a very different bill for anyone serving agents.
Related explainers
- OPID — On-policy skill distillation — the same on-policy distillation family that powers Agents-A1's training recipe
- SearchSwarm — Distilling delegation into the weights — another small agent (30B) matching far larger ones, here by distilling a multi-agent policy
- Effective Feedback Compute — a scaling-law cousin: what actually predicts agent success is feedback quality, not raw compute