What does scaling the reasoning horizon mean?

It means making an agent stronger by training it on longer, complete task runs rather than by adding parameters. Agents-A1 (arXiv 2606.30616) trains on long-horizon trajectories that average about 45K tokens — whole task runs from the first action to the final answer — so the model practices holding a plan across many tool-calling steps. The horizon is how far ahead the agent has to reason and act; lengthening the training horizon teaches multi-step coherence that raw model size does not.

How can a 35B agent match trillion-parameter models?

Agents-A1 is a 35B Mixture-of-Experts model that the authors report matches or beats trillion-parameter systems like Kimi-K2.6 and DeepSeek-V4-pro on agentic benchmarks (SEAL-0 56.4, IFBench 80.6). The argument is that agentic skill comes from practicing long, complete tasks, not from baking in more knowledge — so a small model trained on ~45K-token trajectories can rival a much larger one whose budget went into parameters. The numbers are the paper's own and apply to the benchmarks tested.

How does Agents-A1 relate to distillation?

Distillation is how the small model absorbs the skill. Agents-A1's recipe is three stages: domain-wide supervised fine-tuning, a specialized teacher model per domain, then multi-teacher on-policy distillation with vocabulary alignment. On-policy means the teachers correct the student on the student's own attempts rather than on the teachers' finished outputs, and vocabulary alignment lets the student learn from teachers that use different tokenizers.

Agents-A1 matches trillion-param agents at 35B — Scaling the horizon, not the parameters

TL;DR

What is it: The paper Agents-A1 (arXiv 2606.30616) reports a 35B-parameter Mixture-of-Experts agent that matches or beats trillion-parameter systems on agentic benchmarks — and credits the win to training on long-horizon task trajectories, not to a bigger model.
Why it’s needed: An agentic task is a long chain of tool calls where the model must stay coherent across dozens of steps; where you spend the training compute — on horizon versus on parameters — is the core lever for the Planning & Reflection and Agent Loop stages.
vs previous: The standard route to a stronger agent is to add parameters (the trillion-param models Kimi-K2.6 and DeepSeek-V4-pro). Agents-A1 instead keeps the model small and lengthens the training trajectory to ~45K tokens, so the budget buys more reasoning steps rather than more weights.

Jargon

Mixture-of-Experts (MoE): A model split into many small "expert" sub-networks; each token is routed to only a few of them, so the active parameters per token are far fewer than the total. Agents-A1 has 35B total parameters.
Horizon: How many steps of reasoning and tool use a task takes end to end. A long-horizon task is one the agent must work through over many loop ticks without losing the thread.
Long-horizon trajectory: One complete recorded task run — every step from the first action to the final answer. Agents-A1's training trajectories average ~45K tokens, long enough that the model practices holding a plan across dozens of steps.
On-policy distillation: The student learns from a teacher's corrections on the student's own attempts, not on the teacher's finished outputs. Training on its own mistakes keeps the lessons relevant to what the student actually does.
Vocabulary alignment: A fix that lets one student absorb signals from teachers that use different tokenizers. Without it, two models chop text into different token sets and their probabilities can't be compared directly.
SFT (supervised fine-tuning): Training a model on labeled input-output examples. Here it is the first stage — a broad, domain-wide pass — before the per-domain teachers and the distillation step.
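
To make the vocabulary alignment entry above concrete, here is a minimal sketch of one generic way to compare two models that tokenize the same text differently: group tokens into the smallest spans whose character boundaries both tokenizations share, then sum log-probabilities per span. This is an illustrative assumption, not the alignment method described in the Agents-A1 paper. The function names (`token_boundaries`, `align_spans`, `span_logprobs`) and the toy tokenizations are hypothetical.

```python
from itertools import accumulate

def token_boundaries(tokens):
    """Character offsets where each token ends, e.g. ['ab', 'c'] -> [2, 3]."""
    return list(accumulate(len(t) for t in tokens))

def align_spans(student_tokens, teacher_tokens):
    """Group both tokenizations into spans ending at shared character offsets.

    Returns a list of (student_token_indices, teacher_token_indices) pairs.
    Assumes both token lists concatenate to the same string.
    """
    if "".join(student_tokens) != "".join(teacher_tokens):
        raise ValueError("tokenizations must cover the same text")
    s_ends = token_boundaries(student_tokens)
    t_ends = token_boundaries(teacher_tokens)
    shared = sorted(set(s_ends) & set(t_ends))
    spans, s_i, t_i = [], 0, 0
    for end in shared:
        s_group, t_group = [], []
        # Consume every token that finishes at or before the shared boundary.
        while s_i < len(s_ends) and s_ends[s_i] <= end:
            s_group.append(s_i)
            s_i += 1
        while t_i < len(t_ends) and t_ends[t_i] <= end:
            t_group.append(t_i)
            t_i += 1
        spans.append((s_group, t_group))
    return spans

def span_logprobs(token_logprobs, groups):
    """Sum per-token log-probabilities inside each group of token indices."""
    return [sum(token_logprobs[i] for i in group) for group in groups]

# Toy example: the same text split two different ways (made-up tokenizations).
student_tokens = ["search", "(", "query", ")"]
teacher_tokens = ["sea", "rch(", "qu", "ery", ")"]
spans = align_spans(student_tokens, teacher_tokens)
print(spans)
```

Once both models are scored over the same spans, their span-level log-probabilities refer to the same pieces of text and can be compared, which per-token probabilities over different vocabularies cannot.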

The news. On June 29, 2026, the Agents-A1 paper (arXiv 2606.30616) reported that a 35B-parameter Mixture-of-Experts agent matches or beats trillion-parameter systems like Kimi-K2.6 and DeepSeek-V4-pro on several agentic benchmarks (SEAL-0 56.4, IFBench 80.6). The headline claim: the gains come from extending the agent's reasoning horizon, by training on long-horizon trajectories averaging ~45K tokens, rather than from growing the parameter count.

A medical resident does not become a good doctor by memorizing more textbooks. They become good by working full shifts — start to finish, patient after patient, decisions stacked on decisions, staying oriented when the night drags on. Agents-A1 makes the same bet for AI agents: you build a better long-task agent not by stuffing more knowledge into its weights, but by training it through long, complete task runs.

The usual way to make an agent stronger is to add parameters — the trillion-parameter route taken by Kimi-K2.6 and DeepSeek-V4-pro. More weights mean more knowledge baked in, like a resident who has read more textbooks. But an agentic task is not a single lookup; it is a long chain of decisions where the model has to stay coherent and not lose the thread. Raw textbook recall does not teach that by itself; the skill mostly shows up on a real shift.

Agents-A1 is a 35B Mixture-of-Experts model, small by frontier standards. Its edge comes from where the training compute goes: ~45K-token long-horizon trajectories — complete task runs from the first action to the final answer. Trained on whole shifts instead of quick consults, the model learns to spend its reasoning budget on more steps rather than on more weights, and to keep deciding when to push on and when to stop over a long task.
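
As a rough picture of what one such training example might look like as data, the sketch below models a trajectory as an ordered list of steps plus a final answer, and averages token counts over complete runs only. The `Step` and `Trajectory` classes, their field names, and the 30-step, 1,500-tokens-per-step figures are hypothetical illustrations. The paper's actual data format is not described in this explainer.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    action: str       # e.g. the tool call the agent issued
    observation: str  # what the tool returned
    tokens: int       # tokens consumed by this step

@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    final_answer: Optional[str] = None

    @property
    def total_tokens(self) -> int:
        return sum(step.tokens for step in self.steps)

    @property
    def is_complete(self) -> bool:
        # A complete run goes from the first action to a final answer.
        return bool(self.steps) and self.final_answer is not None

def mean_tokens(trajectories: List[Trajectory]) -> float:
    """Average length in tokens over complete trajectories only."""
    complete = [t for t in trajectories if t.is_complete]
    if not complete:
        return 0.0
    return sum(t.total_tokens for t in complete) / len(complete)

# Made-up numbers: 30 steps of 1,500 tokens gives a 45,000-token run.
full_run = Trajectory(
    steps=[Step("search", "results", 1500) for _ in range(30)],
    final_answer="done",
)
cut_short = Trajectory(steps=[Step("search", "results", 900)])  # no final answer
print(mean_tokens([full_run, cut_short]))
```

Filtering on `is_complete` mirrors the point in the text: the training unit is the whole run, not a fragment of one.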

How do you train a small model to absorb that much skill? The recipe is three stages: a broad domain-wide SFT pass, then a specialized teacher model per domain, then multi-teacher on-policy distillation with vocabulary alignment. On-policy is the load-bearing word: the teachers correct the student's calls on the student's own rollouts — the attendings critiquing your decisions on your own patients — not by handing over finished charts. Vocabulary alignment is the plumbing that lets the student learn from teachers with different tokenizers at all.
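
The on-policy idea can be sketched in a few lines. In the toy below, the student generates its own rollout, and at every prefix the student actually visits, a per-domain teacher's next-token distribution is compared with the student's. Several things here are assumptions for illustration rather than details from the paper: the use of reverse KL as the per-step loss, a vocabulary already shared between student and teacher, the constant toy policies, and all names (`on_policy_distill_loss`, `teachers`, `student_policy`).

```python
import math
import random

def kl(p, q):
    """KL(p || q) over a shared vocabulary. Assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sample(dist, rng):
    """Draw one token id from a categorical distribution."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def on_policy_distill_loss(student, teacher, start, length, rng):
    """Average per-step KL(student || teacher) along the student's own rollout.

    student and teacher map a prefix (tuple of token ids) to a next-token
    distribution. The rollout is extended with the student's samples, so the
    teacher is only ever queried on prefixes the student produced.
    """
    prefix = tuple(start)
    losses = []
    for _ in range(length):
        p_student = student(prefix)
        p_teacher = teacher(prefix)
        losses.append(kl(p_student, p_teacher))
        prefix += (sample(p_student, rng),)
    return sum(losses) / len(losses), prefix

# Toy policies over a 3-token vocabulary (constant, for illustration only).
def student_policy(prefix):
    return [0.7, 0.2, 0.1]

# Multi-teacher setup: one hypothetical specialist per domain.
teachers = {
    "search": lambda prefix: [0.2, 0.7, 0.1],
    "coding": lambda prefix: [0.7, 0.2, 0.1],
}

rng = random.Random(0)
for domain, teacher in teachers.items():
    loss, rollout = on_policy_distill_loss(student_policy, teacher, (0,), 5, rng)
    print(domain, round(loss, 4), len(rollout))
```

In a real training loop the loss would be differentiated with respect to the student's parameters. The sketch only shows where the training signal comes from: the student's own attempts, scored by the teacher for that domain.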

Put it in numbers. SEAL-0 is an agentic benchmark, and Kimi-K2.6 and DeepSeek-V4-pro reach their scores on it with trillion-class parameter counts. Agents-A1 posts SEAL-0 56.4 and IFBench 80.6 at just 35B parameters (arXiv 2606.30616). Taking roughly 1T as the comparison point, 1T versus 35B is about a 29× cut in parameters (illustrative, since not every trillion-param count is public). The bet is that the capacity you do not spend on weights, you spend on horizon: trajectories that average ~45K tokens, long enough that the model practices holding a plan across dozens of steps instead of memorizing dozens more facts.
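
The parameter comparison above is a one-line division, shown here so the rounding is visible. The 1T figure is a placeholder for "trillion-class", as the text notes, and `parameter_ratio` is a hypothetical helper name.

```python
def parameter_ratio(large_params: float, small_params: float) -> float:
    """How many times more parameters the larger model has."""
    return large_params / small_params

# Assumed round figures: 1 trillion versus 35 billion total parameters.
ratio = parameter_ratio(1e12, 35e9)
print(f"{ratio:.1f}x")  # 28.6x
```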

Strategy	What you grow	Example	What it buys
Scale the parameters	model weights — knowledge baked in	Kimi-K2.6, DeepSeek-V4-pro (trillion-class)	broad recall; serving cost and memory grow with size
Scale the horizon	training-trajectory length	Agents-A1 (35B MoE, ~45K-token runs, arXiv 2606.30616)	practice at multi-step, coherent task execution

What makes this more than a training trick is the claim that the two axes are partly interchangeable: past a point, a longer training horizon can stand in for raw parameters on agentic work. If it holds beyond these benchmarks, it reframes "make the agent smarter" from "buy a bigger model" to "let a smaller one practice longer, complete tasks", which is a very different bill for anyone serving agents.


Related explainers

OPID — On-policy skill distillation — the same on-policy distillation family that powers Agents-A1's training recipe
SearchSwarm — Distilling delegation into the weights — another small agent (30B) matching far larger ones, here by distilling a multi-agent policy
Effective Feedback Compute — a scaling-law cousin: what actually predicts agent success is feedback quality, not raw compute


