What is task-source diversity in agent SFT?

Task-source diversity is how varied the origins of an agent's training tasks are: coding benchmarks, web tasks, terminal environments, synthetic pipelines, and so on. In supervised fine-tuning (SFT), an agent learns from recorded trajectories of solved tasks. OpenThoughts-Agent's 100+ ablations indicate that mixing many sources, rather than scaling up one source, is the main driver of broad agent capability. Its open 100,000-example recipe, built on this principle, reaches a 44.8% average across seven agentic benchmarks.

Why does diversity matter more than data volume?

Past a point, more trajectories from the same source teach the agent nothing new — it over-fits to one shape of problem, like a hire who only ever worked one desk. Adding new kinds of source keeps exposing the agent to unfamiliar problems, which is what generalization requires. OpenThoughts-Agent reports that its diverse recipe beats alternative open datasets at every training-set size, so the win comes from the composition of the data, not its quantity.

What did OpenThoughts-Agent actually release?

It is a fully open recipe: 100,000 curated agent-SFT examples, the complete data-curation pipeline that built them, over 100 controlled ablations on task sources and diversity, and fine-tuned model checkpoints. The headline result is 44.8% average accuracy across seven agentic benchmarks, a 3.9-point gain over Nemotron-Terminal-32B (40.9%). Because the dataset and pipeline are public, the diversity-beats-volume finding is reproducible.

OpenThoughts-Agent open-sources a 100K-example agent training recipe — Task-source diversity in agent SFT

TL;DR

What is it: The OpenThoughts-Agent release (arXiv 2606.24855) is a fully open 100,000-example recipe for agent supervised fine-tuning (SFT) — the dataset, the curation pipeline, 100+ ablations, and fine-tuned checkpoints. The idea it makes concrete is task-source diversity in agent SFT.
Why it’s needed: It is a reproducible reference for how you make an agent capable — and its headline finding is that where the training tasks come from (their variety of sources), not just how many you have, is the main lever on broad capability.
vs previous: The usual instinct is to scale up one task source — collect more of the same kind of trajectory; OpenThoughts-Agent shows that a diverse mix of sources beats a bigger single source at every data size, reaching 44.8% where the strongest prior open agent (Nemotron-Terminal-32B) sat at 40.9%.

Jargon

SFT: Supervised fine-tuning — continuing to train a base model on worked examples of the behavior you want (here, agent trajectories that solve tasks). It is how a general model is turned into a capable agent.
Agent trajectory: One full recorded run of an agent solving a task — its steps, tool calls, and results. These trajectories are the training examples; OpenThoughts-Agent curated 100,000 of them.
Task source: Where a training task comes from — a coding benchmark, a web task generator, a terminal environment, a synthetic pipeline. The paper's core claim is that mixing many sources matters more than piling up one.
Task-source diversity: How varied the origins of the training tasks are. High diversity exposes the agent to many shapes of problem; low diversity over-fits it to one. This is the lever the 100+ ablations isolate.
Ablation: A controlled experiment that removes or varies one ingredient to measure its effect. OpenThoughts-Agent ran 100+ ablations on task sources and diversity to find what actually drives capability.
Agentic benchmark: A test that scores an agent on realistic multi-step tasks (not single questions) — using tools, a terminal, or a browser. The recipe is measured across seven of them.
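
To make the vocabulary above concrete, here is a minimal sketch of how a trajectory tagged with its task source might be represented. The class names, field names (`task_source`, `steps`, `solved`), and toy values are illustrative assumptions, not the schema of the released dataset.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    """One agent action and what the environment returned (hypothetical fields)."""
    tool_call: str
    observation: str


@dataclass
class Trajectory:
    """One recorded run of an agent on one task (hypothetical schema)."""
    task_id: str
    task_source: str  # e.g. "terminal", "web", "coding"; labels are made up
    steps: list[Step] = field(default_factory=list)
    solved: bool = False


def source_counts(trajectories: list[Trajectory]) -> Counter:
    """Count how many trajectories each task source contributes."""
    return Counter(t.task_source for t in trajectories)


# Toy records for illustration only; not taken from the dataset.
examples = [
    Trajectory("t1", "terminal", [Step("ls", "README.md")], solved=True),
    Trajectory("t2", "terminal", [Step("pwd", "/home/agent")], solved=True),
    Trajectory("t3", "web", [Step("click('#submit')", "200 OK")], solved=True),
]
counts = source_counts(examples)
print(counts)
```

Keeping the source label on every record is what makes a question like "how much of the data comes from one place?" answerable with a single count.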

The news. On June 24, 2026, the OpenThoughts-Agent team released a fully open recipe for training capable agents: 100,000 curated SFT examples, the complete data-curation pipeline, 100+ ablations, and fine-tuned models. Built by systematically varying task sources and diversity, it reaches 44.8% average accuracy across seven agentic benchmarks, beating Nemotron-Terminal-32B (40.9%) by 3.9 points, and reportedly scales better than alternative open datasets at every training-set size.

Imagine onboarding a new hire and wanting them to handle anything that lands on their desk. The temptation is to drill them on the one task you have the most paperwork for — but a hire who only ever worked one desk falls apart the moment something unfamiliar arrives. The hire who was rotated through many departments — billing, support, logistics, returns — has seen enough different shapes of problem to improvise on a new one. That rotation is the whole idea behind OpenThoughts-Agent: an agent is only as broadly capable as the variety of tasks it was trained on.

Concretely, training an agent by SFT means showing it thousands of worked trajectories — recorded runs of an agent using tools to finish a job. The easy way to get a bigger dataset is to generate more trajectories from the same source. OpenThoughts-Agent's 100+ ablations show that scaling that single axis hits diminishing returns: past a point, more of the same source barely helps, while adding new kinds of source keeps lifting capability. The contribution is not a new model architecture — it is a data-curation methodology that identifies which task sources, in which mix, produce an agent that generalizes.
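
The contrast between scaling one source and mixing sources at a fixed example budget can be sketched as follows. Everything here is an assumption for illustration: the source names, the toy pools, the even-split rule, and the use of Shannon entropy as a diversity score. The paper's actual mix and its way of quantifying diversity are not described in this explainer.

```python
import math
import random
from collections import Counter


def sample_single_source(pools, source, budget, rng):
    """Spend the whole example budget on one source."""
    return [(source, ex) for ex in rng.sample(pools[source], budget)]


def sample_even_mix(pools, budget, rng):
    """Split the budget as evenly as possible across all sources."""
    names = sorted(pools)
    base, extra = divmod(budget, len(names))
    mix = []
    for i, name in enumerate(names):
        take = base + (1 if i < extra else 0)
        mix.extend((name, ex) for ex in rng.sample(pools[name], take))
    return mix


def source_entropy(samples):
    """Shannon entropy in bits of the source distribution (0.0 = one source)."""
    counts = Counter(src for src, _ in samples)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical pools of task identifiers, ten per source.
rng = random.Random(0)
pools = {
    name: [f"{name}-{i}" for i in range(10)]
    for name in ["coding", "synthetic", "terminal", "web"]
}
single = sample_single_source(pools, "terminal", 8, rng)
mixed = sample_even_mix(pools, 8, rng)
print(source_entropy(single), source_entropy(mixed))
```

Both samples use the same budget of eight examples; only the composition differs, which is the variable the ablations are described as isolating.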

Walk the comparison with the paper's own numbers. The OpenThoughts-Agent recipe scores a 44.8% average across seven agentic benchmarks, against 40.9% for Nemotron-Terminal-32B: a gap of 3.9 percentage points that comes from changing what the training data is made of, not the model. And crucially, the two scaling curves reportedly never cross: the diverse recipe stays ahead at every training-set size, so the advantage is not "they just used more data." One practical takeaway: when an agent underperforms, the first question may not be "do we have enough trajectories" but "are they varied enough."
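
The arithmetic behind the comparison is small enough to check directly. The sketch below computes the percentage-point gap and the relative gain from the two averages quoted above, and adds a helper for testing an "ahead at every training-set size" claim. The helper and its curve format (a mapping from training-set size to average score) are hypothetical, and no per-size scores are filled in because this explainer does not give them.

```python
def percentage_point_gap(new, baseline):
    """Absolute difference between two percentages, in percentage points."""
    return round(new - baseline, 1)


def relative_gain_percent(new, baseline):
    """Improvement expressed as a percentage of the baseline score."""
    return 100.0 * (new - baseline) / baseline


def wins_at_every_size(curve_a, curve_b):
    """True if curve_a is strictly higher than curve_b at every shared size.

    Each curve maps a training-set size to an average score.
    """
    shared = curve_a.keys() & curve_b.keys()
    return bool(shared) and all(curve_a[n] > curve_b[n] for n in shared)


gap = percentage_point_gap(44.8, 40.9)
rel = relative_gain_percent(44.8, 40.9)
print(f"{gap:+.1f} pp, {rel:.1f}% relative")  # prints "+3.9 pp, 9.5% relative"
```

The distinction matters when reading the headline: 3.9 is a difference in percentage points, while the relative improvement over the 40.9% baseline is roughly 9.5%.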

| Training data | What you scale | Result on the 7 benchmarks |
| --- | --- | --- |
| Single dominant source | More of the same trajectories | Plateaus; over-fits to one task shape |
| Nemotron-Terminal-32B (prior open baseline) | A strong but narrower mix | 40.9% avg [paper] |
| OpenThoughts-Agent (diverse sources) | Variety of task sources | 44.8% avg; +3.9 pp, and ahead at every data size [paper] |

The honest caveats. "Diversity" here is measured by the sources the authors had access to, so the ceiling depends on which environments you can sample at all — you cannot rotate a hire through a department that does not exist. And SFT is only one stage; many strong agents add reinforcement learning on top, which this recipe does not replace. But the open release is the real gift: the dataset, the pipeline, and the ablations are public, so the "diversity beats volume" claim is one anyone can reproduce and push on.


Related explainers

OpenThoughts-Agent's sibling benchmark — NatureBench — on whether agentic benchmark scores actually predict real capability, the thing this recipe is optimizing.
EnvFactory — synthesizing tool environments — one way to manufacture the diverse task sources this recipe shows are the bottleneck.
Agent environment survey — symbolic vs neural synthesis — the broader map of where agent training tasks come from.

