The news. On June 24, 2026, the OpenThoughts-Agent team released a fully open recipe for training capable agents: 100,000 curated SFT examples, the complete data-curation pipeline, 100+ ablations, and fine-tuned models. Built by systematically varying task sources and diversity, it reaches 44.8% average accuracy across seven agentic benchmarks, beating Nemotron-Terminal-32B (40.9%) by 3.9 points, and reportedly scales better than alternative open datasets at every training-set size.
Imagine onboarding a new hire and wanting them to handle anything that lands on their desk. The temptation is to drill them on the one task you have the most paperwork for — but a hire who only ever worked one desk falls apart the moment something unfamiliar arrives. The hire who was rotated through many departments — billing, support, logistics, returns — has seen enough different shapes of problem to improvise on a new one. That rotation is the whole idea behind OpenThoughts-Agent: an agent is only as broadly capable as the variety of tasks it was trained on.
Concretely, training an agent by SFT means showing it thousands of worked trajectories — recorded runs of an agent using tools to finish a job. The easy way to get a bigger dataset is to generate more trajectories from the same source. OpenThoughts-Agent's 100+ ablations show that scaling that single axis hits diminishing returns: past a point, more of the same source barely helps, while adding new kinds of source keeps lifting capability. The contribution is not a new model architecture — it is a data-curation methodology that identifies which task sources, in which mix, produce an agent that generalizes.
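To make the "more sources versus more of one source" distinction concrete, here is a minimal sketch of filling a fixed trajectory budget evenly across sources instead of draining the largest pool first. It is not the OpenThoughts-Agent pipeline: the `build_mix` helper, the round-robin policy, and the source names (borrowed from the onboarding analogy) are all assumptions made for illustration.

```python
import random
from collections import Counter


def build_mix(pools, budget, seed=0):
    """Draw up to `budget` trajectories, spread as evenly as possible across sources.

    `pools` maps a (hypothetical) source name to a list of trajectories.
    """
    rng = random.Random(seed)
    # Shuffle a copy of each pool so the pick within a source is random.
    shuffled = {name: rng.sample(items, len(items)) for name, items in pools.items()}
    mix = []
    # Round-robin: take one trajectory per source per pass until the budget
    # is met or every pool is exhausted.
    while len(mix) < budget and any(shuffled.values()):
        for name in sorted(shuffled):
            if shuffled[name] and len(mix) < budget:
                mix.append((name, shuffled[name].pop()))
    return mix


# Hypothetical pools: one dominant source and two smaller ones.
pools = {
    "billing": list(range(50)),
    "support": list(range(5)),
    "logistics": list(range(20)),
}
mix = build_mix(pools, budget=30)
counts = Counter(name for name, _ in mix)
print(counts)
```

With a budget of 30, the small `support` pool is used up entirely and the remainder is split between the other two, so no single source dominates the mix. Which mix actually generalizes best is an empirical question that the paper's ablations address; this sketch only shows the mechanics of spending a budget on variety.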
Walk through the comparison using the paper's own numbers. The OpenThoughts-Agent recipe averages 44.8% across seven agentic benchmarks, against 40.9% for Nemotron-Terminal-32B: a gap of 3.9 percentage points that comes from changing what the training data is made of, not the model. Crucially, the scaling curves reportedly never cross: the diverse recipe is ahead at every training-set size, so the advantage is not "they just used more data." The practical takeaway: when an agent underperforms, the first question may not be "do we have enough trajectories" but "are they varied enough."
| Training data | What you scale | Result on the 7 benchmarks |
|---|---|---|
| Single dominant source | more of the same trajectories | Plateaus — over-fits to one task shape |
| Nemotron-Terminal-32B (prior open baseline) | a strong but narrower mix | 40.9% avg [paper] |
| OpenThoughts-Agent (diverse sources) | variety of task sources | 44.8% avg — +3.9pp, and ahead at every data size [paper] |
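The gap in the table is stated in percentage points, which is easy to confuse with a relative improvement. A small worked calculation, using only the two averages quoted above, shows both; the relative figure is derived here and is not a number reported by the paper.

```python
ours = 44.8      # OpenThoughts-Agent average across the seven benchmarks (as quoted)
baseline = 40.9  # Nemotron-Terminal-32B average (as quoted)

# Absolute gap in percentage points (rounded to avoid float noise).
gap_pp = round(ours - baseline, 1)

# Relative improvement over the baseline, in percent (derived, not from the paper).
relative_pct = round(100 * (ours - baseline) / baseline, 1)

print(f"gap: {gap_pp} pp, relative improvement: {relative_pct}%")
```

Under these numbers the 3.9-point gap corresponds to roughly a 9.5% relative improvement over the baseline average.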
The honest caveats. "Diversity" here is measured by the sources the authors had access to, so the ceiling depends on which environments you can sample at all — you cannot rotate a hire through a department that does not exist. And SFT is only one stage; many strong agents add reinforcement learning on top, which this recipe does not replace. But the open release is the real gift: the dataset, the pipeline, and the ablations are public, so the "diversity beats volume" claim is one anyone can reproduce and push on.
Related explainers
- OpenThoughts-Agent's sibling benchmark — NatureBench — on whether agentic benchmark scores actually predict real capability, the thing this recipe is optimizing.
- EnvFactory — synthesizing tool environments — one way to manufacture the diverse task sources this recipe shows are the bottleneck.
- Agent environment survey — symbolic vs neural synthesis — the broader map of where agent training tasks come from.