The news. On June 24, 2026, the OpenThoughts-Agent team released a fully open recipe for training capable agents: 100,000 curated supervised fine-tuning (SFT) examples, the complete data-curation pipeline, 100+ ablations, and fine-tuned models. Built by systematically varying task sources and diversity, it reaches 44.8% average accuracy across seven agentic benchmarks, beating Nemotron-Terminal-32B (40.9%) by 3.9 points, and reportedly scales better than alternative open datasets at every training-set size.

Imagine onboarding a new hire and wanting them to handle anything that lands on their desk. The temptation is to drill them on the one task you have the most paperwork for — but a hire who only ever worked one desk falls apart the moment something unfamiliar arrives. The hire who was rotated through many departments — billing, support, logistics, returns — has seen enough different shapes of problem to improvise on a new one. That rotation is the whole idea behind OpenThoughts-Agent: an agent is only as broadly capable as the variety of tasks it was trained on.

Concretely, training an agent by SFT means showing it thousands of worked trajectories — recorded runs of an agent using tools to finish a job. The easy way to get a bigger dataset is to generate more trajectories from the same source. OpenThoughts-Agent's 100+ ablations show that scaling that single axis hits diminishing returns: past a point, more of the same source barely helps, while adding new kinds of source keeps lifting capability. The contribution is not a new model architecture — it is a data-curation methodology that identifies which task sources, in which mix, produce an agent that generalizes.
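To make "more sources, not more of one source" concrete, here is a minimal sketch of filling a fixed trajectory budget round-robin across sources. It is an illustration only, not the OpenThoughts-Agent pipeline: the function name `sample_balanced`, the `source` field, and the toy source names are all assumptions made up for this example.

```python
import random
from collections import Counter, defaultdict


def sample_balanced(trajectories, budget, seed=0):
    """Pick up to `budget` trajectories, cycling round-robin over sources.

    Hypothetical helper: each trajectory is a dict with a "source" key.
    """
    rng = random.Random(seed)

    # Group trajectories by the source they came from.
    by_source = defaultdict(list)
    for traj in trajectories:
        by_source[traj["source"]].append(traj)
    for pool in by_source.values():
        rng.shuffle(pool)

    # Take one trajectory per source per round until the budget is spent
    # or every source is exhausted.
    picked = []
    sources = sorted(by_source)
    while len(picked) < budget and any(by_source[s] for s in sources):
        for s in sources:
            if by_source[s] and len(picked) < budget:
                picked.append(by_source[s].pop())
    return picked


# Toy pool: one dominant source and three small ones (made-up names).
pool = (
    [{"source": "terminal", "id": i} for i in range(1000)]
    + [{"source": "web", "id": i} for i in range(50)]
    + [{"source": "sql", "id": i} for i in range(50)]
    + [{"source": "files", "id": i} for i in range(50)]
)

mix = sample_balanced(pool, budget=120)
counts = Counter(t["source"] for t in mix)
print(counts)  # each of the four sources contributes 30 trajectories
```

A uniform random draw from the same pool would be dominated by the `terminal` source, since it makes up about 87% of the pool. The round-robin pass keeps the small sources represented. Which mix actually generalizes best is an empirical question, and that is what the ablations are for.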

Compare the paper's own numbers. The OpenThoughts-Agent recipe scores 44.8% average across seven agentic benchmarks, against 40.9% for Nemotron-Terminal-32B: a gap of 3.9 percentage points from changing what the training data is made of, not the model. Crucially, the scaling curves reportedly never cross: the diverse recipe is ahead at every training-set size, so the advantage is not "they just used more data." The practical takeaway: when an agent underperforms, the first question may not be "do we have enough trajectories" but "are they varied enough."
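A gap in percentage points is not the same as a relative improvement, and the two are easy to confuse. The short sketch below computes both from the two averages quoted above. The relative figure is derived here for illustration and is not a number reported in the paper; the helper name `gap` is hypothetical.

```python
def gap(new_pct, base_pct):
    """Return (percentage-point gap, relative gain in %), rounded to 1 decimal."""
    pp = round(new_pct - base_pct, 1)
    rel = round(100 * (new_pct - base_pct) / base_pct, 1)
    return pp, rel


# Averages quoted in the article: OpenThoughts-Agent vs Nemotron-Terminal-32B.
pp, rel = gap(44.8, 40.9)
print(f"+{pp} percentage points, about {rel}% relative")
# prints "+3.9 percentage points, about 9.5% relative"
```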

| Training data | What you scale | Result on the 7 benchmarks |
| --- | --- | --- |
| Single dominant source | More of the same trajectories | Plateaus; overfits to one task shape |
| Nemotron-Terminal-32B (prior open baseline) | A strong but narrower mix | 40.9% avg [paper] |
| OpenThoughts-Agent (diverse sources) | Variety of task sources | 44.8% avg (+3.9 pp), ahead at every data size [paper] |

The honest caveats. "Diversity" here is measured by the sources the authors had access to, so the ceiling depends on which environments you can sample at all — you cannot rotate a hire through a department that does not exist. And SFT is only one stage; many strong agents add reinforcement learning on top, which this recipe does not replace. But the open release is the real gift: the dataset, the pipeline, and the ablations are public, so the "diversity beats volume" claim is one anyone can reproduce and push on.
