The news. On June 23, 2026, researchers released NatureBench, a benchmark of 90 tasks sourced from peer-reviewed Nature-family papers. Each task asks a coding agent not to reproduce a result but to beat the paper's published state of the art, judged by an effect-size threshold (g > 0.1). The strongest model surpasses SOTA on only 17.8% of tasks, exposing a large gap between reproduction and genuine discovery. The tasks are built by NatureGym, an automated pipeline that turns each paper into a standardized, per-task containerized environment.
Picture a cook-off with a twist. The judges are not asking you to recreate the famous chef's signature dish, which they already have scored and on the board. They are asking whether what you plate is measurably better than the record. Plenty of cooks can follow the recipe and reproduce the dish faithfully. Far fewer can change something and have the judges agree the result actually improved. NatureBench grades coding agents the same way: every task comes with a published best result, and you only score if you beat it.
This is a sharper question than it sounds, and it is the whole point. Most agent evaluations reward reproducing a known answer, so an agent can look very capable while doing nothing a human had not already done. NatureBench deliberately raises the bar from "can the agent re-derive this result?" to "can the agent advance it?" To make "advance" rigorous, a task counts as beaten only when the improvement clears an effect-size threshold of g > 0.1, so a tiny, noisy gain does not count as a discovery.
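To see why an effect-size bar filters out noisy gains, here is a minimal sketch. The threshold is given only as g > 0.1, so the sketch assumes that g is Hedges' g (a standardized mean difference with a small-sample correction), that both the agent and the published method have scores from several repeated runs, and that higher scores are better. The function names and sample numbers are illustrative and do not come from NatureBench.

```python
import math
from statistics import mean, variance


def hedges_g(agent_scores, baseline_scores):
    """Standardized mean difference (agent minus baseline) with Hedges' correction.

    Assumption: each list holds scores from repeated runs (at least two each),
    and the runs are not all identical (pooled variance must be non-zero).
    """
    n1, n2 = len(agent_scores), len(baseline_scores)
    # Pooled sample variance across the two groups.
    pooled_var = (
        (n1 - 1) * variance(agent_scores) + (n2 - 1) * variance(baseline_scores)
    ) / (n1 + n2 - 2)
    cohens_d = (mean(agent_scores) - mean(baseline_scores)) / math.sqrt(pooled_var)
    # Common approximation of the small-sample bias correction factor.
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d * correction


def beats_sota(agent_scores, baseline_scores, threshold=0.1):
    """True only if the improvement clears the effect-size bar (g > threshold)."""
    return hedges_g(agent_scores, baseline_scores) > threshold


# Hypothetical scores for illustration only (higher is better).
sota_runs = [0.80, 0.82, 0.81, 0.83, 0.79]
agent_runs = [0.82, 0.84, 0.83, 0.85, 0.81]

print(beats_sota(agent_runs, sota_runs))  # clear, consistent gain
print(beats_sota(sota_runs, sota_runs))   # matching SOTA is not beating it
```

Because the gain is divided by the run-to-run spread, an agent that merely matches the published scores gets g = 0 and does not pass, however faithful the reproduction.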
Making 90 paper-derived tasks into a fair test is its own engineering problem, which is what NatureGym solves: an automated pipeline that packages each source paper into a standardized, per-task containerized environment. Because every agent runs the same isolated setup, the scores are comparable and repeatable — a golden-case suite for scientific improvement rather than for chat.
Sit with the headline number. Across the 90 tasks, the strongest coding agent beat the published SOTA on only about 16 of them — 17.8%. On more than four-fifths of real, peer-reviewed problems, the best agent could not produce a genuine improvement over what humans had already published. That gap is the result that matters: it suggests today's agents are far readier to reproduce science than to advance it, and a benchmark that scores the second thing is what makes the gap visible.
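The headline arithmetic can be checked in a few lines. This sketch uses only the two reported figures (90 tasks, 17.8%); the count of 16 is derived here by rounding, and the variable names are illustrative.

```python
TOTAL_TASKS = 90        # tasks in the benchmark, as reported
REPORTED_RATE = 0.178   # share of tasks where the strongest model beat SOTA

# Recover the whole number of tasks implied by the reported percentage.
beaten = round(TOTAL_TASKS * REPORTED_RATE)
unbeaten = TOTAL_TASKS - beaten

# Recompute both shares from the whole-number counts.
beaten_pct = round(100 * beaten / TOTAL_TASKS, 1)
unbeaten_pct = round(100 * unbeaten / TOTAL_TASKS, 1)

print(f"beaten: {beaten}/{TOTAL_TASKS} ({beaten_pct}%)")
print(f"not beaten: {unbeaten}/{TOTAL_TASKS} ({unbeaten_pct}%)")
```

The derived count reproduces the reported 17.8% when divided back by 90, and the remaining share sits just above four-fifths.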
| What the benchmark rewards | The bar to pass | What a high score proves |
|---|---|---|
| Reproduction (most agent evals) | match a known result | The agent can re-derive existing work — useful, but invents nothing new |
| Discovery (NatureBench) | beat published SOTA, effect size g > 0.1 [paper] | The agent advanced the state of the art — managed on only 17.8% of tasks |
Related explainers
- Agent leaderboards — predictive validity — whether a benchmark score predicts real capability, the deeper question NatureBench's "beat SOTA" bar is built around.
- WeaveBench — trajectory-aware grading — another move to grade how an agent got there, not just the final number.
- Pantheon bench — HITL vs autonomous coding — a different lens on how good coding agents really are when measured carefully.