The news. On June 23, 2026, researchers released NatureBench, a benchmark of 90 tasks sourced from peer-reviewed Nature-family papers. Each task asks a coding agent not to reproduce a result but to beat the paper's published state of the art, judged by an effect-size threshold (g > 0.1). The strongest model surpasses SOTA on only 17.8% of tasks, exposing a large gap between reproduction and genuine discovery. The tasks are built by NatureGym, an automated pipeline that turns each paper into a standardized, per-task containerized environment.

Picture a cook-off with a twist. The judges are not asking you to recreate the famous chef's signature dish; they already have it scored and on the board. They are asking whether what you plate is measurably better than the record. Plenty of cooks can follow the recipe and reproduce the dish faithfully. Far fewer can change something and have the judges agree the result actually improved. NatureBench grades coding agents the same way: every task comes with a published best result, and you only score if you beat it.

This is a sharper question than it sounds, and it is the whole point. Most agent evaluations reward reproducing a known answer — and an agent can look very capable while doing nothing a human had not already done. NatureBench deliberately moves the goalposts from "can the agent re-derive this result?" to "can the agent advance it?" — and to make "advance" rigorous, a task counts as beaten only when the improvement clears an effect-size threshold of g > 0.1, so a tiny, noisy gain does not count as a discovery.
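To make the threshold concrete, here is a minimal sketch of how an effect-size gate like g > 0.1 could be checked. It rests on assumptions the article does not confirm: that g means Hedges' g (a bias-corrected standardized mean difference), that both the agent and the published baseline have several repeated scores, and that higher scores are better. The function names `hedges_g` and `beats_sota` are hypothetical, and the benchmark's actual procedure may differ.

```python
import math
from statistics import mean, variance


def hedges_g(agent_scores, baseline_scores):
    """Bias-corrected standardized mean difference (Hedges' g).

    Assumption: higher scores are better and each list holds >= 2 runs.
    """
    n1, n2 = len(agent_scores), len(baseline_scores)
    if n1 < 2 or n2 < 2:
        raise ValueError("need at least two runs per group")
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(agent_scores)
                  + (n2 - 1) * variance(baseline_scores)) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("no variance across runs; effect size is undefined")
    cohens_d = (mean(agent_scores) - mean(baseline_scores)) / math.sqrt(pooled_var)
    # Small-sample correction factor that turns Cohen's d into Hedges' g.
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d * correction


def beats_sota(agent_scores, baseline_scores, threshold=0.1):
    """True only when the improvement clears the effect-size threshold."""
    return hedges_g(agent_scores, baseline_scores) > threshold


# Illustrative numbers only, not taken from the benchmark.
baseline = [0.78, 0.80, 0.79, 0.81, 0.77]
clear_gain = [0.80, 0.82, 0.81, 0.83, 0.79]
tiny_gain = [score + 0.001 for score in baseline]

print(beats_sota(clear_gain, baseline))
print(beats_sota(tiny_gain, baseline))
```

With these made-up scores, the clear gain passes and the tiny gain does not, even though both raise the mean. That is the behavior the threshold is meant to enforce: the improvement has to be large relative to run-to-run noise.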

Making 90 paper-derived tasks into a fair test is its own engineering problem, which is what NatureGym solves: an automated pipeline that packages each source paper into a standardized, per-task containerized environment. Because every agent runs the same isolated setup, the scores are comparable and repeatable — a golden-case suite for scientific improvement rather than for chat.

Sit with the headline number. Across the 90 tasks, the strongest coding agent beat the published SOTA on only 16 of them (17.8%). On more than four-fifths of real, peer-reviewed problems, the best agent could not produce a genuine improvement over what humans had already published. That gap is the result that matters: it suggests today's agents are far readier to reproduce science than to advance it, and a benchmark that scores the second thing is what makes the gap visible.
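The arithmetic behind the headline is short enough to check directly. This sketch assumes the count of beaten tasks is 16, which is the whole number of tasks out of 90 that corresponds to the reported 17.8%.

```python
TOTAL_TASKS = 90   # tasks in the benchmark, as reported
BEATEN_TASKS = 16  # assumed count: the whole number matching 17.8% of 90

success_rate = BEATEN_TASKS / TOTAL_TASKS
failure_rate = 1 - success_rate

print(f"beat SOTA: {success_rate:.1%}")  # 17.8%
print(f"did not:   {failure_rate:.1%}")  # 82.2%
```

The remaining 82.2% is the "more than four-fifths" of tasks where the best agent produced no qualifying improvement.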

| What the benchmark rewards | The bar to pass | What a high score proves |
| --- | --- | --- |
| Reproduction (most agent evals) | match a known result | The agent can re-derive existing work: useful, but invents nothing new |
| Discovery (NatureBench) | beat published SOTA, effect size g > 0.1 [paper] | The agent advanced the state of the art, which it managed on only 17.8% of tasks |
