What is discovery-vs-reproduction agent benchmarking?

It is the distinction between testing whether an agent can re-derive a known result (reproduction) and whether it can produce a new, better one (discovery). NatureBench (arXiv 2606.24530, June 2026) is built for the second: each of its 90 tasks, drawn from peer-reviewed Nature-family papers, asks a coding agent to beat the paper's published state of the art, judged by an effect-size threshold of g > 0.1. On that bar the strongest agent succeeds on only 17.8% of tasks, exposing a large gap between reproducing science and advancing it.

Why does the 17.8% number matter?

Because it measures the capability that matters for research: advancing a result rather than restating it. Many agent benchmarks reward reproducing a known answer, which an agent can do while inventing nothing. NatureBench credits an agent only for measurably beating the published result, so its 17.8% success rate across 90 real, peer-reviewed problems suggests today's best coding agents are far readier to reproduce science than to advance it. The effect-size threshold (g > 0.1) is meant to screen out tiny, noisy gains.

What is NatureGym?

NatureGym is the automated pipeline behind NatureBench. It turns each source paper into a standardized, per-task containerized environment: an isolated, self-contained setup with the code, data, and dependencies needed to run the task identically for every agent. That consistency is what makes the benchmark fair and its scores comparable across different agents and over time.

NatureBench: coding agents beat Nature-paper SOTA on just 17.8% of tasks — Discovery vs reproduction agent benchmarking

Jargon

SOTA (state of the art): The best published result on a task so far. NatureBench uses each source paper's reported SOTA as the bar an agent must beat, not just reach.
Reproduction vs discovery: Reproduction = re-deriving a known result; discovery = producing a new, better one. NatureBench is built to separate the two — agents reproduce far more readily than they advance.
Effect size (g): A standardized measure of how big an improvement is, independent of units. NatureBench counts a task as "beaten" only when the gain clears g > 0.1 — a real, non-trivial improvement rather than noise.
NatureGym: NatureBench's automated pipeline that turns each source paper into a standardized, per-task containerized environment, so every agent is evaluated under identical, reproducible conditions.
Containerized environment: A task packaged in an isolated, self-contained box (code, data, dependencies) that runs the same everywhere. It is what makes a benchmark fair and repeatable across agents.
Coding agent: An agent that writes and runs code to solve a task. NatureBench's tasks require one to implement and improve on a paper's method, not just chat about it.

The news. On June 23, 2026, researchers released NatureBench, a benchmark of 90 tasks sourced from peer-reviewed Nature-family papers. Each task asks a coding agent not to reproduce a result but to beat the paper's published state of the art, judged by an effect-size threshold (g > 0.1). The strongest model surpasses SOTA on only 17.8% of tasks, exposing a large gap between reproduction and genuine discovery. The tasks are built by NatureGym, an automated pipeline that turns each paper into a standardized, per-task containerized environment.

Picture a cook-off with a twist. The judges are not asking you to recreate the famous chef's signature dish — they already have it scored and on the board — they are asking whether what you plate is measurably better than the record. Plenty of cooks can follow the recipe and reproduce the dish faithfully. Far fewer can change something and have the judges agree the result actually improved. NatureBench grades coding agents the same way: every task comes with a published best result, and you only score if you beat it.

This is a sharper question than it sounds, and it is the whole point. Most agent evaluations reward reproducing a known answer — and an agent can look very capable while doing nothing a human had not already done. NatureBench deliberately moves the goalposts from "can the agent re-derive this result?" to "can the agent advance it?" — and to make "advance" rigorous, a task counts as beaten only when the improvement clears an effect-size threshold of g > 0.1, so a tiny, noisy gain does not count as a discovery.
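To make the threshold concrete, here is a minimal sketch of an effect-size check. It assumes that g means Hedges' g (a standardized mean difference with a small-sample correction), that higher scores are better, and that both the agent and the published baseline have several repeated scores to compare. None of that is confirmed by the summary above, and the paper may compute g differently; the function names are illustrative only.

```python
import math
from statistics import mean, variance

G_THRESHOLD = 0.1  # the bar quoted for NatureBench


def hedges_g(agent_scores, baseline_scores):
    """Standardized mean difference (agent minus baseline), Hedges-corrected.

    Assumption: this is one common definition of g, not necessarily
    the exact formula used by the benchmark.
    """
    n1, n2 = len(agent_scores), len(baseline_scores)
    if n1 < 2 or n2 < 2:
        raise ValueError("need at least two scores per group")
    # Pooled sample variance across both groups.
    pooled_var = (
        (n1 - 1) * variance(agent_scores) + (n2 - 1) * variance(baseline_scores)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("scores have zero variance; g is undefined")
    cohen_d = (mean(agent_scores) - mean(baseline_scores)) / math.sqrt(pooled_var)
    # Usual approximate small-sample correction factor.
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohen_d * correction


def beats_sota(agent_scores, baseline_scores, threshold=G_THRESHOLD):
    """True only when the standardized gain clears the threshold."""
    return hedges_g(agent_scores, baseline_scores) > threshold


# Illustrative numbers, not benchmark data.
baseline = [0.80, 0.81, 0.79, 0.80]
clear_gain = [0.82, 0.84, 0.83, 0.85]
noisy_gain = [0.801, 0.80, 0.79, 0.81]

print(beats_sota(clear_gain, baseline))  # True
print(beats_sota(noisy_gain, baseline))  # False
```

The second case is the one the threshold exists for: the agent's mean is slightly higher, but the gain is small relative to the run-to-run spread, so it does not count.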

Making 90 paper-derived tasks into a fair test is its own engineering problem, which is what NatureGym solves: an automated pipeline that packages each source paper into a standardized, per-task containerized environment. Because every agent runs the same isolated setup, the scores are comparable and repeatable — a golden-case suite for scientific improvement rather than for chat.
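As a rough illustration of what "the same isolated setup for every agent" can look like, the sketch below builds a `docker run` command from a task description. Everything here is hypothetical: the `TaskSpec` fields, the mount paths, the image name, and the choice of Docker itself are assumptions for the example, not details of how NatureGym is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one containerized task."""
    task_id: str
    image: str                 # ideally pinned by digest so it never drifts
    data_dir: str              # host directory holding the task's data
    command: tuple[str, ...]   # evaluation entry point inside the container


def docker_argv(task: TaskSpec, agent_workdir: str) -> list[str]:
    """Build the argument list for running one task in isolation."""
    return [
        "docker", "run", "--rm",              # discard the container afterwards
        "--network", "none",                  # no network access during the run
        "-v", f"{task.data_dir}:/data:ro",    # task data mounted read-only
        "-v", f"{agent_workdir}:/workspace",  # the agent's code and outputs
        "-w", "/workspace",
        task.image,
        *task.command,
    ]


# Illustrative values only.
spec = TaskSpec(
    task_id="task-001",
    image="example/task-001:latest",
    data_dir="/srv/tasks/task-001/data",
    command=("python", "evaluate.py"),
)
argv = docker_argv(spec, "/tmp/agent-run")
print(" ".join(argv))
```

Because the image, the data mount, and the entry point come from the task description rather than from the agent, two different agents given the same `TaskSpec` would be run under the same conditions, which is the property that makes scores comparable.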

Sit with the headline number. Across the 90 tasks, the strongest coding agent beat the published SOTA on only 16 of them, or 17.8%. On more than four-fifths of real, peer-reviewed problems, the best agent could not produce a genuine improvement over what humans had already published. That gap is the result that matters: it suggests today's agents are far readier to reproduce science than to advance it, and a benchmark that scores the second thing is what makes the gap visible.
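The headline arithmetic is easy to check. The short sketch below recovers the task count from the reported rate; the only inputs are the two figures quoted above (90 tasks, 17.8%), and the variable names are just for illustration.

```python
TOTAL_TASKS = 90
REPORTED_RATE = 0.178  # 17.8%, as quoted

# 90 * 0.178 = 16.02, so the nearest whole number of tasks is 16.
beaten = round(TOTAL_TASKS * REPORTED_RATE)
rate = beaten / TOTAL_TASKS

print(f"beaten:     {beaten}/{TOTAL_TASKS} = {rate:.1%}")      # 16/90 = 17.8%
print(f"not beaten: {TOTAL_TASKS - beaten}/{TOTAL_TASKS} = {1 - rate:.1%}")  # 74/90 = 82.2%
```

Sixteen is the only whole number of tasks that rounds to 17.8% of 90 (15 would give 16.7%, 17 would give 18.9%), and the remaining 74 tasks, 82.2%, are the "more than four-fifths" where no agent cleared the bar.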

What the benchmark rewards	The bar to pass	What a high score proves
Reproduction (most agent evals)	match a known result	The agent can re-derive existing work — useful, but invents nothing new
Discovery (NatureBench)	beat published SOTA, effect size g > 0.1 [paper]	The agent advanced the state of the art — managed on only 17.8% of tasks


Related explainers

Agent leaderboards — predictive validity — whether a benchmark score predicts real capability, the deeper question NatureBench's "beat SOTA" bar is built around.
WeaveBench — trajectory-aware grading — another move to grade how an agent got there, not just the final number.
Pantheon bench — HITL vs autonomous coding — a different lens on how good coding agents really are when measured carefully.

