No Priors

Noam Brown on Why Reasoning Models Need Budget Curves — Interview, Explained

Noam Brown · Research Scientist at OpenAI

2026-06-26 · ~36 min · English · Other

Reasoning · LLM · AI Safety

TL;DR

Noam Brown argues that reasoning models must be evaluated as budget curves, because more test-time compute can unlock capabilities that static benchmark grids hide.

01 Core Mental Model

Capability Is a Budget Curve

A reasoning model is no longer a fixed score. Its capability changes with the amount of test-time compute you spend.

Capability as a Function of Inference Budget

The problem is we're in a world now where the capability of the model is a function of how much money you put into it.
— Noam Brown, No Priors

Key Insight

Brown is reframing model evaluation from product comparison to resource allocation. The practical question becomes less “Which model is smarter?” and more “What capability do I buy at this budget?”

02 Evaluation Design

The Grid Hides the X-Axis

Static benchmark grids collapse the most important variable. Brown wants scores plotted against tokens, cost, or time.

From One-Number Grids to Budget Curves

my claim is the proper way to evaluate the models now is: you either have some kind of budget for the benchmark, whether it's tokens or cost or time or whatever, or you plot the performance as a function of the amount of test-time compute that's going into the model
— Noam Brown, No Priors

Key Insight

The grid format creates a bad equilibrium: every lab publishes it because every other lab does. Brown is trying to make it socially acceptable to lead with curves instead.
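To make the contrast with a one-number grid concrete, here is a minimal Python sketch of a budget curve: repeated benchmark runs are grouped by budget and averaged, so the result is a list of points rather than a single score. The function name `budget_curve`, the choice of tokens as the budget unit, and every number below are illustrative assumptions, not data from the interview.

```python
from collections import defaultdict
from statistics import mean


def budget_curve(runs):
    """Group (budget, score) pairs by budget and average the scores.

    Returns a list of (budget, mean_score) tuples sorted by budget.
    """
    by_budget = defaultdict(list)
    for budget, score in runs:
        by_budget[budget].append(score)
    return [(b, mean(scores)) for b, scores in sorted(by_budget.items())]


# Made-up illustrative numbers, not real benchmark results.
runs = [
    (1_000, 0.25), (1_000, 0.35),
    (10_000, 0.5), (10_000, 0.6),
    (100_000, 0.75),
]
curve = budget_curve(runs)
for budget, score in curve:
    print(f"{budget:>8} tokens  mean score {score:.2f}")
```

The same shape works if the budget axis is dollars or wall-clock time instead of tokens; the point is only that the x-axis is reported at all.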

03 Scaling Behavior

The Plateau Moved Out of Reach

“Run until it plateaus” stopped being practical. Modern models can keep improving for weeks on scaffolded tasks.

Why Max-Settings Evaluation Breaks

what we're seeing today with the modern models is that 5.5 and other models, if you scaffold them reasonably well, can think for weeks even before having performance plateau on some of these benchmarks
— Noam Brown, No Priors

Key Insight

The old stopping rule assumed the curve flattened quickly. Once the flat part moves beyond any reasonable release-cycle budget, evaluation needs explicit patience or cost limits.
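One way to read "explicit patience or cost limits" is as a stopping rule with two exits. The Python sketch below is an assumption about how such a rule could look, not a procedure described in the interview; `score_at` is a hypothetical callback standing in for "run the benchmark with this many budget increments", and `patience`, `min_gain`, and `max_cost` are illustrative parameters.

```python
def run_with_limits(score_at, step_cost, max_cost, patience, min_gain=0.01):
    """Extend the budget step by step until a limit is hit.

    score_at(step) -> score after `step` budget increments (hypothetical).
    Stops when the next step would exceed `max_cost`, or when `patience`
    consecutive steps each improve the best score by less than `min_gain`.
    Returns (steps_taken, best_score, reason).
    """
    best = float("-inf")
    stale = 0
    step = 0
    while (step + 1) * step_cost <= max_cost:
        step += 1
        score = score_at(step)
        if score - best >= min_gain:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return step, best, "patience"
    return step, best, "cost_cap"


# Toy curves with made-up shapes: one flattens early, one never flattens.
flat = run_with_limits(lambda s: min(0.5, 0.25 * s), 10, 1_000, patience=2)
rising = run_with_limits(lambda s: 0.1 * s, 10, 50, patience=2)
print(flat)
print(rising)
```

The returned reason matters for reporting: "patience" means a plateau was observed within budget, while "cost_cap" means the run was cut off and the ceiling is still unknown.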

04 AI Safety

Safety Evals Need a Budget Dial

Dangerous-capability policies were built for fixed-capability models. Brown says they now need to specify the budget being tested.

The Missing Question in Preparedness Frameworks

At what budget should you evaluate these models? The policies that exist today don't really address that question
— Noam Brown, No Priors

Key Insight

This is the safety mirror of the capability story. If useful capabilities scale with budget, unwanted capabilities can scale the same way, so “safe enough” only means something at a declared spend.
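A small Python sketch of what "safe enough at a declared spend" could mean in a report. This is an illustration under stated assumptions, not any lab's actual preparedness procedure: `safety_verdict`, the dollar budgets, the risk scores, and the threshold are all hypothetical.

```python
def safety_verdict(curve, threshold):
    """Report the lowest tested budget at which a risk score crosses a threshold.

    curve: list of (budget_usd, risk_score) pairs (hypothetical measurements).
    If the threshold is never crossed, the verdict names the largest budget
    tested, because nothing is known beyond it.
    """
    curve = sorted(curve)
    max_tested = curve[-1][0]
    for budget, score in curve:
        if score >= threshold:
            return f"threshold crossed at ${budget:,}"
    return f"below threshold up to ${max_tested:,} (untested beyond)"


# Made-up illustrative measurements.
measurements = [(100, 0.1), (10_000, 0.3), (100_000, 0.6)]
print(safety_verdict(measurements, threshold=0.5))
print(safety_verdict(measurements, threshold=0.9))
```

Both verdicts carry a dollar figure, which is the design point: a pass that does not name the budget it was measured at says little about what happens at higher spend.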

05 Release Cycles

You Can Ship Before You Know the Ceiling

The model-release cycle is now shorter than the evaluation horizon. Nobody fully knows a model’s ceiling before the next model arrives.

Release Cadence vs. Long-Horizon Evaluation

nobody actually knows what the ceiling of capabilities are for these models because nobody's actually run them for long enough to really tell
— Noam Brown, No Priors

Key Insight

This creates a permanent capability overhang. The shipped model may already contain abilities the market has not discovered because the discovery process itself takes weeks or months.

06 Latent Capability

Scaffolding Turns Old Models Into New Systems

The same model can look qualitatively different when wrapped in a search process. Brown’s math example is really about orchestration, not just raw intelligence.

Direct Prompt vs. Scaffolded Search

nobody had explored sufficiently what happens if I put $100,000 worth of compute into 5.5 what could it do?
— Noam Brown, No Priors

Key Insight

For agent builders, this is the money quote: scaffolds are not cosmetic wrappers. They can convert a fixed model into a much larger search procedure with a different capability profile.
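A toy calculation shows why a scaffold can change the capability profile of a fixed model. The Python sketch below assumes the simplest possible scaffold, which is an assumption for illustration and not the setup Brown describes: independent attempts plus a perfect verifier that recognizes a correct answer. Under those assumptions, the chance that at least one of n attempts succeeds is 1 - (1 - p)^n.

```python
def success_with_n_attempts(p, n):
    """Probability that at least one of n independent attempts succeeds.

    Assumes each attempt succeeds with probability p and that a perfect
    verifier picks out a correct attempt when one exists.
    """
    return 1 - (1 - p) ** n


# A task the model solves only 1% of the time per attempt (made-up figure).
for n in (1, 10, 100, 1_000):
    print(f"n={n:>5}  success={success_with_n_attempts(0.01, n):.3f}")
```

Real scaffolds have imperfect verifiers and correlated attempts, so the actual gain is smaller than this idealized bound, but the direction is the same: spending more on search moves a fixed model along its budget curve.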

07 Recursive Improvement

The Bottleneck Is Research Taste

AI is accelerating researchers unevenly. Optimization gets faster first; taste, agenda-setting, and time remain bottlenecks.

Why Speedups Do Not Instantly Replace Researchers

right now it's more about transforming what researchers do rather than fully replacing the researchers
— Noam Brown, No Priors

Key Insight

Brown’s view is not “no takeoff.” It is “takeoff has friction.” The friction comes from parts of research that do not become 100x faster at the same time.

08 Product Strategy

Routers Must Beat Thinking Longer

A routing layer only wins if it beats the same spend on one model thinking longer. Consensus is useful, but the budget accounting still applies.

Routing vs. Longer Thinking Under the Same Budget

once you control for the amount of test-time compute, is it actually still doing better? That's the question that you want to figure out
— Noam Brown, No Priors

Key Insight

This is a clean product test for AI infrastructure vendors: compare orchestration against the strongest single-model baseline at equal cost, not against a cheaper baseline that makes the routing layer look good.
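The equal-cost test can be sketched in Python as follows. The idea is to read the single-model budget curve at exactly the router's cost and compare scores there. The function names, the linear interpolation between measured points, and all numbers are assumptions for illustration.

```python
def score_at_cost(curve, cost):
    """Linearly interpolate a (cost, score) curve at `cost`.

    Assumes distinct cost values. Outside the measured range the nearest
    endpoint is returned; above the range this may understate a baseline
    that is still improving.
    """
    curve = sorted(curve)
    if cost <= curve[0][0]:
        return curve[0][1]
    if cost >= curve[-1][0]:
        return curve[-1][1]
    for (c0, s0), (c1, s1) in zip(curve, curve[1:]):
        if c0 <= cost <= c1:
            return s0 + (s1 - s0) * (cost - c0) / (c1 - c0)


def router_wins(router_cost, router_score, single_model_curve):
    """True only if the router beats the single model at the same spend."""
    return router_score > score_at_cost(single_model_curve, router_cost)


# Made-up curve for one model thinking longer: (cost in USD, score).
single_model_curve = [(1, 0.4), (3, 0.6), (10, 0.8)]
print(router_wins(2, 0.55, single_model_curve))
print(router_wins(2, 0.45, single_model_curve))
```

In this toy data, a router scoring 0.45 at $2 beats the $1 single-model point (0.4) yet loses at equal cost, which is the cheaper-baseline trap the comparison is meant to avoid.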