No Priors

Noam Brown on Why Reasoning Models Need Budget Curves — Interview, Explained

Noam Brown · Research Scientist at OpenAI
~36 min · English · Other
Reasoning · LLM · AI Safety
TL;DR

Noam Brown argues that reasoning models must be evaluated as budget curves, because more test-time compute can unlock capabilities that static benchmark grids hide.

01 Core Mental Model

Capability Is a Budget Curve

A reasoning model is no longer a fixed score. Its capability changes with the amount of test-time compute you spend.

The problem is we're in a world now where the capability of the model is a function of how much money you put into it.

Noam Brown, No Priors
Key Insight
Brown is reframing model evaluation from product comparison to resource allocation. The practical question becomes less “Which model is smarter?” and more “What capability do I buy at this budget?”

02 Evaluation Design

The Grid Hides the X-Axis

Static benchmark grids collapse the most important variable. Brown wants scores plotted against tokens, cost, or time.

my claim is the proper way to evaluate the models now is you either have some kind of budget for the benchmark, whether it's tokens or cost or time or whatever, or you plot the performance as a function of the amount of test-time compute that's going into the model

Noam Brown, No Priors
Key Insight
The grid format creates a bad equilibrium: every lab publishes it because every other lab does. Brown is trying to make it socially acceptable to lead with curves instead.
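A minimal sketch of what reading scores as a budget curve could look like, assuming you already have benchmark scores measured at several test-time budgets. The model labels, the numbers, and the helper names (`Run`, `budget_curve`, `score_at_budget`, `best_model`) are all hypothetical and exist only for illustration; the budget unit could be tokens, dollars, or seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str     # hypothetical model label
    budget: float  # test-time spend per task (tokens, dollars, or seconds)
    score: float   # benchmark score measured at that budget, in [0, 1]


def budget_curve(runs, model):
    """Return the (budget, score) points for one model, sorted by budget."""
    return sorted((r.budget, r.score) for r in runs if r.model == model)


def score_at_budget(runs, model, budget):
    """Best measured score at or under `budget`, or None if nothing was measured."""
    affordable = [s for b, s in budget_curve(runs, model) if b <= budget]
    return max(affordable) if affordable else None


def best_model(runs, budget):
    """Name of the model with the highest measured score at or under `budget`."""
    models = sorted({r.model for r in runs})
    scored = [(score_at_budget(runs, m, budget), m) for m in models]
    scored = [(s, m) for s, m in scored if s is not None]
    return max(scored)[1] if scored else None


# Made-up numbers: model_a leads at small budgets, model_b at large ones.
RUNS = [
    Run("model_a", 1, 0.40), Run("model_a", 10, 0.55), Run("model_a", 100, 0.60),
    Run("model_b", 1, 0.30), Run("model_b", 10, 0.50), Run("model_b", 100, 0.75),
]

print(best_model(RUNS, 10))
print(best_model(RUNS, 100))
```

With these made-up numbers the ranking flips between a budget of 10 and a budget of 100, which is exactly the information a single-column grid cannot show.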

03 Scaling Behavior

The Plateau Moved Out of Reach

“Run until it plateaus” stopped being practical. Modern models can keep improving for weeks on scaffolded tasks.

what we're seeing today with the modern models is that 5.5 and other models, if you scaffold them reasonably well, can think for weeks even before having performance plateau on some of these benchmarks

Noam Brown, No Priors
Key Insight
The old stopping rule assumed the curve flattened quickly. Once the flat part moves beyond any reasonable release-cycle budget, evaluation needs explicit patience or cost limits.
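One way to make "explicit patience or cost limits" concrete is a stopping rule that records why the evaluation ended. This is a sketch under assumptions, not a description of any lab's procedure: `evaluate_with_limits`, its parameters, and the toy curves are hypothetical, and `score_at` stands in for whatever actually runs the benchmark at a given budget.

```python
def evaluate_with_limits(score_at, budgets, max_budget, min_gain=0.01, patience=2):
    """Walk increasing budgets and stop on a cost cap or a detected plateau.

    Returns (history, reason), where reason is one of:
      "budget_cap": the next budget would exceed max_budget
      "plateau":    `patience` consecutive steps gained less than min_gain
      "exhausted":  every listed budget was run without either trigger
    """
    best = None
    stale = 0
    history = []
    for budget in budgets:
        if budget > max_budget:
            return history, "budget_cap"
        score = score_at(budget)
        history.append((budget, score))
        if best is None or score - best >= min_gain:
            best = score   # meaningful improvement: reset the patience counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return history, "plateau"
    return history, "exhausted"


# Toy curves (made-up numbers): one still rising, one flat.
rising = {1: 0.30, 10: 0.45, 100: 0.60, 1000: 0.70, 10000: 0.80}
flat = {1: 0.50, 10: 0.50, 100: 0.50, 1000: 0.50}

print(evaluate_with_limits(rising.get, sorted(rising), max_budget=500))
print(evaluate_with_limits(flat.get, sorted(flat), max_budget=500))
```

On the rising curve the run ends with `"budget_cap"` while the score is still climbing, so the reported number is a score at a declared spend rather than a ceiling. Reporting the stop reason alongside the score keeps that distinction visible.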

04 AI Safety

Safety Evals Need a Budget Dial

Dangerous-capability policies were built for fixed-capability models. Brown says they now need to specify the budget being tested.

At what budget should you evaluate these models? The policies that exist today don't really address that question

Noam Brown, No Priors
Key Insight
This is the safety mirror of the capability story. If useful capabilities scale with budget, unwanted capabilities can scale the same way, so “safe enough” only means something at a declared spend.

05 Release Cycles

You Can Ship Before You Know the Ceiling

The model-release cycle is now shorter than the evaluation horizon. Nobody fully knows a model’s ceiling before the next model arrives.

nobody actually knows what the ceiling of capabilities are for these models because nobody's actually run them for long enough to really tell

Noam Brown, No Priors
Key Insight
This creates a permanent capability overhang. The shipped model may already contain abilities the market has not discovered because the discovery process itself takes weeks or months.

06 Latent Capability

Scaffolding Turns Old Models Into New Systems

The same model can look qualitatively different when wrapped in a search process. Brown’s math example is really about orchestration, not just raw intelligence.

nobody had explored sufficiently what happens if I put $100,000 worth of compute into 5.5 what could it do?

Noam Brown, No Priors
Key Insight
For agent builders, this is the money quote: scaffolds are not cosmetic wrappers. They can convert a fixed model into a much larger search procedure with a different capability profile.
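A toy calculation can show why a search wrapper changes the capability profile of a fixed model. It rests on two strong assumptions that are not from the interview: each attempt succeeds independently with the same probability, and a perfect verifier recognizes a correct answer when one appears. The function name `solve_rate_with_search` is hypothetical.

```python
def solve_rate_with_search(p_single, n_attempts):
    """Chance that at least one of n independent attempts succeeds.

    Assumes independent attempts with equal success probability and a
    perfect verifier; real scaffolds rarely satisfy either assumption.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if n_attempts < 0:
        raise ValueError("n_attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_attempts


# A task the model solves 1% of the time in a single attempt.
for n in (1, 10, 100, 1000):
    print(n, round(solve_rate_with_search(0.01, n), 3))
```

Under these idealized assumptions, a task that looks nearly unsolved at one attempt becomes likely to be solved at a few hundred attempts, with no change to the underlying model. Correlated failures or an imperfect verifier would flatten this curve.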

07 Recursive Improvement

The Bottleneck Is Research Taste

AI is accelerating researchers unevenly. Optimization gets faster first; taste, agenda-setting, and time remain bottlenecks.

right now it's more about transforming what researchers do rather than fully replacing the researchers

Noam Brown, No Priors
Key Insight
Brown’s view is not “no takeoff.” It is “takeoff has friction.” The friction comes from parts of research that do not become 100x faster at the same time.

08 Product Strategy

Routers Must Beat Thinking Longer

A routing layer only wins if it beats the same spend on one model thinking longer. Consensus is useful, but the budget accounting still applies.

once you control for the amount of test-time compute, is it actually still doing better? That's the question that you want to figure out

Noam Brown, No Priors
Key Insight
This is a clean product test for AI infrastructure vendors: compare orchestration against the strongest single-model baseline at equal cost, not against a cheaper baseline that makes the routing layer look good.
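The equal-cost test can be sketched as a comparison between the router's measured points and the best the single-model baseline achieves at the same or lower spend. The function names and all numbers below are hypothetical; the only idea taken from the interview is controlling for test-time compute before declaring a winner.

```python
def baseline_score_at(baseline_points, cost):
    """Best baseline score measured at or under `cost`, or None if none exists."""
    affordable = [s for c, s in baseline_points if c <= cost]
    return max(affordable) if affordable else None


def router_advantage(router_points, baseline_points):
    """For each router (cost, score) point, return (cost, score - baseline score).

    Caveat: with sparse baseline measurements this understates the baseline,
    so sample the baseline densely near each router cost for a fair comparison.
    """
    advantages = []
    for cost, score in sorted(router_points):
        base = baseline_score_at(baseline_points, cost)
        if base is not None:
            advantages.append((cost, round(score - base, 4)))
    return advantages


# Made-up (cost, score) measurements.
BASELINE = [(1, 0.50), (5, 0.62), (20, 0.70)]  # one model thinking longer
ROUTER = [(5, 0.65), (20, 0.68)]               # routing layer at total spend

print(router_advantage(ROUTER, BASELINE))
```

In this toy data the router is ahead at a cost of 5 and behind at a cost of 20, so "does routing help?" has a different answer at each budget. A positive advantage against a cheaper baseline would not show up as a win here, because the baseline is always allowed to spend up to the router's cost.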