Noam Brown on Why Reasoning Models Need Budget Curves — Interview, Explained
Noam Brown argues that reasoning models must be evaluated as budget curves, because more test-time compute can unlock capabilities that static benchmark grids hide.
Capability Is a Budget Curve
A reasoning model is no longer a fixed score. Its capability changes with the amount of test-time compute you spend.
The problem is we're in a world now where the capability of the model is a function of how much money you put into it.
The Grid Hides the X-Axis
Static benchmark grids collapse the most important variable: how much test-time compute each score cost. Brown wants scores plotted against tokens, cost, or time.
my claim is the proper way to evaluate the models now is you either have some kind of budget for the benchmark whether it's tokens or cost or time or whatever or you plot the performance as a function of the amount of test time compute that's going into the model
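One way to picture the second option Brown describes is to aggregate benchmark runs by the budget they were given and report accuracy per budget instead of a single number. The sketch below is only an illustration: the function name `budget_curve`, the record format, and every number in `records` are made-up placeholders, not results from the interview or from any real model.

```python
from collections import defaultdict


def budget_curve(records):
    """Turn (budget, solved) pairs into a sorted list of (budget, accuracy).

    `budget` can be tokens, dollars, or seconds; the unit only has to be
    consistent across records.
    """
    totals = defaultdict(lambda: [0, 0])  # budget -> [solved count, attempt count]
    for budget, solved in records:
        totals[budget][0] += int(solved)
        totals[budget][1] += 1
    return [(b, solved / n) for b, (solved, n) in sorted(totals.items())]


# Placeholder data (hypothetical): four attempts at each of three token budgets.
records = [
    (1_000, False), (1_000, False), (1_000, True), (1_000, False),
    (10_000, True), (10_000, False), (10_000, True), (10_000, False),
    (100_000, True), (100_000, True), (100_000, True), (100_000, False),
]

curve = budget_curve(records)
for budget, accuracy in curve:
    print(f"{budget:>7} tokens: {accuracy:.2f}")
```

A static grid would report just one of these rows. The curve keeps the x-axis, so two models can be compared at the same spend.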
The Plateau Moved Out of Reach
“Run until it plateaus” stopped being practical. Modern models can keep improving for weeks on scaffolded tasks.
what we're seeing today with the modern models is that 5.5 and other models, if you scaffold them reasonably well, can think for weeks even before having performance plateau on some of these benchmarks
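To make "run until it plateaus" concrete, here is a minimal sketch of a plateau check over a budget curve. The rule it uses (the last `window` gains are all below `min_gain`) and the default thresholds are assumptions chosen for illustration, and both example curves are invented. The point is that an evaluation stopped while the check still returns `False` has not found the model's ceiling.

```python
def has_plateaued(curve, min_gain=0.01, window=2):
    """Return True if each of the last `window` budget steps improved the
    score by less than `min_gain`.

    `curve` is a list of (budget, score) pairs sorted by increasing budget.
    A curve with too few points to judge is treated as not plateaued.
    """
    if len(curve) < window + 1:
        return False
    scores = [score for _, score in curve]
    # Pair each of the last `window` scores with the score just before it.
    recent_gains = [
        later - earlier
        for earlier, later in zip(scores[-window - 1:-1], scores[-window:])
    ]
    return all(gain < min_gain for gain in recent_gains)


# Hypothetical curves: budgets in arbitrary units, scores in [0, 1].
still_rising = [(1, 0.20), (10, 0.40), (100, 0.60)]
flattened = [(1, 0.20), (10, 0.60), (100, 0.605), (1000, 0.607)]

print(has_plateaued(still_rising))
print(has_plateaued(flattened))
```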
Safety Evals Need a Budget Dial
Dangerous-capability policies were built for fixed-capability models. Brown says they now need to specify the budget being tested.
At what budget should you evaluate these models? The policies that exist today don't really address that question
You Can Ship Before You Know the Ceiling
The model-release cycle is now shorter than the evaluation horizon. Nobody fully knows a model’s ceiling before the next model arrives.
nobody actually knows what the ceiling of capabilities are for these models because nobody's actually run them for long enough to really tell
Scaffolding Turns Old Models Into New Systems
The same model can look qualitatively different when wrapped in a search process. Brown’s math example is really about orchestration, not just raw intelligence.
nobody had explored sufficiently what happens if I put $100,000 worth of compute into 5.5 what could it do?
The Bottleneck Is Research Taste
AI is accelerating researchers unevenly. Optimization gets faster first; taste, agenda-setting, and time remain bottlenecks.
right now it's more about transforming what researchers do rather than fully replacing the researchers
Routers Must Beat Thinking Longer
A routing layer only wins if it beats the same spend on one model thinking longer. Consensus is useful, but the budget accounting still applies.
once you control for the amount of test time compute, is it actually still doing better? That's the question that you want to figure out
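The control Brown asks for can be sketched as a compute-matched comparison: take the router's total spend across all the models it called, read the single model's score at that same spend off its budget curve, and only then compare. Everything here is an assumption for illustration: the helper names, the log-linear interpolation between measured budgets, and all of the numbers.

```python
import math


def score_at_budget(curve, budget):
    """Estimate a single model's score at `budget` from its budget curve.

    `curve` is a list of (budget, score) pairs sorted by increasing budget.
    Interpolates linearly in log(budget) and clamps outside the measured range.
    """
    if budget <= curve[0][0]:
        return curve[0][1]
    if budget >= curve[-1][0]:
        return curve[-1][1]
    for (b0, s0), (b1, s1) in zip(curve, curve[1:]):
        if b0 <= budget <= b1:
            t = (math.log(budget) - math.log(b0)) / (math.log(b1) - math.log(b0))
            return s0 + t * (s1 - s0)


def router_wins(single_curve, router_spend, router_score):
    """True only if the router beats one model given the same total spend."""
    return router_score > score_at_budget(single_curve, router_spend)


# Hypothetical single-model curve: (tokens, accuracy).
single_curve = [(1_000, 0.25), (10_000, 0.50), (100_000, 0.75)]

# The same router score looks good at a small spend and bad at a large one.
print(router_wins(single_curve, router_spend=10_000, router_score=0.60))
print(router_wins(single_curve, router_spend=100_000, router_score=0.60))
```

Clamping at the ends is a conservative choice: the sketch never extrapolates past the largest budget that was actually measured, which matters if the single-model curve has not yet plateaued.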