Dwarkesh Patel Podcast

Grant Sanderson on Why Math's Hardest Work Resists the Benchmark

Grant Sanderson · Creator of 3Blue1Brown

2026-06-30 · ~94 min · English · Dwarkesh Patel

Reasoning · LLM

TL;DR

AI is racing through math because math is verifiable and grindable, but asking the right question, coining the right definition, and drawing the improbable connection resist every benchmark, so the mathematician's job shifts toward curation.

01 Core Mental Model

The Spiky, Fractal Frontier

AI capability isn't a rising tide; it's a jagged frontier. Math sits on one of the tallest spikes, and zooming in on that spike reveals still more spikes.

The capability frontier stays jagged as you zoom into math

there's a spiky frontier to AI, and math is just right there in one of the spikes. But there's a fractal nature to that spikiness
— Grant Sanderson, Dwarkesh Patel Podcast

Key Insight

The concrete tell lives inside math itself: models cold-solve International Math Olympiad geometry in about nineteen seconds because it is brute-forceable, yet stumble on playful combinatorics. That makes "is AI good at math?" the wrong question, because the frontier is fractal and the honest answer is always "at which sub-task?"

02 Shapes of a Breakthrough

Lightning Bolts, Mountains, and Raw Hustle

A hard proof can arrive in three shapes: a lightning bolt between two known fields, a whole new mountain of theory, or a brute-force slog. Only the first is easy for humans to digest.

Lightning bolt vs. mountain vs. raw hustle, ranked by how digestible the result is

If the character of it is mountain building, you have to put in a lot more time to understand that new mountain that was built, because it's a new thread, not just a lightning bolt between them.
— Grant Sanderson, Dwarkesh Patel Podcast

Key Insight

The shape determines whether a solved problem also advances understanding. A lightning bolt, like the Erdős-1196 result that borrowed a Markov-chain idea, is legible the moment you see its two endpoints. A mountain, like the disputed abc-conjecture attempt, can take years to climb, only for the climbers to find at the summit that it was wrong.

03 The Value Hierarchy

Theorems Are Cheap; Definitions Are Priceless

Proving theorems is the entry tier of math, coining conjectures is rarer, and inventing the right definitions is the summit, which is exactly the part you can't turn into a benchmark.

Prove, conjecture, define, in rising order of value and falling benchmarkability

good mathematicians prove theorems, great mathematicians come up with conjectures, and the greatest mathematicians come up with definitions.
— Grant Sanderson, Dwarkesh Patel Podcast

Key Insight

Sanderson is citing a line surfaced in a Polylog video, not his own, but it frames his whole worry. He argues there is no fundamental difference between a benchmark and a training environment. The very tasks you can't score ("was that a good conjecture?", "was that the right definition?") are therefore also the tasks current reinforcement learning can't train for.

04 Why RLVR Hits a Wall

Galois and the Hundred-Year Verification Loop

Group theory took roughly a century to be recognized as valuable, and the reward signal of the day, the academy, literally rejected the teenager who invented it.

From Lagrange's hunch to Gell-Mann's quarks, the value of group theory resolved over generations

So again, thinking about verifiable reward, the verifier function that is the academy at that time is rejecting what he wrote.
— Grant Sanderson, Dwarkesh Patel Podcast

Key Insight

This is the structural ceiling on reinforcement learning from verifiable rewards. If an idea's correctness only resolves decades later, after cryptography and particle physics happen to need it, no training loop with a fast reward signal can learn to value it in the moment. The deepest work has a reward horizon that outruns any benchmark.

05 Why Math Moves Fastest

Verifiable Isn't Enough, It Has to Be Grindable

Math and code race ahead not just because answers are checkable, but because you can spin up thousands of deterministic parallel attempts. Computer use is checkable yet ungrindable, so it crawls.

Verifiable by grindable: only math and code sit in the fast quadrant

It's not just verifiability; it has to be grindable.
— Dwarkesh Patel, Dwarkesh Patel Podcast

Key Insight

Grindability is really solved credit assignment. Containerize a repo, launch hundreds of identical rollouts, and the diff between the run that succeeds and the run that fails is unambiguous. A checkout flow on Amazon can't be replayed a thousand times because bot detectors and a moving world break the loop, which is why checkable computer use still trails math and code.
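The loop described above can be sketched in a few lines. This is a minimal illustration, not anything shown in the episode: the toy task, `NUMBERS`, `TARGET`, `rollout`, `verify`, and `grind` are all hypothetical names chosen for the example. The point it tries to show is that when each attempt is fully determined by its seed, any success can be replayed and compared against any failure.

```python
import random
from concurrent.futures import ThreadPoolExecutor

NUMBERS = (3, 5, 7, 9)  # hypothetical toy task inputs
TARGET = 12             # hypothetical goal the verifier checks


def rollout(seed: int) -> dict:
    """One attempt. The seed fixes every random choice, so it can be replayed."""
    rng = random.Random(seed)
    a, b = rng.sample(NUMBERS, 2)
    op = rng.choice(["+", "-", "*"])
    value = {"+": a + b, "-": a - b, "*": a * b}[op]
    return {"seed": seed, "a": a, "b": b, "op": op, "value": value}


def verify(attempt: dict) -> bool:
    """Cheap, unambiguous check: the 'verifiable' half of the quadrant."""
    return attempt["value"] == TARGET


def grind(n_rollouts: int = 200):
    """Launch many identical rollouts in parallel: the 'grindable' half."""
    with ThreadPoolExecutor() as pool:
        attempts = list(pool.map(rollout, range(n_rollouts)))
    wins = [a for a in attempts if verify(a)]
    losses = [a for a in attempts if not verify(a)]
    return wins, losses


wins, losses = grind()
print(f"{len(wins)} successes, {len(losses)} failures")

# Replaying a seed reproduces the attempt exactly, so a winning run and a
# losing run can be compared choice by choice.
assert rollout(7) == rollout(7)
```

A live checkout flow has no equivalent of `rollout(seed)`: the site changes between attempts and bot detection interrupts the replay, so the comparison between runs is no longer clean.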

06 The Architectural Limit

Autoregression Is a Strange Way to Think

A model with superhuman breadth still misses the connection between two fields it has mastered, because the very connection worth making is, by construction, an unlikely next token.

Next-token prediction down-ranks the cross-field leap where the substance lives

But the connection where all the substance is going to come from is, by its nature, a very unlikely one.
— Grant Sanderson, Dwarkesh Patel Podcast

Key Insight

This reframes the "why can't it just connect the dots?" complaint. The model isn't missing knowledge: it holds quantum physics and analytic number theory in the same weights. The problem is that the improbable cross-field leap is precisely what next-token prediction is trained to down-rank. Sanderson's bet is that the fix lives in the data and the environment, not in the temperature knob.
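One way to see why the temperature knob alone may not be enough is a toy softmax. The sketch below uses made-up logits (the three candidate "tokens" and their scores in `LOGITS` are assumptions for illustration, not numbers from the episode). Temperature rescales every logit by the same factor, so it flattens the distribution without changing the ranking: the unlikely leap gains probability but stays last.

```python
import math

# Hypothetical scores a model might assign to three candidate continuations.
LOGITS = {
    "routine next step": 5.0,
    "standard lemma": 4.0,
    "cross-field leap": 0.0,
}


def softmax(logits, temperature):
    """Softmax over logits divided by temperature (max-subtracted for stability)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distribution(temperature):
    """Map each candidate to its sampling probability at this temperature."""
    names = list(LOGITS)
    probs = softmax([LOGITS[n] for n in names], temperature)
    return dict(zip(names, probs))


for t in (0.5, 1.0, 2.0):
    dist = distribution(t)
    ranking = sorted(dist, key=dist.get, reverse=True)
    print(f"T={t}: ranking={ranking}, p(leap)={dist['cross-field leap']:.4f}")
```

Raising the temperature samples the leap more often, but only by also sampling every other low-ranked continuation more often. Changing which continuation ranks highly requires changing the logits themselves, which is consistent with the bet on data and environment.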

07 The Proposed Fix

Don't Let All Your LLMs Be Einstein

Since autoregression collapses toward one path, the leverage is to inject diversity above the model: fan out agents with deliberately opposed goals and biases, one proving and one disproving.

A single collapsing chain vs. a fan of agents given deliberately opposed contexts

You want to make sure you don't accidentally have all your LLMs be Einstein, because you might halt progress on quantum mechanics.
— Grant Sanderson, Dwarkesh Patel Podcast

Key Insight

The underrated edge of digital minds isn't raw intelligence. It's that you can parallelize them, spawn identical copies, and engineer their context. Give one agent Einstein's reference-frames bias and another the opposite, then survey which heuristics pay off, instead of hoping a lone genius stumbles on the right frame over lunch and dies in a duel before sharing it.
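The fan-out idea can be illustrated with a deliberately small stand-in. In the sketch below, two plain Python functions play the role of agents given opposed goals on the same claim; `prover`, `disprover`, `AGENTS`, and `fan_out` are hypothetical names, and no LLM is involved. The claim is the classic false conjecture that n² + n + 41 is prime for every n ≥ 0.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(k: int) -> bool:
    """Trial division; fine for the small values used here."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True


def claim(n: int) -> bool:
    """The conjecture under test: n^2 + n + 41 is prime."""
    return is_prime(n * n + n + 41)


def prover(limit: int) -> dict:
    """Agent biased toward the claim: collects supporting cases."""
    support = sum(1 for n in range(limit) if claim(n))
    return {"role": "prove", "checked": limit, "support": support}


def disprover(limit: int) -> dict:
    """Agent biased against the claim: hunts for a single counterexample."""
    for n in range(limit):
        if not claim(n):
            return {"role": "disprove", "counterexample": n}
    return {"role": "disprove", "counterexample": None}


# Same claim, opposed goals, different search budgets (assumed values).
AGENTS = [(prover, 30), (disprover, 100)]


def fan_out() -> list:
    """Run every agent in parallel and collect the reports for comparison."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(agent, limit) for agent, limit in AGENTS]
        return [f.result() for f in futures]


reports = fan_out()
for report in reports:
    print(report)
```

The prover alone would come back with thirty confirmations and no doubts. It is the agent assigned the opposite goal that finds the failure at n = 40, where the value is 1681 = 41². That is the case for engineering disagreement into the fan rather than running identical copies.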

08 The Human Element

From Theorem-Prover to Museum Curator

Sanderson once thought AI would prove theorems and leave humans to explain them. Now he expects AI to explain better too, leaving humans the role of curator, deciding what's worth understanding at all.

As AI takes proving and explaining, the durable human role becomes curation

One interesting take that I've heard about what mathematicians will end up being is that it's actually more analogous to art museum curators than anything else.
— Grant Sanderson, Dwarkesh Patel Podcast

Key Insight

The defensible human job isn't correctness or clarity, since machines win both. It's the social act of curation. Motivation, Sanderson argues, is a social phenomenon: we follow curators we trust. It's why he rates teaching among the most stable post-AGI careers, and why his advice to would-be mathematicians is coldly practical: know where the money comes from and what value you actually add.