The news. On June 11, 2026, MiniMax released MaxProof (arXiv 2606.13473), a system that folds proof generation, verification, and repair into a single model trained with reinforcement learning. At inference it runs population-level test-time scaling (sampling many candidate proofs and selecting one through a tournament) and reports 35/42 on IMO 2025 and 36/42 on USAMO 2026, both above the human gold-medal threshold.

Picture a hall of math contestants, each handing in a proof, and one strict inspector standing at the door to the medal round. The inspector's job is not to be encouraging — it is to almost never wave through a flawed proof, even if that means occasionally failing a sound one. The survivors go into a knockout bracket and the best one wins gold. MaxProof is that hall, and the inspector is a trained model called a generative verifier.

Here is the part that flips intuition. If you want a proof of a hard problem, the obvious move is to generate more candidates: more shots on goal. But more candidates only help if your inspector is strict. With a sloppy verifier that sometimes stamps a wrong proof "valid," generating more proofs makes things worse: every extra batch is another chance for a confident-but-wrong proof to fool the checker and win the bracket. The verifier, not the generator, is the bottleneck, which is why MaxProof spends its training budget on the inspector, not just the contestant.
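To put a rough number on "another chance to fool the checker," here is a minimal sketch. It assumes each flawed proof is checked independently and uses illustrative false-positive rates of 10% and 0.5%; neither the independence assumption nor the rates come from the paper, and `prob_any_false_pass` is a hypothetical helper name.

```python
def prob_any_false_pass(n_wrong: int, false_positive_rate: float) -> float:
    """Chance that at least one of n_wrong flawed proofs is stamped valid.

    Assumption (not from the paper): every check is an independent trial
    with the same false-positive rate.
    """
    return 1.0 - (1.0 - false_positive_rate) ** n_wrong


# Illustrative rates only: a sloppy checker (10%) versus a strict one (0.5%).
for n_wrong in (10, 100, 1000):
    sloppy = prob_any_false_pass(n_wrong, 0.10)
    strict = prob_any_false_pass(n_wrong, 0.005)
    print(f"{n_wrong:>5} flawed proofs: sloppy={sloppy:.3f}  strict={strict:.3f}")
```

Under these assumptions the sloppy checker is all but certain to approve some flawed proof long before the strict one is. Neither rate reaches zero as the population grows, which is one reason a selection step after the gate still matters.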

So MaxProof trains one model to play all three roles (generate a proof, verify a proof, and repair a flawed one) with reinforcement learning. The verifier is tuned for defense in depth: layered checks that drive its false-positive rate toward zero, accepting that it will sometimes reject a correct proof. That trade is deliberate: a false negative just costs one wasted candidate, while a false positive poisons the whole tournament by crowning a wrong proof. Once the inspector is that strict, scaling is safe: sample a population, let the verifier cull it, repair the near-misses, and run a tournament to pick the survivor.
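The sample, cull, repair, and tournament flow can be sketched as a toy simulation. This is not MaxProof's implementation: every name here (`Proof`, `generate`, `layered_verify`, `repair`, `judge`, `tournament`, `run_pipeline`) and every probability is a hypothetical stand-in, and the "model" is just a random number generator peeking at a hidden correctness flag.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proof:
    text: str
    is_correct: bool  # hidden ground truth; only the toy stand-ins peek at it


def generate(rng: random.Random, n: int, p_correct: float = 0.05) -> List[Proof]:
    """Toy generator: each candidate is correct with probability p_correct."""
    return [Proof(f"candidate-{i}", rng.random() < p_correct) for i in range(n)]


def layered_verify(proof: Proof, rng: random.Random, layers: int = 3,
                   fpr_per_layer: float = 0.2, fnr_per_layer: float = 0.05) -> bool:
    """Accept only if every layer accepts. If layers err independently, three
    20% layers pass a flawed proof about 0.2 ** 3 = 0.8% of the time."""
    for _ in range(layers):
        if proof.is_correct:
            accepted = rng.random() >= fnr_per_layer
        else:
            accepted = rng.random() < fpr_per_layer
        if not accepted:
            return False
    return True


def repair(proof: Proof, rng: random.Random, p_fix: float = 0.3) -> Proof:
    """Toy repair: a flawed proof is fixed with probability p_fix."""
    if proof.is_correct:
        return proof
    return Proof(proof.text + "-repaired", rng.random() < p_fix)


def judge(a: Proof, b: Proof, rng: random.Random, p_right: float = 0.9) -> Proof:
    """Toy head-to-head judge: when exactly one proof is correct, pick it with
    probability p_right; otherwise keep the first."""
    if a.is_correct == b.is_correct:
        return a
    better, worse = (a, b) if a.is_correct else (b, a)
    return better if rng.random() < p_right else worse


def tournament(survivors: List[Proof], rng: random.Random) -> Optional[Proof]:
    """Single-elimination bracket; an unpaired proof gets a bye."""
    pool = list(survivors)
    while len(pool) > 1:
        winners = [judge(pool[i], pool[i + 1], rng)
                   for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            winners.append(pool[-1])
        pool = winners
    return pool[0] if pool else None


def run_pipeline(seed: int = 0, n: int = 200) -> Optional[Proof]:
    rng = random.Random(seed)
    survivors, rejected = [], []
    for proof in generate(rng, n):
        (survivors if layered_verify(proof, rng) else rejected).append(proof)
    # Repair the rejects and give each repaired proof one more pass at the gate.
    for proof in rejected:
        fixed = repair(proof, rng)
        if layered_verify(fixed, rng):
            survivors.append(fixed)
    return tournament(survivors, rng)
```

Calling `run_pipeline(seed=0)` returns the bracket winner, or `None` if nothing clears the gate. The design point the sketch tries to capture is that stacking checks trades a higher rejection rate on correct proofs for a much lower acceptance rate on flawed ones, which is the trade the text describes as deliberate.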

Why a strict verifier is what makes scaling work

The arithmetic is easiest on a small (illustrative) batch — the paper reports headline scores, not per-problem verifier stats. Say you sample 100 candidate proofs for one hard problem and only 5 are actually correct. The verifier has to surface a winner.

  • A sloppy verifier with a 10% false-positive rate wrongly passes about 0.10 × 95 ≈ 9.5 of the 95 wrong proofs, plus the 5 real ones — so about 14.5 proofs clear the gate and only 5 / 14.5 ≈ 34% of "verified" proofs are truly correct. Scaling to 1,000 candidates at the same rate gives about 50 real proofs but roughly 95 false passes, so the approved pile stays mostly noise — more shots don't help.
  • A defense-in-depth verifier with a 0.5% false-positive rate wrongly passes about 0.005 × 95 ≈ 0.5 proofs, plus the 5 real ones, so about 5.5 clear the gate and 5 / 5.5 ≈ 91% of verified proofs are correct. Now sampling more does help: at 1,000 candidates the same rate gives about 50 real proofs against roughly 0.005 × 950 ≈ 5 false passes, so precision holds near 91% while the pool of real proofs grows tenfold.

Same generator, same 100 candidates: dropping the false-positive rate from 10% to 0.5% flips verified-proof precision from ~34% to ~91% — and only then does spending more test-time compute pay off.
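The arithmetic above can be reproduced with a few lines. This is a sketch of the illustrative example only, not of anything measured in the paper. The helper name `verified_precision` is hypothetical, and `true_positive_rate` defaults to 1.0 because the worked numbers count all 5 correct proofs as passing (a strict verifier would in practice reject some).

```python
def verified_precision(n_candidates: int, n_correct: int,
                       false_positive_rate: float,
                       true_positive_rate: float = 1.0) -> float:
    """Expected fraction of verifier-approved proofs that are truly correct."""
    n_wrong = n_candidates - n_correct
    true_passes = true_positive_rate * n_correct
    false_passes = false_positive_rate * n_wrong
    approved = true_passes + false_passes
    # With nothing approved there is no precision to report; return 0.0.
    return true_passes / approved if approved > 0 else 0.0


# 100 candidates, 5 correct: sloppy (10%) versus strict (0.5%) verifier.
print(round(verified_precision(100, 5, 0.10), 2))
print(round(verified_precision(100, 5, 0.005), 2))
# Ten times the candidates with the sloppy verifier: same precision.
print(round(verified_precision(1000, 50, 0.10), 2))
```

Lowering `true_positive_rate` below 1.0 shows what the strictness costs: fewer correct proofs survive, but precision barely moves while the false-positive rate stays small.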

Four ways to pick a proof

| Approach | How it picks a proof | Failure mode |
|---|---|---|
| Greedy single decode | one proof, no selection | wrong whenever that single proof is flawed |
| Best-of-N + majority vote | vote over final answers | a shared, plausible mistake can win the vote |
| Best-of-N + accuracy-tuned verifier | keep a verifier-approved proof | false positives multiply as N grows |
| MaxProof (defense-in-depth verifier + tournament) | strict layered verify, then head-to-head | false passes stay rare, so scaling helps |

One honest caveat: a generative verifier is still a learned checker, not a sound formal proof system. "This proof is valid" remains a model's judgment — defense in depth shrinks its false-positive rate but does not drive it to a guaranteed zero, and the 35/42 and 36/42 figures are MiniMax's reported scores on those two exams, not a proof that the verifier is infallible.
