What is a defense-in-depth generative verifier?

It is the strict checker at the heart of MaxProof. A generative verifier reads a candidate proof and writes out a justified verdict rather than emitting a single valid/invalid bit. 'Defense in depth' means it layers several independent checks so a flawed proof has to beat all of them, which drives its false-positive rate — the rate at which it wrongly accepts a wrong proof — toward zero.

Why does MaxProof tune its verifier for a low false-positive rate?

Because at inference MaxProof samples many candidate proofs and selects a winner by tournament. If the verifier sometimes passes a wrong proof, sampling more candidates only multiplies the chances that a confident-but-wrong proof slips through and wins. A very low false-positive rate makes the opposite true: more samples mean more chances at a correct proof, while the strict verifier keeps the false passes rare.

How is MaxProof different from best-of-N with majority voting?

Majority voting picks the most common final answer, so a shared, plausible mistake can win the vote, and it never checks the reasoning. MaxProof instead trains one model to generate, verify, and repair proofs, then keeps only proofs a strict generative verifier approves and resolves ties with a tournament. It reports 35/42 on IMO 2025 and 36/42 on USAMO 2026, both above the human gold-medal threshold.

MaxProof clears IMO/USAMO gold — Defense-in-depth generative verifier

Jargon

Generative verifier: A model that doesn't just output a valid / invalid bit, but writes out a critique justifying the verdict — and can be RL-trained on that critique. In MaxProof, that generated critique is part of the trained verification behavior.
False-positive rate (FPR): The fraction of wrong proofs the verifier mistakenly stamps "valid." It is the load-bearing metric here: driving the FPR down is what makes scaling safe.
Defense in depth: Layering several independent checks so a flaw must beat all of them to slip through — the same idea the guardrails module calls defense-in-depth, here applied to proof verification.
Test-time scaling (population-level): Spending more compute at inference by sampling many candidate answers and selecting among them, instead of one greedy pass. Background: when to spend more reasoning budget.
Tournament selection: Choosing the best candidate by head-to-head comparisons rather than one absolute score — robust when the scores themselves are noisy.
Repair: A trained step that takes a flawed proof plus the verifier's critique and produces a corrected version, instead of discarding the near-miss.
Precision: Of all the proofs the verifier approved, the share that are actually correct: TP / (TP + FP). The worked example below turns the FPR into this number.

The news. On June 11, 2026, MiniMax released MaxProof (arXiv 2606.13473), a system that folds proof generation, verification, and repair into a single model trained with reinforcement learning. At inference it runs population-level test-time scaling — sampling many candidate proofs and selecting one through a tournament — and reports 35/42 on IMO 2025 and 36/42 on USAMO 2026, both above the human gold-medal threshold.

Picture a hall of math contestants, each handing in a proof, and one strict inspector standing at the door to the medal round. The inspector's job is not to be encouraging — it is to almost never wave through a flawed proof, even if that means occasionally failing a sound one. The survivors go into a knockout bracket and the best one wins gold. MaxProof is that hall, and the inspector is a trained model called a generative verifier.

Here is the part that flips intuition. If you want to crack a hard proof, the obvious move is to generate more candidates: more shots on goal. But more candidates only help if your inspector is strict. With a sloppy verifier that sometimes stamps a wrong proof "valid," generating more proofs makes things worse: every extra batch is another chance for a confident-but-wrong proof to fool the checker and win the bracket. The verifier, not the generator, is the bottleneck, which is why MaxProof spends its training budget on the inspector, not just the contestant.
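To put a number on "every extra batch is another chance," here is a minimal sketch of the probability that at least one wrong proof gets through. It assumes each wrong proof is wrongly accepted independently with the same probability, which is a simplification rather than anything the paper states; the function name and the 10% and 0.5% rates are illustrative choices.

```python
def p_any_false_pass(n_wrong: int, fpr: float) -> float:
    """Chance that at least one of n_wrong wrong proofs is accepted.

    Assumption: each wrong proof is wrongly passed independently
    with probability fpr (an illustrative model, not from the paper).
    """
    return 1.0 - (1.0 - fpr) ** n_wrong


# Compare a sloppy (10%) and a strict (0.5%) verifier as the pool grows.
for n_wrong in (10, 100, 1000):
    sloppy = p_any_false_pass(n_wrong, fpr=0.10)
    strict = p_any_false_pass(n_wrong, fpr=0.005)
    print(f"{n_wrong:>5} wrong proofs: sloppy {sloppy:.3f}, strict {strict:.3f}")
```

Under this model the probability rises with the number of wrong candidates for any nonzero rate, so a lower false-positive rate postpones the problem rather than removing it.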

So MaxProof trains one model to play all three roles (generate a proof, verify a proof, and repair a flawed one) with reinforcement learning. The verifier is tuned for defense in depth: layered checks that drive its false-positive rate toward zero, accepting that it will sometimes reject a correct proof. That trade is deliberate: a false negative just costs one wasted candidate, while a false positive can poison the whole tournament by crowning a wrong proof. Once the inspector is that strict, scaling is safe: sample a population, let the verifier cull it, repair the near-misses, and run a tournament to pick the survivor.
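The sample, cull, repair, and select steps can be sketched as a small control loop. This is a hypothetical outline, not MaxProof's implementation: the `verify`, `repair`, and `compare` callables stand in for the trained model's three roles, and the single repair attempt and simple knockout pairing are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Proof = str
# Hypothetical interfaces; none of these names come from the MaxProof paper.
Verify = Callable[[Proof], Tuple[bool, str]]  # proof -> (accepted, critique)
Repair = Callable[[Proof, str], Proof]        # (flawed proof, critique) -> revised proof
Compare = Callable[[Proof, Proof], Proof]     # head-to-head, returns the winner


def select_proof(candidates: List[Proof], verify: Verify,
                 repair: Repair, compare: Compare) -> Optional[Proof]:
    """Cull with the verifier, repair near-misses once, then run a knockout."""
    survivors: List[Proof] = []
    for proof in candidates:
        accepted, critique = verify(proof)
        if not accepted:
            # Assumption: one repair attempt; the result faces the same gate.
            proof = repair(proof, critique)
            accepted, _ = verify(proof)
        if accepted:
            survivors.append(proof)
    if not survivors:
        return None  # nothing cleared the verifier

    # Knockout bracket: pair neighbours, an odd one out gets a bye.
    while len(survivors) > 1:
        next_round = [compare(a, b)
                      for a, b in zip(survivors[0::2], survivors[1::2])]
        if len(survivors) % 2 == 1:
            next_round.append(survivors[-1])
        survivors = next_round
    return survivors[0]
```

Returning `None` when nothing survives reflects the strictness trade described above: the sketch prefers no answer over an unverified one.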

Why a strict verifier is what makes scaling work

The arithmetic is easiest on a small illustrative batch (the paper reports headline scores, not per-problem verifier stats). Say you sample 100 candidate proofs for one hard problem and only 5 are actually correct. The verifier has to surface a winner. For simplicity, assume it passes all 5 correct proofs, so only the false-positive rate varies.

A sloppy verifier with a 10% false-positive rate wrongly passes about 0.10 × 95 ≈ 9.5 of the 95 wrong proofs, plus the 5 real ones — so about 14.5 proofs clear the gate and only 5 / 14.5 ≈ 34% of "verified" proofs are truly correct. Scaling to 1,000 candidates at the same rate gives about 50 real proofs but roughly 95 false passes, so the approved pile stays mostly noise — more shots don't help.
A defense-in-depth verifier with a 0.5% false-positive rate wrongly passes about 0.005 × 95 ≈ 0.5 proofs, plus the 5 real ones, so about 5.5 clear the gate and 5 / 5.5 ≈ 91% of verified proofs are correct. Now sampling more genuinely helps: at 1,000 candidates there are about 50 real proofs against roughly 5 false passes, so the approved pile stays about 91% correct.

Same generator, same 100 candidates: dropping the false-positive rate from 10% to 0.5% flips verified-proof precision from ~34% to ~91% — and only then does spending more test-time compute pay off.
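The worked numbers above can be reproduced with a few lines of Python. This is a sketch under the same illustrative assumptions as the example (expected counts, not sampled outcomes); the function name and the `tpr` parameter, which models the share of correct proofs the verifier accepts, are additions for illustration.

```python
def verified_precision(n_candidates: int, n_correct: int,
                       fpr: float, tpr: float = 1.0) -> float:
    """Expected share of verifier-approved proofs that are actually correct.

    fpr: fraction of wrong proofs wrongly accepted.
    tpr: fraction of correct proofs accepted (1.0 means no false negatives).
    """
    true_passes = tpr * n_correct
    false_passes = fpr * (n_candidates - n_correct)
    return true_passes / (true_passes + false_passes)


print(round(verified_precision(100, 5, fpr=0.10), 3))    # 0.345
print(round(verified_precision(100, 5, fpr=0.005), 3))   # 0.913
print(round(verified_precision(1000, 50, fpr=0.10), 3))  # 0.345
```

Lowering `tpr` below 1.0 shows the other side of the trade: rejecting some correct proofs shrinks the approved pile but costs far less precision than a higher false-positive rate does.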

Four ways to pick a proof

Approach	How it picks a proof	Failure mode
Greedy single decode	one proof, no selection	wrong whenever that single proof is flawed
Best-of-N + majority vote	vote over final answers	a shared, plausible mistake can win the vote
Best-of-N + accuracy-tuned verifier	keep a verifier-approved proof	false positives multiply as N grows
MaxProof (defense-in-depth verifier + tournament)	strict layered verify, then head-to-head	false passes stay rare, so scaling helps
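The majority-vote failure mode in the table is easy to see on toy data. The candidates below are invented for illustration, and the boolean flag stands in for an idealized verifier verdict, which a real learned checker only approximates.

```python
from collections import Counter

# Hypothetical toy data: (final answer, does the reasoning hold up?)
candidates = [
    ("42", False), ("42", False), ("42", False),  # shared, plausible mistake
    ("17", True), ("17", True),                   # sound proofs
    ("9", False),                                 # isolated mistake
]

# Majority vote looks only at final answers.
majority_answer = Counter(ans for ans, _ in candidates).most_common(1)[0][0]

# Verifier-style filtering keeps only answers backed by sound reasoning.
verified_answers = {ans for ans, sound in candidates if sound}

print(majority_answer)   # 42
print(verified_answers)  # {'17'}
```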

One honest caveat: a generative verifier is still a learned checker, not a sound formal proof system. "This proof is valid" remains a model's judgment — defense in depth shrinks its false-positive rate but does not drive it to a guaranteed zero, and the 35/42 and 36/42 figures are MiniMax's reported scores on those two exams, not a proof that the verifier is infallible.

Goes deeper in: AI Agents → Workflow Patterns → Evaluator-Optimizer

Related explainers

Reasoning Arena — Bradley-Terry trace ranking — another use of a tournament: rank tied reasoning traces head-to-head when a verifiable reward goes flat
RLVR — rewarding provably correct answers — the verifiable-reward training regime MaxProof builds its generate/verify/repair loop on top of
VPO vs GRPO — vector rewards — a different angle on giving RL a richer signal than a single scalar reward

