The news. On June 8, 2026, researchers at Cornell (Zhou et al.) released Reasoning Arena (arXiv 2606.09380). It tackles a failure mode of RL with verifiable rewards: when every sampled answer in a group gets the same reward, there is no learning signal, even though the reasoning behind those answers varies. Reasoning Arena routes those tied groups to a judge that runs pairwise trace comparisons against a dynamic pool of anchor traces, then fits a Bradley-Terry model over the resulting comparison graph. The paper reports +7.6% average on competition math and coding, 27–41% faster training, and nearly 50% less generation compute.

Picture a chess club running a practice round where the only rule is "did you avoid a blunder?" If everyone played a clean game, the scoreboard reads all wins, a perfect tie, and it tells you nothing about who actually played the strongest chess. To rank the players you need them to play each other: a few head-to-head games are enough for a rating system to sort the whole club. That is exactly the move Reasoning Arena makes, except that the club is a group of reasoning attempts sampled from one LLM.

Here is the setup it fixes. Modern reasoning models are trained with RLVR (RL with verifiable rewards): the model samples several answers to a problem, an automatic checker marks each one right or wrong, and GRPO nudges the model toward the answers that beat the group's average. The catch lives in that phrase, "beat the group's average." GRPO's signal is relative: a sample's advantage is its reward minus the group mean. When a problem is easy enough that all the sampled answers pass, every reward is identical, the mean equals each of them, and every advantage collapses to zero. The group produced several perfectly good reasoning traces, and the optimizer learns nothing from any of them.
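In symbols, the mean-centered advantage described above can be written as follows. The notation (G for group size, r_i for the checker's reward, A_i for the advantage) is chosen here for illustration and is not taken from the paper; many GRPO implementations also divide by the group's reward standard deviation, which is left out to match the description in the text.

```latex
A_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j ,
\qquad
r_1 = r_2 = \dots = r_G = c
\;\Longrightarrow\;
A_i = c - \frac{1}{G}\,(G \cdot c) = 0 \quad \text{for every } i .
```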

So Reasoning Arena adds a second opinion exactly where the verifier goes silent. For a tied group, a judge model plays the traces off against each other, but not in an exhaustive round-robin, which would cost one comparison for every pair. Instead, each trace is compared against a small, dynamically updated anchor pool of earlier traces, the way a club member ranks up by playing a handful of opponents rather than the entire roster. Those sparse wins and losses feed a Bradley-Terry model, the pairwise-comparison model closely related to the Elo system behind chess ratings, which reconstructs a full numeric ranking from an incomplete comparison graph. That ranking becomes the relative reward GRPO was missing, so the tied group teaches the model again instead of being thrown away.
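To make the ranking step concrete, here is a minimal sketch of fitting Bradley-Terry strengths from a sparse set of trace-versus-anchor verdicts, using the standard minorization-maximization (MM) update. It is not the paper's implementation: the function name, the trace and anchor ids, the verdicts, and the `prior` pseudo-count (a small regularizer that keeps unbeaten or winless traces at finite scores) are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean


def fit_bradley_terry(comparisons, n_iter=200, prior=0.5):
    """Fit Bradley-Terry strengths with the MM update.

    comparisons: list of (winner_id, loser_id) judge verdicts.
    prior: pseudo-count of wins AND losses against a virtual opponent of
           strength 1 (an assumption of this sketch, not from the paper).
    Returns a dict id -> log-strength, centered to mean zero.
    """
    players = sorted({p for pair in comparisons for p in pair})
    wins = {p: prior for p in players}
    pair_counts = defaultdict(int)  # unordered pair -> number of matchups
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {p: 1.0 for p in players}
    for _ in range(n_iter):
        updated = {}
        for i in players:
            # Virtual opponent: `prior` wins plus `prior` losses against it.
            denom = 2 * prior / (strength[i] + 1.0)
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        strength = updated

    logs = {p: math.log(s) for p, s in strength.items()}
    center = mean(logs.values())
    return {p: v - center for p, v in logs.items()}


# Hypothetical verdicts: three tied traces (t0..t2), each compared with two
# anchors (a0, a1) rather than with each other.
verdicts = [
    ("t0", "a0"), ("t0", "a1"),  # t0 beats both anchors
    ("a0", "t1"), ("t1", "a1"),  # t1 splits its matchups
    ("a0", "t2"), ("a1", "t2"),  # t2 loses both
]
scores = fit_bradley_terry(verdicts)

# Mean-center the tied traces' scores so they can act as relative rewards.
tied = ["t0", "t1", "t2"]
baseline = mean(scores[t] for t in tied)
advantages = {t: scores[t] - baseline for t in tied}
print(advantages)
```

Although t0, t1, and t2 never meet directly, the shared anchors connect them in the comparison graph, which is what lets the fit place all three on one scale.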

Where the signal disappears

The arithmetic is easiest to see on a small, illustrative group, say 8 rollouts on one math problem (the paper reports only end results, not per-group numbers). Suppose the checker marks 6 correct and 2 wrong: rewards are {1,1,1,1,1,1,0,0}, the group mean is 0.75, so the six correct traces each get an advantage of +0.25 and the two wrong ones −0.75. There is plenty of signal. Now suppose all 8 pass: rewards are {1,1,1,1,1,1,1,1}, the mean is 1.0, and every advantage is 1.0 − 1.0 = 0. Eight correct, carefully reasoned rollouts, and the gradient from the entire group is zero: wasted compute, and a tie that tends to grow more common as the model gets stronger. Reasoning Arena re-spreads those tied traces into a ranking, so the best-reasoned one earns a positive advantage and the sloppiest a negative one, and there is something to learn from again.
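The same arithmetic can be checked in a few lines. This is a sketch of the mean-centered advantage only (no standard-deviation normalization), and the `judge_scores` values are made up to show what a re-spread tied group might look like; they are not numbers from the paper.

```python
from statistics import mean


def group_advantages(rewards):
    """Mean-centered advantage: each reward minus the group mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


mixed = [1, 1, 1, 1, 1, 1, 0, 0]  # 6 correct, 2 wrong
tied = [1] * 8                    # all 8 correct

mixed_adv = group_advantages(mixed)  # six values of 0.25, two of -0.75
tied_adv = group_advantages(tied)    # eight zeros: no learning signal

# Hypothetical judge-derived scores for the same tied group (invented values).
judge_scores = [0.9, 0.4, -0.2, 0.1, -0.6, 0.3, -0.5, -0.4]
ranked_adv = group_advantages(judge_scores)

print(mixed_adv)
print(tied_adv)
print(ranked_adv)
```

With the judge scores in place, the tied group once again contains both positive and negative advantages, which is all the optimizer needs to tell the traces apart.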

How a tied group gets a signal back

| Approach | What happens on an all-correct group | Judge cost | Note |
| --- | --- | --- | --- |
| Plain GRPO (baseline) | identical rewards → zero advantage → no gradient | none | the group is discarded |
| Harder problems / curriculum | hopes future groups won't tie; the tied group is still lost | none | doesn't recover this group |
| All-pairs judge ranking | compares every trace to every other to rank them | ~O(n²) comparisons | accurate but expensive |
| Reasoning Arena | ranks via a few anchor-pool matchups + Bradley-Terry | sparse, fewer than all-pairs | ~50% less generation compute (arXiv 2606.09380) |
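The gap between the two judge-based rows comes down to counting comparisons. The sketch below assumes each tied trace is matched against a fixed number of anchors; the choice of 3 anchors per trace is an illustrative assumption, since the anchor pool size the paper uses is not given here.

```python
def all_pairs_comparisons(n_traces):
    """Round-robin: one judge call for every unordered pair of traces."""
    return n_traces * (n_traces - 1) // 2


def anchor_comparisons(n_traces, anchors_per_trace):
    """Sparse scheme: each trace meets only a few anchors."""
    return n_traces * anchors_per_trace


ANCHORS_PER_TRACE = 3  # hypothetical value, not taken from the paper

counts = {
    n: (all_pairs_comparisons(n), anchor_comparisons(n, ANCHORS_PER_TRACE))
    for n in (8, 16, 32)
}
for n, (dense, sparse) in counts.items():
    print(f"{n} traces: all-pairs={dense}, anchor-pool={sparse}")
```

The all-pairs count grows quadratically with group size while the anchor count grows linearly, so the saving widens as groups get larger.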

One caveat worth keeping: the judge's pairwise verdicts are a learned, fallible signal in a way the original verifier is not — a math answer either checks out or it doesn't, but "which trace reasoned better" is an opinion. The paper's contribution is showing that a cheap, sparse tournament can extract a usable gradient from groups RLVR would otherwise waste; the +7.6% and 27–41% faster training are the authors' reported figures on their math-and-coding suite, not a guarantee that judge-ranked rewards are free of the bias every learned reward model carries.


Frequently Asked Questions