What is diversity-driven RL?

Diversity-driven RL is reinforcement-learning post-training that deliberately keeps a model's set of solution strategies varied instead of letting it narrow onto a few high-reward answers. VibeThinker-3B implements it as the Spectrum-to-Signal Principle: a diversity-exploring fine-tuning phase spreads the model across many ways to solve a problem, then max-entropy-guided RL amplifies the correct ones without flattening that spread.

Why does keeping diversity help a small model?

A 3-billion-parameter model has little spare capacity, so the rare correct reasoning path is exactly the one it can least afford to prune. Standard RLVR tends to collapse the output distribution onto a few modes (entropy collapse), discarding that path. By preserving diversity, VibeThinker-3B keeps a high pass@k, which test-time scaling then converts into a higher final score: 94.3 single-shot rising to 97.1 on AIME 2026.

How does diversity-driven RL relate to standard RLVR?

RLVR grades each attempt with a deterministic verifier and pushes the model toward what scored well — effective, but prone to entropy collapse. Diversity-driven RL keeps RLVR's verifier-graded signal but adds a maximum-entropy objective (MGPO) and a diversity-first fine-tuning stage, so the model amplifies correct answers while staying varied rather than sharpening onto one.

VibeThinker-3B hits 94.3 on AIME 2026 — Diversity-driven RL

Jargon

Diversity-driven RL: Training that deliberately keeps a model's solutions varied rather than letting reinforcement learning narrow it to one favorite answer. The variety is treated as the resource that finds hard, rare solutions.
Spectrum-to-Signal Principle (SSP): VibeThinker's two-move recipe: first spread a wide spectrum of solution strategies during supervised fine-tuning, then amplify the correct ones with RL — without flattening the spectrum back down.
RLVR: Reinforcement Learning with Verifiable Rewards. Grade each model attempt with a deterministic checker (a unit test, an equality check) and run gradient ascent on the resulting 0-or-1 reward.
Entropy collapse: The common failure mode of RL post-training: as the model chases reward, its output distribution narrows onto a few modes, so it stops exploring and can no longer reach answers it used to find by chance.
MGPO: MaxEnt-Guided Policy Optimization — the RL objective VibeThinker uses. The maximum-entropy term pushes back against collapse, rewarding the policy for staying varied while it learns.
pass@k: The chance that at least one of k sampled attempts is correct. A diverse model has a high pass@k even when its single best guess (pass@1) is shaky — there's more room for a right answer to surface.
CLR: Claim-Level Reliability Assessment — a test-time scaling step that scores the claims inside many sampled attempts and lets the reliable ones outvote the rest, turning a diverse set of tries into one stronger answer.
Parametric Compression-Coverage Hypothesis: The paper's framing of why a 3B model can rival giants on math but trails on trivia: reasoning compresses into a compact core, while broad knowledge needs sheer parameter coverage.

The news. On June 16, 2026, Weibo AI released VibeThinker-3B, a 3-billion-parameter dense model that posts 94.3 on AIME 2026 (rising to 97.1 with test-time scaling), 80.2 Pass@1 on LiveCodeBench v6, and a 96.1% acceptance rate on recent unseen LeetCode contests, yet trails much larger models on knowledge-heavy tests (70.2 on GPQA-Diamond). The recipe is the Spectrum-to-Signal Principle: a diversity-exploring fine-tuning phase that spreads many solution strategies, followed by max-entropy-guided RL that amplifies the correct ones.

Picture a seedbed. A patient gardener doesn't yank everything but the tallest sprout on day one — they let the whole bed grow, because you can't tell on day one which seedling will bear the prize fruit. Train a small reasoning model the same way and the lesson is identical: the model should explore many lines of attack on a hard problem before anyone decides which one was right. The danger is an impatient trainer who prunes too early — and the strategy you prune away might have been the only one that could crack the problem.

That impatient trainer is, in effect, ordinary RL post-training. The standard move for reasoning models is RLVR — sample an answer, check it with a verifier, and push the model toward whatever scored well. Run that loop hard and the model's output distribution narrows: it keeps replaying its few highest-reward moves and stops sampling the long-shot reasoning paths it used to try. That narrowing is entropy collapse, and it is exactly what robs a small model of the variety it depends on — a 3B core has little capacity to spare, so the rare correct strategy is precisely the one it can least afford to lose.
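To make the narrowing concrete, here is a minimal sketch that measures the Shannon entropy of a policy over a handful of solution strategies. The two distributions are hypothetical numbers chosen for illustration, not measurements from VibeThinker or any other model.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical policy over four solution strategies before RL: evenly spread.
spread = [0.25, 0.25, 0.25, 0.25]

# Hypothetical policy after reward-chasing: one strategy dominates and the
# long-shot strategies are almost never sampled.
collapsed = [0.97, 0.01, 0.01, 0.01]

entropy_spread = shannon_entropy(spread)
entropy_collapsed = shannon_entropy(collapsed)

print(f"spread policy:    {entropy_spread:.2f} bits")
print(f"collapsed policy: {entropy_collapsed:.2f} bits")
```

The collapsed policy still technically contains the long-shot strategies, but at 1% each they almost never get sampled, which is the practical meaning of entropy collapse.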

VibeThinker's fix is to split training into two moves it calls the Spectrum-to-Signal Principle. First comes the spectrum: a diversity-exploring supervised phase that deliberately seeds the model with many different ways to reach an answer, widening the distribution instead of sharpening it. Only then comes the signal: an RL stage built on MGPO (MaxEnt-Guided Policy Optimization), whose maximum-entropy term keeps rewarding the model for staying varied even as it amplifies the attempts that pass the verifier. The order is the whole trick: grow the spectrum first, then amplify the signal, so the bloom you keep is chosen from a full bed rather than being the last sprout standing.
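The effect of a maximum-entropy term can be sketched with a toy objective. This is a generic entropy-bonus formulation, assumed here for illustration only; it is not MGPO's actual formula, which this explainer does not spell out. The reward vector, the two policies, and the weight `beta` are all hypothetical.

```python
import math

def expected_reward(policy, rewards):
    """Average verifier reward under a policy over discrete strategies."""
    return sum(p * r for p, r in zip(policy, rewards))

def entropy(policy):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in policy if p > 0)

def maxent_objective(policy, rewards, beta):
    """Generic entropy-regularized objective: reward plus beta * entropy."""
    return expected_reward(policy, rewards) + beta * entropy(policy)

# Hypothetical verifier outcomes: strategies 0, 1 and 3 pass, strategy 2 fails.
rewards = [1.0, 1.0, 0.0, 1.0]

# Both policies avoid the failing strategy, so both earn full expected reward.
collapsed = [1.0, 0.0, 0.0, 0.0]      # everything on one passing strategy
varied = [1 / 3, 1 / 3, 0.0, 1 / 3]   # spread across all passing strategies

beta = 0.1  # hypothetical weight on the entropy term

score_collapsed = maxent_objective(collapsed, rewards, beta)
score_varied = maxent_objective(varied, rewards, beta)
print(score_collapsed, score_varied)
```

With reward alone the two policies tie, so nothing stops training from drifting to the collapsed one. The entropy term breaks the tie in favor of the policy that keeps every passing strategy alive.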

Dimension	Standard RLVR / GRPO	Diversity-driven RL (SSP)
SFT goal	reach high single-answer accuracy	build a broad spectrum of solution strategies
RL objective	maximize reward (sharpens to a few modes)	max-entropy guided — amplify signal, keep variety
Diversity over training	collapses (entropy collapse)	preserved by design
Effect on a 3B model	prunes the rare correct path it needed	keeps that path available; 94.3 on AIME 2026 (paper)

Why does keeping the spectrum wide actually move the score? Watch one AIME problem. Sample VibeThinker many times, say 16 attempts (illustrative), and a diversity-preserved model spreads those tries across genuinely different lines of attack, where a collapsed model would hand you 16 near-copies of the same (possibly wrong) reasoning. That spread is what buys the high pass@k, and test-time scaling is what cashes it in: Claim-Level Reliability Assessment scores the claims across those varied attempts and lets the reliable ones outvote the rest. Single-shot, VibeThinker already scores 94.3 on AIME 2026; let CLR aggregate over its diverse tries and that climbs to 97.1, a +2.8-point lift that only exists because the attempts weren't carbon copies.
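The gap between varied tries and carbon copies can be put in numbers with a small sketch. It assumes two idealized extremes: fully independent attempts, and attempts that are exact copies of each other. The per-attempt success rate `p` is a hypothetical value for a single hard problem, not a figure from the paper, and real models sit somewhere between the two extremes.

```python
def pass_at_k_independent(p, k):
    """pass@k when the k attempts succeed or fail independently."""
    return 1.0 - (1.0 - p) ** k

def pass_at_k_identical(p, k):
    """pass@k when all k attempts are copies: they pass or fail together."""
    return p

p = 0.2   # hypothetical chance that a single attempt is correct
k = 16    # the illustrative sample count used above

diverse = pass_at_k_independent(p, k)
copies = pass_at_k_identical(p, k)
print(f"independent attempts: pass@{k} = {diverse:.3f}")
print(f"identical attempts:   pass@{k} = {copies:.3f}")

# The reported AIME 2026 lift from test-time scaling, checked as arithmetic.
lift = round(97.1 - 94.3, 1)
print(f"lift: +{lift} points")
```

Sixteen copies of one attempt are no better than one attempt, while sixteen independent tries at the same per-attempt accuracy almost always include a correct answer. An aggregation step such as CLR can only select a correct answer that is present among the samples.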

That same diversity story explains the model's lopsided scorecard, which the authors frame as the Parametric Compression-Coverage Hypothesis: verifiable reasoning compresses into a compact core, so a 3B model can hit 94.3 on AIME, while broad knowledge needs raw parameter coverage, so the same model lands at just 70.2 on GPQA-Diamond. The takeaway isn't "small models are secretly huge" — it's narrower and more useful: the bottleneck for small-model reasoning was never only size; it was throwing away diversity before the model could use it.


Related explainers

CoPD paper — Reinforcement Learning with Verifiable Rewards (RLVR) — the standard verifier-graded RL loop that diversity-driven RL is reacting against
DRPO — smooth trust-region penalty — another fix to RL post-training, but aimed at stability rather than diversity
Non-monotonic pretrain distillation — the distillation side of squeezing big-model behavior into a smaller one

