The news. On June 16, 2026, Weibo AI released VibeThinker-3B, a 3-billion-parameter dense model that posts 94.3 on AIME 2026 (rising to 97.1 with test-time scaling), 80.2 Pass@1 on LiveCodeBench v6, and a 96.1% acceptance rate on recent unseen LeetCode contests, yet trails much larger models on knowledge-heavy tests (70.2 on GPQA-Diamond). The recipe is the Spectrum-to-Signal Principle: a diversity-exploring fine-tuning phase that spreads many solution strategies, followed by max-entropy-guided RL that amplifies the correct ones.
Picture a seedbed. A patient gardener doesn't yank everything but the tallest sprout on day one — they let the whole bed grow, because you can't tell on day one which seedling will bear the prize fruit. Train a small reasoning model the same way and the lesson is identical: the model should explore many lines of attack on a hard problem before anyone decides which one was right. The danger is an impatient trainer who prunes too early — and the strategy you prune away might have been the only one that could crack the problem.
That impatient trainer is, in effect, ordinary RL post-training. The standard move for reasoning models is RLVR (Reinforcement Learning with Verifiable Rewards): sample an answer, check it with a verifier, and push the model toward whatever scored well. Run that loop hard and the model's output distribution narrows: it keeps replaying its few highest-reward moves and stops sampling the long-shot reasoning paths it used to try. That narrowing is entropy collapse, and it is exactly what robs a small model of the variety it depends on. A 3B model has little capacity to spare, so the rare correct strategy is precisely the one it can least afford to lose.
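To make "entropy collapse" concrete, here is a minimal sketch that measures the Shannon entropy of the strategies a model uses across sampled attempts. The strategy labels and the `strategy_entropy` helper are hypothetical, chosen only for illustration; real training pipelines typically track token-level policy entropy rather than hand-labeled strategies.

```python
import math
from collections import Counter

def strategy_entropy(strategies):
    """Shannon entropy (in bits) of the empirical distribution over strategy labels."""
    counts = Counter(strategies)
    total = len(strategies)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical strategy labels for 8 sampled attempts at one problem.
diverse = ["algebra", "geometry", "casework", "recursion"] * 2
collapsed = ["algebra"] * 8

print(strategy_entropy(diverse))    # 2.0 bits: four equally likely strategies
print(strategy_entropy(collapsed))  # 0.0 bits: every attempt uses one strategy
```

A drop from 2 bits toward 0 bits over training is the narrowing described above: the model still produces 8 attempts, but they no longer cover different lines of attack.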
VibeThinker's fix is to split training into two moves, which it calls the Spectrum-to-Signal Principle (SSP). First comes the spectrum: a diversity-exploring supervised phase that deliberately seeds the model with many different ways to reach an answer, widening the distribution instead of sharpening it. Only then comes the signal: an RL stage built on MGPO (MaxEnt-Guided Policy Optimization), whose maximum-entropy term keeps rewarding the model for staying varied even as it amplifies the attempts that pass the verifier. The order is the whole trick: grow the spectrum first, then amplify the signal, so the bloom you keep was chosen from a full bed rather than being the last sprout standing.
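One generic way to write "amplify reward while staying varied" is the textbook entropy-regularized RL objective below. This is an illustrative form only, not MGPO's actual objective, which this summary does not spell out. The symbols are assumptions: $\pi_\theta$ is the policy, $q$ a problem drawn from a training set $\mathcal{D}$, $y$ a sampled solution, $r(q, y)$ the verifier reward, $\mathcal{H}$ the entropy of the policy's output distribution, and $\beta \ge 0$ a weight trading reward against variety.

$$
J(\theta) = \mathbb{E}_{q \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid q)}\big[\, r(q, y) \,\big] \;+\; \beta \, \mathbb{E}_{q \sim \mathcal{D}}\big[\, \mathcal{H}\big(\pi_\theta(\cdot \mid q)\big) \,\big]
$$

With $\beta = 0$ this reduces to plain reward maximization, the setting in which the distribution is free to sharpen onto a few modes; a positive $\beta$ penalizes that sharpening.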
| Dimension | Standard RLVR / GRPO | Diversity-driven RL (SSP) |
|---|---|---|
| SFT goal | reach high single-answer accuracy | build a broad spectrum of solution strategies |
| RL objective | maximize reward (sharpens to a few modes) | max-entropy guided — amplify signal, keep variety |
| Diversity over training | collapses (entropy collapse) | preserved by design |
| Effect on a 3B model | prunes the rare correct path it needed | keeps the rare correct path in play (94.3 on AIME 2026, per the paper) |
Why does keeping the spectrum wide actually move the score? Watch one AIME problem. Sample VibeThinker many times, say 16 attempts (illustrative), and a diversity-preserved model spreads those tries across genuinely different lines of attack, where a collapsed model would hand you 16 near-copies of the same (possibly wrong) reasoning. That spread is what raises pass@k, the chance that at least one of k attempts is correct, and test-time scaling is what cashes it in: Claim-Level Reliability Assessment (CLR) scores the claims across those varied attempts and lets the reliable ones outvote the rest. Single-shot, VibeThinker already scores 94.3 on AIME 2026; let CLR aggregate over its diverse tries and that climbs to 97.1, a +2.8-point lift that only exists because the attempts weren't carbon copies.
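The two ideas in this paragraph can be sketched in a few lines. `pass_at_k` is the standard unbiased pass@k estimator used in code-generation evaluation. `weighted_vote` is a deliberately simple stand-in for reliability-based aggregation: it is not the paper's CLR procedure, and the answers and reliability scores below are made up for illustration.

```python
from collections import defaultdict
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: chance that at least one of k attempts,
    drawn without replacement from n samples (c of them correct), is correct."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

def weighted_vote(attempts):
    """Return the answer with the largest total reliability score.
    `attempts` is a list of (answer, reliability) pairs; the scores are
    hypothetical stand-ins, not the paper's CLR scoring."""
    totals = defaultdict(float)
    for answer, reliability in attempts:
        totals[answer] += reliability
    return max(totals, key=totals.get)

# 16 samples with 4 correct: one try succeeds 25% of the time.
print(pass_at_k(16, 4, 1))   # 0.25
print(pass_at_k(16, 4, 16))  # 1.0

# Hypothetical attempts: "117" wins a raw head count (3 vs 2),
# but "204" carries more total reliability.
attempts = [("204", 0.9), ("204", 0.8), ("117", 0.3), ("117", 0.2), ("117", 0.4)]
print(weighted_vote(attempts))  # 204
```

The vote only helps when the attempts disagree in informative ways. If all 16 samples repeat one line of reasoning, there is a single candidate answer and aggregation has nothing to choose between.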
That same diversity story explains the model's lopsided scorecard, which the authors frame as the Parametric Compression-Coverage Hypothesis: verifiable reasoning compresses into a compact core, so a 3B model can hit 94.3 on AIME, while broad knowledge needs raw parameter coverage, so the same model lands at just 70.2 on GPQA-Diamond. The takeaway isn't "small models are secretly huge" — it's narrower and more useful: the bottleneck for small-model reasoning was never only size; it was throwing away diversity before the model could use it.
Related explainers
- CoPD paper — Reinforcement Learning with Verifiable Rewards (RLVR) — the standard verifier-graded RL loop that diversity-driven RL is reacting against
- DRPO — smooth trust-region penalty — another fix to RL post-training, but aimed at stability rather than diversity
- Non-monotonic pretrain distillation — the distillation side of squeezing big-model behavior into a smaller one