VPO paper — Vector-reward advantage vs GRPO scalar collapse
The news. On May 21, 2026, a research paper introduced Vector Policy Optimization (VPO), a drop-in replacement for GRPO's advantage estimator. Across four tasks the authors report VPO matches or beats the strongest scalar RL baselines on pass@k and best@k, and the advantage widens as the search budget grows. The paper's bigger framing argument: "As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective."
Picture a new restaurant trying to win a Michelin star. The chef's first instinct is to drill one signature dish until it's flawless — a single tasting note, executed to the average critic's exact preference. That instinct is exactly GRPO: collapse every signal into one scalar, push every rollout toward whatever wins on that scalar, and stop when the chef can't tell two plates apart. The problem isn't the technique — it's the assumption that there is one critic. The actual reviewers come from different magazines, on different nights, with different tastes. A restaurant that only knows how to make one perfect plate is going to lose on three of the four nights.
A vector reward is the honest version of this: the reward isn't a scalar, it's a tuple. One axis per unit test in a coding benchmark. One axis per persona in a chat eval. One axis per evaluator model in an LLM-judge ensemble. Scalar RL throws all that structure away — it averages, weights, or hand-tunes the vector into a single number, then optimizes that. The policy converges on whatever response wins the average. Now sample eight rollouts from your post-trained model and ask "did any of them pass test case #3?" If your post-trained distribution has collapsed, the eight rollouts are near-duplicates of the average winner. The search has eight tries and zero diversity to search over.
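The collapse can be sketched in a few lines of Python. Everything here is made up for illustration: the two reward vectors, the rollout names, the `scalarize` and `any_passes` helpers, and the 5/3 split in the diverse sample are assumptions, not data or code from the paper.

```python
# Hypothetical reward vectors: one axis per unit test (1.0 = pass, 0.0 = fail).
rewards = {
    "rollout_a": (1.0, 1.0, 0.0),  # passes tests 1 and 2, fails test 3
    "rollout_b": (0.0, 0.0, 1.0),  # passes only test 3
}


def scalarize(vec):
    """Collapse a reward vector into its mean, as a scalar objective would."""
    return sum(vec) / len(vec)


def any_passes(samples, test_index):
    """True if at least one sampled rollout passes the given test."""
    return any(vec[test_index] == 1.0 for vec in samples)


# The scalar view prefers rollout_a (mean 0.67) over rollout_b (mean 0.33).
scalars = {name: scalarize(vec) for name, vec in rewards.items()}
best = max(scalars, key=scalars.get)

# A fully collapsed policy returns the scalar winner eight times.
collapsed = [rewards[best]] * 8
# A policy that kept both modes (assumed 5/3 split) still covers test 3.
diverse = [rewards["rollout_a"]] * 5 + [rewards["rollout_b"]] * 3

print(best, any_passes(collapsed, 2), any_passes(diverse, 2))
```

The scalar winner never passes test 3 (index 2), so eight copies of it cannot either. The mixed sample can.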
VPO fixes the loss function at exactly the spot the collapse happens: the advantage estimator. GRPO normalizes the scalar reward within a group of rollouts and uses that as the advantage. VPO keeps the per-objective structure of the reward — each rollout gets credited for which objective it specifically won on, not for its average score — and the resulting gradient pushes different rollouts toward different specializations. The training pass looks almost identical to GRPO from the outside: same rollouts, same policy network, same compute budget. The difference is internal: where GRPO collapses, VPO preserves.
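To make the contrast concrete, here is a rough sketch of the two kinds of advantage signal. The scalar half follows the standard GRPO recipe (subtract the group mean, divide by the group standard deviation). The per-objective half is only an illustrative stand-in, assuming each reward axis is normalized separately; it is not the paper's estimator, and this article does not specify how VPO combines per-objective terms. The function names and the toy group are hypothetical.

```python
from statistics import mean, pstdev

EPS = 1e-8  # guards against division by zero when a group has no variance


def group_normalize(values):
    """Standardize values within one group of rollouts."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(reward_vectors):
    """Scalar path: average the axes first, then group-normalize."""
    scalars = [mean(vec) for vec in reward_vectors]
    return group_normalize(scalars)


def per_objective_advantages(reward_vectors):
    """Illustrative vector path (an assumption): normalize each axis separately."""
    columns = list(zip(*reward_vectors))
    normalized = [group_normalize(col) for col in columns]
    return [list(row) for row in zip(*normalized)]


# Toy group: three rollouts pass tests 1 and 2, one rollout passes only test 3.
group = [
    (1.0, 1.0, 0.0),
    (1.0, 1.0, 0.0),
    (1.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]

scalar_adv = grpo_advantages(group)
vector_adv = per_objective_advantages(group)
print(scalar_adv[3], vector_adv[3])
```

In this toy group the lone test-3 specialist gets a negative scalar advantage (about -1.73), so the scalar update pushes probability away from it. In the per-axis view the same rollout carries a positive advantage on the test-3 axis (about +1.73), which is the kind of signal a vector-aware estimator can preserve.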
Where the wall-clock difference actually shows up
Hold three variables fixed. Three reward axes — say, three unit tests in a coding task. Eight rollouts sampled at inference time. Same post-trained model size and same training budget as a GRPO baseline. With GRPO and a scalar mean across the three axes, the post-trained policy converges to the rollout that wins the average — let's say it nails test 1 and test 2 but fails test 3. Sample eight rollouts from that policy and they're near-duplicates — same nailed/nailed/failed pattern across all eight. Run the test-time search: pass@8 ≈ pass@1 ≈ 28% (illustrative). Now retrain with VPO. The advantage estimator credits rollouts that specialize on test 3, so the post-trained policy keeps probability mass on responses that nail test 3 (at the cost of being weaker on tests 1 and 2). Sample eight rollouts and they spread across all three reward peaks — some nail tests 1+2, some nail test 3, some land between. The search now has eight distinct shots and pass@8 climbs to 71% (illustrative) — a ~2.5× lift on the same model, same compute budget, just by not throwing away the vector structure of the reward.
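The arithmetic behind why diversity compounds with k can be sketched as follows. This assumes rollouts are sampled independently and that the policy keeps a fixed probability mass on responses that pass test 3. The mass values (0.0 and 0.25) and the `coverage_at_k` helper are hypothetical, and the 28% and 71% figures are the illustrative numbers from the paragraph above, not measurements.

```python
def coverage_at_k(mode_mass, k):
    """Probability that at least one of k independent rollouts lands in a mode."""
    return 1.0 - (1.0 - mode_mass) ** k


# Hypothetical mass each policy keeps on responses that pass test 3.
collapsed_mass = 0.0
diverse_mass = 0.25

budgets = (1, 2, 4, 8)
collapsed_curve = [coverage_at_k(collapsed_mass, k) for k in budgets]
diverse_curve = [coverage_at_k(diverse_mass, k) for k in budgets]

# Ratio behind the "~2.5x" figure, using the illustrative 28% and 71%.
lift = 0.71 / 0.28

print(collapsed_curve, diverse_curve, round(lift, 2))
```

Under these assumptions the collapsed policy stays at zero coverage of test 3 for every budget, while the diverse policy goes from 0.25 at k=1 to roughly 0.90 at k=8. Extra samples only help when there is mass for them to find.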
| Property | GRPO | VPO |
|---|---|---|
| Reward representation | Scalar (one number) | Vector (one per objective) |
| Advantage signal | Group-normalized scalar | Per-objective advantage preserved |
| Post-training entropy | Collapses to one mode | Maintained across multiple modes |
| Sampling at inference | Near-duplicate rollouts (varies by reward structure) | Specialized rollouts (varies by reward structure) |
| pass@k scaling | Plateaus quickly (setup-dependent, illustrative) | Climbs with k (varies by paper, illustrative) |
| Sweet spot | Single-objective tasks | Test-time search, evolutionary search |
This is the same lever the temperature and top-K samplers try to pull at decode time — except those are bolted on after training, with no information about which objective each rollout specialized in. Decode-time diversity hacks can only re-shuffle the post-trained distribution; they can't put probability mass back where training already collapsed it. VPO moves the diversity question upstream into the post-training objective itself.
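A toy illustration of that limit, as a sketch under simplifying assumptions: the "logits" here sit over three whole response modes rather than over tokens, the values are invented, and `softmax` and `top_k_filter` are hand-written helpers rather than any library's API.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely entries and renormalize; zero out the rest."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


# Hypothetical collapsed policy: mode 0 dominates, modes 1 and 2 are starved.
collapsed_logits = [10.0, 0.0, 0.0]

probs_default = softmax(collapsed_logits, temperature=1.0)
probs_hot = softmax(collapsed_logits, temperature=2.0)
probs_top1 = top_k_filter(probs_hot, k=1)

print(probs_default, probs_hot, probs_top1)
```

Doubling the temperature still leaves about 98.7% of the mass on the dominant mode, and top-k truncation can only remove mass, never add it. Pushing the temperature much higher would spread mass further, but indiscriminately, since the sampler has no notion of which objective a starved mode served.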
The paper's bigger claim is the one quoted at the top: as test-time search becomes more standardized (AlphaEvolve-style evolution, best-of-N sampling, agentic planning that retries), preserving diversity stops being a hyperparameter and starts being a post-training objective. It is the same kind of methodological shift that moved the field from RLHF to RLVR: once an inference-time procedure becomes load-bearing, the post-training step has to be redesigned around what that procedure actually needs from the policy.
Related explainers
- RLVR (Reinforcement Learning with Verifiable Rewards): the post-training stage whose advantage estimator VPO replaces
- CoPD (co-evolving policy distillation): an alternative answer to "how do we get specialized post-training without collapse"
- PPOW (window-level RL for speculative drafters): a different fix for credit-assignment problems in long-rollout RL
- TIM (Training-Inference Mismatch): a related diagnostic on what goes wrong when the training and inference distributions drift apart