VPO paper — Vector-reward advantage vs GRPO scalar collapse

VPO vs GRPO — vector reward preserves policy diversity for test-time search
learnaivisually.com/ai-explained/vpo-vector-reward-vs-grpo

The news. On May 21, 2026, a research paper introduced Vector Policy Optimization (VPO) — a drop-in replacement for GRPO's advantage estimator. Across four tasks the authors report VPO matches or beats the strongest scalar RL baselines on pass@k and best@k, and the advantage widens as the search budget grows. The paper's bigger framing argument: "As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective."

Picture a new restaurant trying to win a Michelin star. The chef's first instinct is to drill one signature dish until it's flawless — a single tasting note, executed to the average critic's exact preference. That instinct is exactly GRPO: collapse every signal into one scalar, push every rollout toward whatever wins on that scalar, and stop when the chef can't tell two plates apart. The problem isn't the technique — it's the assumption that there is one critic. The actual reviewers come from different magazines, on different nights, with different tastes. A restaurant that only knows how to make one perfect plate is going to lose on three of the four nights.

A vector reward is the honest version of this: the reward isn't a scalar, it's a tuple. One axis per unit test in a coding benchmark. One axis per persona in a chat eval. One axis per evaluator model in an LLM-judge ensemble. Scalar RL throws all that structure away — it averages, weights, or hand-tunes the vector into a single number, then optimizes that. The policy converges on whatever response wins the average. Now sample eight rollouts from your post-trained model and ask "did any of them pass test case #3?" If your post-trained distribution has collapsed, the eight rollouts are near-duplicates of the average winner. The search has eight tries and zero diversity to search over.
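To make the collapse concrete, here is a small Python sketch. The rollout names and reward values are made up for illustration, and averaging is only one of the scalarizations mentioned above. It shows that once the vector is averaged, the scalar can no longer tell a rollout that covers test 3 apart from one that does not.

```python
from statistics import mean

# Hypothetical reward vectors: one entry per rollout, one axis per unit test.
rewards = {
    "rollout_a": (1.0, 1.0, 0.0),  # passes tests 1 and 2
    "rollout_b": (0.0, 0.0, 1.0),  # passes only test 3
    "rollout_c": (1.0, 0.0, 0.0),  # passes only test 1
}

def scalarize(vector):
    """Collapse a reward vector into one number by averaging (scalar RL)."""
    return mean(vector)

scalar_rewards = {name: scalarize(v) for name, v in rewards.items()}
best_by_scalar = max(scalar_rewards, key=scalar_rewards.get)

# For each axis, which rollout scores highest on that axis alone?
# Ties go to the first rollout in insertion order.
num_axes = 3
axis_winners = {
    axis: max(rewards, key=lambda name: rewards[name][axis])
    for axis in range(num_axes)
}

print("scalar rewards:", scalar_rewards)
print("winner on the average:", best_by_scalar)
print("winner per axis:", axis_winners)
```

Here rollout_b and rollout_c receive the same scalar reward, even though rollout_b is the only one that passes test 3. The per-axis view keeps that distinction; the average discards it.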

VPO fixes the loss function at exactly the spot the collapse happens: the advantage estimator. GRPO normalizes the scalar reward within a group of rollouts and uses that as the advantage. VPO keeps the per-objective structure of the reward — each rollout gets credited for which objective it specifically won on, not for its average score — and the resulting gradient pushes different rollouts toward different specializations. The training pass looks almost identical to GRPO from the outside: same rollouts, same policy network, same compute budget. The difference is internal: where GRPO collapses, VPO preserves.
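A minimal sketch of that contrast, in Python. The GRPO-style function follows the group normalization described above; using the population standard deviation and a small EPS constant are assumptions. The per-objective function is an illustration only: it applies the same normalization to each reward axis separately and is not claimed to be the paper's estimator. The reward vectors are hypothetical.

```python
from statistics import mean, pstdev

EPS = 1e-8  # assumed small constant to avoid division by zero

def grpo_advantages(scalar_rewards):
    """Group-normalized scalar advantage: (r - group mean) / group std."""
    mu = mean(scalar_rewards)
    sigma = pstdev(scalar_rewards)
    return [(r - mu) / (sigma + EPS) for r in scalar_rewards]

def per_objective_advantages(reward_vectors):
    """Illustrative assumption: normalize each reward axis within the group
    separately, so every rollout keeps one advantage per objective."""
    columns = list(zip(*reward_vectors))            # one tuple per axis
    per_axis = [grpo_advantages(list(col)) for col in columns]
    return [list(row) for row in zip(*per_axis)]    # back to one row per rollout

# Hypothetical group of four rollouts scored on three unit tests.
group = [
    (1.0, 1.0, 0.0),
    (1.0, 1.0, 0.0),
    (1.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),  # the only rollout that passes test 3
]

scalar_adv = grpo_advantages([mean(v) for v in group])
vector_adv = per_objective_advantages(group)

print("scalar advantages:", scalar_adv)
print("per-objective advantages:", vector_adv)
```

In this toy group the fourth rollout gets a negative scalar advantage, so a scalar update pushes probability away from it. On the third axis its advantage is positive, which is the signal a per-objective estimator can use. How VPO combines per-objective credit into a policy-gradient weight is not specified here.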

Where the difference actually shows up

Hold three variables fixed. Three reward axes: say, three unit tests in a coding task. Eight rollouts sampled at inference time. Same post-trained model size and same training budget as a GRPO baseline.

With GRPO and a scalar mean across the three axes, the post-trained policy converges to the rollout that wins the average. Let's say it nails test 1 and test 2 but fails test 3. Sample eight rollouts from that policy and they're near-duplicates, with the same nailed/nailed/failed pattern across all eight. Run the test-time search: pass@8 ≈ pass@1 ≈ 28% (illustrative).

Now retrain with VPO. The advantage estimator credits rollouts that specialize on test 3, so the post-trained policy keeps probability mass on responses that nail test 3 (at the cost of being weaker on tests 1 and 2). Sample eight rollouts and they spread across all three reward peaks: some nail tests 1+2, some nail test 3, some land between. The search now has eight distinct shots and pass@8 climbs to 71% (illustrative), a ~2.5× lift on the same model and the same compute budget, just by not throwing away the vector structure of the reward.
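The arithmetic behind those illustrative numbers can be sketched with two idealized extremes. Both are modeling assumptions, not results from the paper: in one, the eight rollouts are exact copies, so extra tries add nothing; in the other, they are fully independent, so pass@k = 1 - (1 - p)^k for a per-sample success rate p.

```python
def pass_at_k_independent(p, k):
    """Probability that at least one of k independent rollouts succeeds."""
    return 1.0 - (1.0 - p) ** k

def pass_at_k_duplicates(p, k):
    """If all k rollouts are copies of one sample, extra tries add nothing."""
    return p

def per_sample_rate_for(target, k):
    """Per-sample success rate that independent rollouts need to hit `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / k)

K = 8
collapsed = pass_at_k_duplicates(0.28, K)   # the illustrative GRPO figure
needed_p = per_sample_rate_for(0.71, K)     # the illustrative VPO figure
diverse = pass_at_k_independent(needed_p, K)

print(f"collapsed policy: pass@1 = 0.28, pass@{K} = {collapsed:.2f}")
print(f"diverse policy:   pass@1 = {needed_p:.3f}, pass@{K} = {diverse:.2f}")
print(f"lift at k={K}: {diverse / collapsed:.2f}x")
```

Under the independence assumption, a per-sample success rate of roughly 14% is enough to reach 71% at k = 8. That is lower than the collapsed policy's 28% pass@1, which shows how a policy can trade single-shot accuracy for diversity and still win once search is in the loop. Real rollouts sit somewhere between the two extremes.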

Property | GRPO | VPO
Reward representation | Scalar (one number) | Vector (one per objective)
Advantage signal | Group-normalized scalar | Per-objective advantage preserved
Post-training entropy | Collapses to one mode | Maintained across multiple modes
Sampling at inference | Near-duplicate rollouts (varies by reward structure) | Specialized rollouts (varies by reward structure)
pass@k scaling | Plateaus quickly (setup-dependent, illustrative) | Climbs with k (varies by paper, illustrative)
Sweet spot | Single-objective tasks | Test-time search, evolutionary search

This is the same lever the temperature and top-K samplers try to pull at decode time — except those are bolted on after training, with no information about which objective each rollout specialized in. Decode-time diversity hacks can only re-shuffle the post-trained distribution; they can't put probability mass back where training already collapsed it. VPO moves the diversity question upstream into the post-training objective itself.
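A toy sketch of why decode-time knobs only reshape what training left behind. It treats three whole-response "modes" as if they were three outcomes of one softmax, which is a simplification, and the logits are hypothetical values chosen to mimic a collapsed policy.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely outcomes, zero the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical logits after scalar RL: the third mode (the test-3 specialist)
# has been pushed far below the other two.
collapsed_logits = [10.0, 9.5, -10.0]

p_default = softmax(collapsed_logits)
p_hot = softmax(collapsed_logits, temperature=2.0)
p_topk = top_k_filter(p_hot, k=2)

print("T=1.0:       ", p_default)
print("T=2.0:       ", p_hot)
print("T=2.0, top-2:", p_topk)
```

Raising the temperature increases the suppressed mode's probability, but only from a negligible value to a slightly less negligible one, and top-K with K = 2 removes it entirely. Neither knob knows that the third mode was the one covering a different objective.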

The paper's bigger claim is the one in the quote above: as test-time search becomes more standardized (AlphaEvolve-style evolution, best-of-N sampling, agentic planning that retries), preserving diversity stops being a hyperparameter and starts being a post-training objective. It is the same kind of methodological shift that moved the field from RLHF to RLVR: once an inference-time procedure becomes load-bearing, the post-training step has to be redesigned around what that procedure actually needs from the policy.

