What is Vector Policy Optimization (VPO)?

VPO is a drop-in replacement for the GRPO advantage estimator used in modern LLM RL post-training. Instead of collapsing the reward into a single scalar before computing the advantage, VPO keeps the reward as a vector across its component objectives — per-test-case correctness, multiple personas, multiple reward models — and trains different rollouts to specialize in different reward dimensions. The resulting policy stays diverse, so inference-time search procedures like pass@k and best@k keep climbing as the search budget grows. The paper reports VPO matches or beats the strongest scalar RL baselines on pass@k and best@k across four tasks, with the advantage widening as the search budget grows.

How does VPO differ from GRPO?

GRPO normalizes the scalar reward within a group of rollouts and uses that as the advantage — the policy converges to whatever response wins the average score. VPO preserves the per-objective structure of the reward: each rollout gets credited for the specific objective it specialized in, not for its average. Same training rollouts, same compute budget, but the post-trained policy holds onto multimodal solution distributions instead of collapsing them. The training pass looks almost identical from the outside; the difference is internal — where GRPO collapses, VPO preserves.
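To make the scalar side of that contrast concrete, here is a minimal sketch of a group-normalized advantage, assuming the common formulation in which each rollout's scalar reward is z-scored within its group. The function name, the use of a plain mean to scalarize, the population standard deviation, and the toy reward values are all assumptions for illustration; real implementations differ in these details.

```python
from statistics import mean, pstdev

def grpo_advantages(scalar_rewards, eps=1e-8):
    """Z-score each rollout's scalar reward within its group (illustrative sketch)."""
    mu = mean(scalar_rewards)
    sigma = pstdev(scalar_rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in scalar_rewards]

# Four rollouts scored on three unit tests (1 = pass, 0 = fail). Made-up values.
vector_rewards = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],  # the only rollout that passes test 3
    [0, 0, 0],
]

# Scalar RL first collapses each reward vector to one number (here: the mean).
scalar_rewards = [mean(v) for v in vector_rewards]
advantages = grpo_advantages(scalar_rewards)

for rewards, adv in zip(vector_rewards, advantages):
    print(rewards, round(adv, 3))
```

In this toy group, the one rollout that passes test 3 ends up with a below-average scalar and therefore a negative advantage, which is the kind of signal that pushes a policy away from a minority mode.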

When does VPO matter most?

VPO matters most when inference-time search is the load-bearing piece of the system. Best-of-N sampling, AlphaEvolve-style evolutionary search, agentic planning with retries — anything where you sample many rollouts and pick the best one — only pays off if the rollouts are actually distinct. If the post-trained policy has collapsed, those k samples are near-duplicates and the search budget is wasted. For single-objective tasks where you only ever sample once, scalar RL is fine; VPO's advantage widens specifically as k grows. The paper goes further and claims that for evolutionary search, VPO unlocks problems that GRPO models cannot solve at all.

VPO paper — Vector-reward advantage vs GRPO scalar collapse

learnaivisually.com/ai-explained/vpo-vector-reward-vs-grpo

Jargon

VPO: Vector Policy Optimization. A drop-in replacement for the GRPO advantage estimator that treats reward as a vector across multiple objectives (per-test-case correctness, multiple personas, multiple reward models) and trains the policy to specialize across them.
GRPO: Group Relative Policy Optimization. A common advantage estimator for modern LLM RL: it collapses the reward into one scalar, normalizes it within a group of rollouts, and uses that as the advantage signal. The VPO paper positions its vector estimator as a drop-in replacement at exactly this step.
advantage estimator: The piece of an RL algorithm that turns raw rewards into the signal that scales each rollout's policy-gradient update. GRPO normalizes inside a group of rollouts; VPO preserves the per-objective structure.
vector-valued reward: A reward signal that returns multiple numbers, one per objective — e.g. one bit per unit test, or one score per evaluator persona. The natural shape in practice, which scalar RL throws away by averaging.
entropy collapse: The failure mode where the post-trained policy puts all its probability on one type of response. Sampling more rollouts just produces near-duplicates — there is nothing left for an inference-time search procedure to exploit.
pass@k / best@k: Inference-time search metrics: sample k rollouts, then ask "did any pass?" (pass@k) or "what's the best score?" (best@k). They only climb with k if the policy stays diverse — otherwise the k samples are near-duplicates.
RLVR: Reinforcement Learning with Verifiable Rewards, a widely used post-training setup for reasoning models in which a deterministic checker grades each rollout 0 or 1.
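The pass@k and best@k entries above can be sketched in a few lines. The two direct functions follow the glossary definitions. The third is the standard combinatorial estimator often used to report pass@k from n ≥ k samples; it is not described in this article and is included only as a commonly used convention. All function names are hypothetical.

```python
from math import comb

def pass_at_k(passed_flags):
    """pass@k for one problem: did any of the k sampled rollouts pass?"""
    return any(passed_flags)

def best_at_k(scores):
    """best@k for one problem: the best score among the k sampled rollouts."""
    return max(scores)

def pass_at_k_estimate(n, c, k):
    """Estimate pass@k from n samples of which c passed (requires n >= k).

    Probability that a random size-k subset of the n samples contains
    at least one passing sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy usage with made-up results for k = 4 rollouts.
print(pass_at_k([False, False, True, False]))
print(best_at_k([0.2, 0.9, 0.4, 0.4]))
print(pass_at_k_estimate(n=10, c=5, k=1))
```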

The news. On May 21, 2026, a research paper introduced Vector Policy Optimization (VPO) — a drop-in replacement for GRPO's advantage estimator. Across four tasks the authors report VPO matches or beats the strongest scalar RL baselines on pass@k and best@k, and the advantage widens as the search budget grows. The paper's bigger framing argument: "As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective."

Picture a new restaurant trying to win a Michelin star. The chef's first instinct is to drill one signature dish until it's flawless — a single tasting note, executed to the average critic's exact preference. That instinct is exactly GRPO: collapse every signal into one scalar, push every rollout toward whatever wins on that scalar, and stop when the chef can't tell two plates apart. The problem isn't the technique — it's the assumption that there is one critic. The actual reviewers come from different magazines, on different nights, with different tastes. A restaurant that only knows how to make one perfect plate is going to lose on three of the four nights.

A vector reward is the honest version of this: the reward isn't a scalar, it's a tuple. One axis per unit test in a coding benchmark. One axis per persona in a chat eval. One axis per evaluator model in an LLM-judge ensemble. Scalar RL throws all that structure away — it averages, weights, or hand-tunes the vector into a single number, then optimizes that. The policy converges on whatever response wins the average. Now sample eight rollouts from your post-trained model and ask "did any of them pass test case #3?" If your post-trained distribution has collapsed, the eight rollouts are near-duplicates of the average winner. The search has eight tries and zero diversity to search over.
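A small sketch of what averaging discards, assuming binary per-test rewards and a plain mean as the scalarizer. The rollout values and helper names are invented for illustration.

```python
def scalarize(vector_reward):
    """Collapse a reward vector to one number by averaging (one common choice)."""
    return sum(vector_reward) / len(vector_reward)

def covered_objectives(rollouts):
    """Indices of objectives that at least one rollout in the group satisfies."""
    n_objectives = len(rollouts[0])
    return {j for j in range(n_objectives) if any(r[j] for r in rollouts)}

# Eight near-duplicate rollouts: all pass tests 1 and 2, all fail test 3.
collapsed = [(1, 1, 0)] * 8

# Eight rollouts with the same per-rollout average, spread over different tests.
diverse = [(1, 1, 0)] * 4 + [(0, 1, 1)] * 2 + [(1, 0, 1)] * 2

for name, group in (("collapsed", collapsed), ("diverse", diverse)):
    scalars = [scalarize(r) for r in group]
    print(name, "mean scalar:", sum(scalars) / len(scalars),
          "objectives covered:", sorted(covered_objectives(group)))
```

Both groups look identical through the scalar, yet only the second one contains a rollout that passes the third test (index 2 in the code).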

VPO fixes the loss function at exactly the spot the collapse happens: the advantage estimator. GRPO normalizes the scalar reward within a group of rollouts and uses that as the advantage. VPO keeps the per-objective structure of the reward — each rollout gets credited for which objective it specifically won on, not for its average score — and the resulting gradient pushes different rollouts toward different specializations. The training pass looks almost identical to GRPO from the outside: same rollouts, same policy network, same compute budget. The difference is internal: where GRPO collapses, VPO preserves.
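The article does not give the paper's estimator, so the following is not VPO's formula. It is a hypothetical illustration of what "crediting a rollout for the objective it won on" could look like: z-score each objective separately within the group, then credit each rollout with its best per-objective score. Both steps, and every name below, are assumptions made for this sketch.

```python
from statistics import mean, pstdev

def per_objective_advantages(vector_rewards, eps=1e-8):
    """Z-score each objective column within the group (hypothetical, not the paper's formula)."""
    columns = list(zip(*vector_rewards))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(r - mu) / (sigma + eps) for r, (mu, sigma) in zip(rollout, stats)]
        for rollout in vector_rewards
    ]

def specialist_credit(vector_rewards):
    """Credit each rollout with its best per-objective advantage (an illustrative choice)."""
    return [max(row) for row in per_objective_advantages(vector_rewards)]

# Four rollouts scored on three unit tests (1 = pass, 0 = fail). Made-up values.
group = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],  # the only rollout that passes test 3
    [0, 0, 0],
]

credit = specialist_credit(group)
for rewards, c in zip(group, credit):
    print(rewards, round(c, 3))
```

Under this toy rule the lone test-3 specialist receives positive credit even though its average score is below the group mean, while the rollout that passes nothing stays negative.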

Where the difference actually shows up

Hold three variables fixed: three reward axes (say, three unit tests in a coding task), eight rollouts sampled at inference time, and the same post-trained model size and training budget as a GRPO baseline.

With GRPO and a scalar mean across the three axes, the post-trained policy converges to the response that wins the average. Say it nails test 1 and test 2 but fails test 3. Sample eight rollouts from that policy and they're near-duplicates, with the same nailed/nailed/failed pattern across all eight. Run the test-time search: pass@8 ≈ pass@1 ≈ 28% (illustrative).

Now retrain with VPO. The advantage estimator credits rollouts that specialize on test 3, so the post-trained policy keeps probability mass on responses that nail test 3 (at the cost of being weaker on tests 1 and 2). Sample eight rollouts and they spread across all three reward peaks: some nail tests 1+2, some nail test 3, some land between. The search now has eight distinct shots and pass@8 climbs to 71% (illustrative). That is a ~2.5× lift on the same model and the same compute budget, just by not throwing away the vector structure of the reward.
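Why near-duplicate samples flatten pass@k can be seen from two idealized extremes. This sketch assumes a single per-sample success probability p and compares perfectly correlated samples with fully independent ones. Neither extreme is claimed to match the paper's setup, and the 28% and 71% figures above are illustrative rather than derived from these formulas.

```python
def pass_at_k_duplicates(p, k):
    """All k samples are copies of one draw: extra samples add nothing."""
    return p

def pass_at_k_independent(p, k):
    """k independent samples, each passing with probability p."""
    return 1.0 - (1.0 - p) ** k

p = 0.28  # per-sample success rate, borrowed from the illustrative pass@1 above
for k in (1, 2, 4, 8):
    print(k,
          round(pass_at_k_duplicates(p, k), 3),
          round(pass_at_k_independent(p, k), 3))

# The illustrative lift quoted in the text: 71% versus 28%.
lift = 0.71 / 0.28
print(round(lift, 2))
```

A real policy sits somewhere between the two curves; the more its samples differ, the closer it gets to the independent case.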

Property	GRPO	VPO
Reward representation	Scalar (one number)	Vector (one per objective)
Advantage signal	Group-normalized scalar	Per-objective advantage preserved
Post-training entropy	Collapses to one mode	Maintained across multiple modes
Sampling at inference	Near-duplicate rollouts (varies by reward structure)	Specialized rollouts (varies by reward structure)
pass@k scaling	Plateaus quickly (setup-dependent, illustrative)	Climbs with k (varies by paper, illustrative)
Sweet spot	Single-objective tasks	Test-time search, evolutionary search

This is the same lever the temperature and top-K samplers try to pull at decode time — except those are bolted on after training, with no information about which objective each rollout specialized in. Decode-time diversity hacks can only re-shuffle the post-trained distribution; they can't put probability mass back where training already collapsed it. VPO moves the diversity question upstream into the post-training objective itself.
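A toy sketch of those two decode-time knobs, assuming three "response modes" with made-up logits from a policy that has already collapsed onto the first one. Temperature rescales the logits uniformly and top-k can only drop options; neither has any notion of which objective a mode serves.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = flatter)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely options and renormalize; the rest get zero."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

collapsed_logits = [10.0, 0.0, -10.0]  # made-up: mode 0 dominates, mode 2 is nearly gone

base = softmax_with_temperature(collapsed_logits, temperature=1.0)
warm = softmax_with_temperature(collapsed_logits, temperature=2.0)
trimmed = top_k_filter(warm, k=2)

for name, probs in (("T=1", base), ("T=2", warm), ("T=2, top-2", trimmed)):
    print(name, [f"{p:.2e}" for p in probs])
```

Doubling the temperature in this toy case still leaves the third mode with negligible mass, and top-k then removes it entirely. Pushing the temperature much higher would flatten every option indiscriminately rather than restore the specific mode that training suppressed.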

The paper's bigger claim is the one in the quote above: as test-time search becomes more standardized (AlphaEvolve-style evolution, best-of-N sampling, agentic planning that retries), preserving diversity stops being a hyperparameter and starts being a post-training objective. It is the same kind of methodological shift that moved the field from RLHF to RLVR: once an inference-time procedure becomes load-bearing, the post-training step has to be redesigned around what that procedure actually needs from the policy.

Related explainers

RLVR: Reinforcement Learning with Verifiable Rewards, the post-training stage whose advantage estimator VPO replaces
CoPD: co-evolving policy distillation, an alternative answer to "how do we get specialized post-training without collapse"
PPOW: window-level RL for speculative drafters, a different fix for credit-assignment problems in long-rollout RL
TIM: Training-Inference Mismatch, a related diagnostic on what goes wrong when the training and inference distributions drift apart
