CoPD paper — Reinforcement Learning with Verifiable Rewards (RLVR)
The news. On April 29, 2026, a research paper introduced CoPD — co-evolving policy distillation — a post-training recipe whose inner loop is RLVR. To understand the news, you have to understand RLVR first: it is the default first stage of post-training pipelines for modern reasoning models, and the gradient signal it produces is what every subsequent step (distillation, alignment) leans on. Read the paper →
Picture a stack of math homework. The grader doesn't read the student's reasoning, doesn't argue about elegance, doesn't grade on style — they just compare the boxed final answer to the printed answer key. Tick or cross. Move on. That is exactly what RLVR does to a language model. The model writes a full rollout (its worked answer, generated one token at a time), a tiny program checks the result, and the only signal that flows back is 0 or 1. No human in the loop, no learned reward model, no taste judgement.
The loop has four steps and they don't change. A prompt ("solve this", "implement this function", "prove this lemma") goes in. The model produces a rollout — a complete response, sampled autoregressively. A verifier — a unit-test runner, a symbolic equality check, a proof assistant — grades the rollout. The grade becomes a scalar reward. Gradient ascent on that reward updates the policy. Run that loop millions of times and the model's distribution shifts toward outputs the verifier accepts. The verifier is deterministic — the same rollout always gets the same grade — which is exactly what makes RLVR cheap, scalable, and unfakeable. There is no reward model to overfit to, no preference dataset to curate, no human reviewer to bottleneck on.
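To make the loop concrete, here is a minimal sketch in Python. The boxed-answer checker is the kind of tiny deterministic grader described above; the `policy` object, with its `sample` and `update` methods, is a hypothetical stand-in for whatever generation and gradient machinery your training stack provides (a REINFORCE-, GRPO-, or PPO-style step), not the CoPD paper's implementation.

```python
import re

def verify(rollout: str, answer_key: str) -> float:
    """Deterministic grader: pull the \\boxed{...} final answer out of the
    rollout and compare it to the answer key. Same rollout, same grade."""
    match = re.search(r"\\boxed\{([^}]*)\}", rollout)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == answer_key.strip() else 0.0

def rlvr_step(policy, prompts, answer_keys, rollouts_per_prompt=8):
    """One pass of the RLVR loop: sample rollouts, grade each with the
    verifier, and push the policy toward rollouts the verifier accepted.
    `policy.sample` / `policy.update` are assumed hooks into your RL stack."""
    for prompt, key in zip(prompts, answer_keys):
        for _ in range(rollouts_per_prompt):
            rollout = policy.sample(prompt)          # autoregressive generation
            reward = verify(rollout, key)            # 0 or 1, nothing learned
            policy.update(prompt, rollout, reward)   # gradient ascent on reward

# The verifier on its own is runnable:
print(verify(r"... carry the one, so the sum is \boxed{42}.", "42"))  # 1.0
print(verify(r"... therefore \boxed{41}.", "42"))                     # 0.0
```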
The catch is right there in the name: verifiable. RLVR works on tasks where "correct" is a yes-or-no decision a small program can make — math with a closed-form answer, code with a passing test suite, proofs that type-check. It does not work on open-ended writing, taste judgements, or anything fuzzy; for those, you still need RLHF with a learned reward model trained on human preferences. The CoPD paper doesn't replace RLVR; it builds on it: each of CoPD's parallel experts is trained with RLVR as its inner loop, and the contribution is in how the experts are combined afterwards.
What did RLVR replace, and why does it matter?
The previous default for shaping a model's behavior was RLHF — Reinforcement Learning from Human Feedback. The pipeline goes: collect a dataset of human preferences ("rollout A is better than rollout B"), train a small reward model on those judgements, then run RL against the reward model's score (the reward-model step is sketched after the list below). Three pain points compound:
- Cost. Human preference data is slow and expensive to collect. Every new domain needs fresh annotators.
- Drift. The reward model is a learned approximation. Once the policy starts climbing its score, it finds shortcuts the reward model didn't anticipate — reward hacking. Outputs that score high stop being outputs humans actually like.
- Fuzziness. "Is this answer good?" doesn't have a ground truth. Two careful reviewers disagree. The signal is inherently noisy.
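For contrast, here is roughly what the reward-model step of that RLHF pipeline looks like: a sketch assuming PyTorch and a hypothetical `reward_model` that maps batches of encoded rollouts to scalar scores (typically the LLM backbone with a scalar head bolted on). The pairwise objective shown is the standard Bradley–Terry preference loss.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    """Fit a reward model on human preference pairs: for each pair, push the
    score of the preferred rollout above the score of the rejected one.
    `chosen` and `rejected` are batches of encoded rollouts; `reward_model`
    returns one scalar score per rollout."""
    score_chosen = reward_model(chosen)      # shape: (batch,)
    score_rejected = reward_model(rejected)  # shape: (batch,)
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

Note the asymmetry with the verifier sketched earlier: this scorer is a learned approximation of human judgement, which is exactly what gives the policy something to game.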
RLVR flips all three. Math, code, and proofs have ground-truth checkers. The reward signal is deterministic — same rollout, same grade, every time — and the verifier itself can't be hacked because it's not a learned model. As frontier teams shifted focus from "be helpful and friendly" to "reason correctly through hard problems", RLVR became the cheaper, sharper, more scalable signal. It is now the default first stage of reasoning post-training pipelines at open-source and frontier labs alike.
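The same determinism carries over to code tasks. Below is a toy test-suite verifier in the spirit of the "unit-test runner" mentioned earlier: it runs a candidate function against fixed test cases and returns 1.0 only if every case passes. The test cases and candidate functions are made up for illustration; a real harness would additionally sandbox execution and enforce timeouts.

```python
def run_tests(candidate_fn, test_cases) -> float:
    """Unit-test verifier: all-or-nothing reward. The grade depends only on
    the candidate's behaviour on the fixed cases, so the same rollout always
    earns the same reward and there is no learned scorer to exploit."""
    try:
        for args, expected in test_cases:
            if candidate_fn(*args) != expected:
                return 0.0
    except Exception:
        return 0.0   # crashes count as failures, not partial credit
    return 1.0

# Grading a (hypothetical) generated implementation of integer addition:
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(run_tests(lambda a, b: a + b, tests))   # 1.0
print(run_tests(lambda a, b: a - b, tests))   # 0.0 (fails the first case)
```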
That is the foundation the CoPD paper rests on. The contribution of CoPD is not RLVR itself — RLVR is the inner loop. CoPD's contribution is how multiple RLVR-trained experts are combined so they don't suffer from inter-capability divergence (mixed-skill RLVR pulls FFN weights in conflicting directions) or behavioural-gap loss (the student drifts away from a frozen expert during distillation). Without RLVR, there's no foundation; without CoPD, there's no scalable way to combine RLVR experts across many skills.
Goes deeper in: LLM Internals → Generation → One Token at a Time