CoPD paper — Reinforcement Learning with Verifiable Rewards (RLVR)
The news. On April 29, 2026, a research paper introduced CoPD — co-evolving policy distillation — a post-training recipe whose inner loop is RLVR. To understand the news, you have to understand RLVR first: it is the default first stage of post-training pipelines for modern reasoning models, and the gradient signal it produces is what every subsequent step (distillation, alignment) leans on. Read the paper →
Picture a stack of math homework. The grader doesn't read the student's reasoning, doesn't argue about elegance, doesn't grade on style — they just compare the boxed final answer to the printed answer key. Tick or cross. Move on. That is exactly what RLVR does to a language model. The model writes a full rollout (its worked answer, generated one token at a time), a tiny program checks the result, and the only signal that flows back is 0 or 1. No human in the loop, no learned reward model, no taste judgement.
The loop has four steps and they don't change. A prompt ("solve this", "implement this function", "prove this lemma") goes in. The model produces a rollout — a complete response, sampled autoregressively. A verifier — a unit-test runner, a symbolic equality check, a proof assistant — grades the rollout. The grade becomes a scalar reward. Gradient ascent on that reward updates the policy. Run that loop millions of times and the model's distribution shifts toward outputs the verifier accepts. The verifier is deterministic — the same rollout always gets the same grade — which is exactly what makes RLVR cheap, scalable, and unfakeable. There is no reward model to overfit to, no preference dataset to curate, no human reviewer to bottleneck on.
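To make the loop concrete, here is a minimal sketch in Python. The boxed-answer checker is the kind of tiny deterministic grader described above; the `policy` object, with its `sample` and `update` methods, is a hypothetical stand-in for whatever generation and gradient machinery your training stack provides (a REINFORCE-, GRPO-, or PPO-style step), not the CoPD paper's implementation.

```python
import re

def verify(rollout: str, answer_key: str) -> float:
    """Deterministic grader: pull the \\boxed{...} final answer out of the
    rollout and compare it to the answer key. Same rollout, same grade."""
    match = re.search(r"\\boxed\{([^}]*)\}", rollout)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == answer_key.strip() else 0.0

def rlvr_step(policy, prompts, answer_keys, rollouts_per_prompt=8):
    """One pass of the RLVR loop: sample rollouts, grade each with the
    verifier, and push the policy toward rollouts the verifier accepted.
    `policy.sample` / `policy.update` are assumed hooks into your RL stack."""
    for prompt, key in zip(prompts, answer_keys):
        for _ in range(rollouts_per_prompt):
            rollout = policy.sample(prompt)          # autoregressive generation
            reward = verify(rollout, key)            # 0 or 1, nothing learned
            policy.update(prompt, rollout, reward)   # gradient ascent on reward

# The verifier on its own is runnable:
print(verify(r"... carry the one, so the sum is \boxed{42}.", "42"))  # 1.0
print(verify(r"... therefore \boxed{41}.", "42"))                     # 0.0
```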
The catch is right there in the name: verifiable. RLVR works on tasks where "correct" is a yes-or-no decision a small program can make — math with a closed-form answer, code with a passing test suite, proofs that type-check. It does not work on open-ended writing, taste judgements, or anything fuzzy; for those, you still need RLHF with a learned reward model trained on human preferences. The CoPD paper doesn't replace RLVR; it builds on it: each of CoPD's parallel experts is trained with RLVR as its inner loop, and the contribution is in how the experts are combined afterwards.
What did RLVR replace, and why does it matter?
The previous default for shaping a model's behavior was RLHF — Reinforcement Learning from Human Feedback. The pipeline goes: collect a dataset of human preferences ("rollout A is better than rollout B"), train a small reward model on those judgements, then run RL against the reward model's score (the reward-model step is sketched after the list below). Three pain points compound:
- Cost. Human preference data is slow and expensive to collect. Every new domain needs fresh annotators.
- Drift. The reward model is a learned approximation. Once the policy starts climbing its score, it finds shortcuts the reward model didn't anticipate — reward hacking. Outputs that score high stop being outputs humans actually like.
- Fuzziness. "Is this answer good?" doesn't have a ground truth. Two careful reviewers disagree. The signal is inherently noisy.
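For contrast, here is roughly what the reward-model step of that RLHF pipeline looks like: a sketch assuming PyTorch and a hypothetical `reward_model` that maps batches of encoded rollouts to scalar scores (typically the LLM backbone with a scalar head bolted on). The pairwise objective shown is the standard Bradley–Terry preference loss.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    """Fit a reward model on human preference pairs: for each pair, push the
    score of the preferred rollout above the score of the rejected one.
    `chosen` and `rejected` are batches of encoded rollouts; `reward_model`
    returns one scalar score per rollout."""
    score_chosen = reward_model(chosen)      # shape: (batch,)
    score_rejected = reward_model(rejected)  # shape: (batch,)
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

Note the asymmetry with the verifier sketched earlier: this scorer is a learned approximation of human judgement, which is exactly what gives the policy something to game.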
RLVR flips all three. Math, code, and proofs have ground-truth checkers. The reward signal is deterministic — same rollout, same grade, every time — and the verifier itself can't be hacked because it's not a learned model. As frontier teams shifted focus from "be helpful and friendly" to "reason correctly through hard problems", RLVR became the cheaper, sharper, more scalable signal. It is now the default first stage of reasoning post-training pipelines at open-source and frontier labs alike.
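The same determinism carries over to code tasks. Below is a toy test-suite verifier in the spirit of the "unit-test runner" mentioned earlier: it runs a candidate function against fixed test cases and returns 1.0 only if every case passes. The test cases and candidate functions are made up for illustration; a real harness would additionally sandbox execution and enforce timeouts.

```python
def run_tests(candidate_fn, test_cases) -> float:
    """Unit-test verifier: all-or-nothing reward. The grade depends only on
    the candidate's behaviour on the fixed cases, so the same rollout always
    earns the same reward and there is no learned scorer to exploit."""
    try:
        for args, expected in test_cases:
            if candidate_fn(*args) != expected:
                return 0.0
    except Exception:
        return 0.0   # crashes count as failures, not partial credit
    return 1.0

# Grading a (hypothetical) generated implementation of integer addition:
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(run_tests(lambda a, b: a + b, tests))   # 1.0
print(run_tests(lambda a, b: a - b, tests))   # 0.0 (fails the first case)
```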
That is the foundation the CoPD paper rests on. The contribution of CoPD is not RLVR itself — RLVR is the inner loop. CoPD's contribution is how multiple RLVR-trained experts are combined so they don't suffer from inter-capability divergence (mixed-skill RLVR pulls FFN weights in conflicting directions) or behavioural-gap loss (the student drifts away from a frozen expert during distillation). Without RLVR, there's no foundation; without CoPD, there's no scalable way to combine RLVR experts across many skills.
Goes deeper in: LLM Internals → Generation → One Token at a Time