CoPD paper — Co-evolving Policy Distillation between parallel experts

CoPD animation — three post-training recipes side by side: RLVR alone suffers gradient collision, RLVR plus frozen OPD drifts off-policy, and CoPD keeps every peer on-policy via mutual distillation.

The news. On April 29, 2026, a research paper introduced CoPD — co-evolving policy distillation — a post-training topology that trains N specialist LLMs in parallel as mutual teachers, each one simultaneously teacher and student via on-policy distillation at every step. The paper reports that CoPD beats the standard pipeline of RLVR plus a one-shot expert-to-student OPD pass on mixed math, code, and reasoning tasks at the same compute and the same base model, by fixing two named failure modes: inter-capability divergence and behavioural-gap loss. Read the paper →

Picture three specialists in a study group — one focused on math, one on code, one on reasoning — and they all start the week from the same shared textbook. The "lone crammer" alternative is one student trying to absorb every subject at once: when the math chapter rewrites the same page that the code chapter just rewrote, gains on one subject erase gains on the other. The CoPD group avoids that by giving each subject its own specialist. But here's the second twist: every morning the specialists swap their current notes and quiz each other. Nobody works from frozen lecture notes. That daily peer-quiz is the on-policy distillation step, and "co-evolving" is the property that no specialist is ever frozen — they all move together.

The two pathologies CoPD names are easiest to see by contrast. Inter-capability divergence is what bites the lone crammer. Most of the parameters in a transformer block live in the feed-forward network — every FFN weight is touched by every training token, so when you run RLVR on mixed-skill batches the gradient for math and the gradient for code update the same FFN weights, often in conflicting directions. Specialised experts sidestep this because each expert's FFN is shaped by one skill's gradient signal. Behavioural-gap loss is the failure of the more obvious fix — train N frozen experts, then distil them all into one student. The student starts well-calibrated, but as gradient updates start moving its weights its rollout distribution drifts away from where the frozen teachers were calibrated, so the distillation signal becomes off-policy and stale. CoPD's "every peer also moves" design keeps every distillation target on-policy at every step — there is no frozen teacher to drift away from.
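To make the gradient-collision picture concrete, here is a toy, self-contained sketch. It is not from the paper: the shapes and data are invented, and a single linear layer stands in for a shared FFN block. It measures the cosine similarity between two skills' gradients on the same weights; a negative value means a mixed-skill batch averages two opposing pulls and partially cancels both.

```python
# Toy illustration of inter-capability divergence (invented data, not the
# paper's setup): two "skills" compute gradients on the SAME shared FFN
# weights, and we check whether those gradients conflict.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
ffn = torch.nn.Linear(16, 16)  # stand-in for one shared FFN block

def skill_gradient(x, y):
    """Gradient of one skill's loss w.r.t. the shared FFN weights."""
    ffn.zero_grad()
    F.mse_loss(ffn(x), y).backward()
    return ffn.weight.grad.flatten().clone()

g_math = skill_gradient(torch.randn(8, 16), torch.randn(8, 16))  # "math" batch
g_code = skill_gradient(torch.randn(8, 16), torch.randn(8, 16))  # "code" batch

# Negative cosine = the two skills push the shared weights in conflicting
# directions; per-skill experts never average these gradients together.
cos = F.cosine_similarity(g_math, g_code, dim=0)
print(f"gradient cosine similarity: {cos.item():+.3f}")
```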

Diagram: one transformer block — Input x → LayerNorm 1 → Multi-Head Attention → + Residual → LayerNorm 2 → Feed-Forward Network → + Residual → Output x.

The paper sets CoPD up against two baselines, and the shape of the table is the easiest way to see what changes:

| Recipe | Models | Distillation | Failure mode it solves |
| --- | --- | --- | --- |
| RLVR alone | 1 policy, mixed-skill batches | None | Neither |
| RLVR + OPD | N frozen experts → 1 student | One-shot, post-RLVR | Inter-capability divergence |
| CoPD | N peers, mutual | On-policy, every step | + Behavioural-gap loss |

Worked example. Suppose you train 3 experts (math, code, reasoning) in parallel for 10 000 steps each. The standard RLVR + OPD pipeline does 10 000 RLVR steps per expert — about 30 000 expert-steps total — then a separate one-shot OPD pass that runs the frozen experts against the student. CoPD instead does the same 30 000 expert-steps, but on each step every expert spends a fraction of its compute on on-policy distillation against the current peers. Total compute is roughly equal; what changes is the direction of the gradient, not its magnitude. The peers act as a regulariser against single-skill divergence and as an always-fresh teacher against behavioural-gap drift.
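A minimal sketch of what that per-step redirection could look like in code, assuming a simple KL penalty toward each live peer on the expert's own rollouts. `copd_step`, `rlvr_loss_fn`, and the 0.1 weight are illustrative names and choices, not the paper's exact objective.

```python
# Hypothetical CoPD-style step: every expert is both teacher and student,
# and the distillation targets are the CURRENT peers, never frozen copies.
import torch
import torch.nn.functional as F

def copd_step(experts, optimizers, rlvr_loss_fn, batches, distill_weight=0.1):
    """One mutual-distillation step (illustrative helper, not the paper's API).

    experts      : list of nn.Modules mapping token ids -> logits
    rlvr_loss_fn : callable(model, batch) -> scalar RLVR surrogate loss
    batches      : one batch per expert, sampled from that expert's OWN
                   current rollouts, so the distillation term stays on-policy
    """
    for i, (model, opt, batch) in enumerate(zip(experts, optimizers, batches)):
        loss = rlvr_loss_fn(model, batch)              # usual RLVR objective
        student_logp = F.log_softmax(model(batch["tokens"]), dim=-1)

        # Mutual distillation: every *current* peer teaches this expert.
        for j, peer in enumerate(experts):
            if j == i:
                continue
            with torch.no_grad():                      # peers teach without
                peer_p = F.softmax(peer(batch["tokens"]), dim=-1)  # gradient
            loss = loss + distill_weight * F.kl_div(
                student_logp, peer_p, reduction="batchmean")

        opt.zero_grad()
        loss.backward()
        opt.step()  # every expert moves each step: no frozen teacher to drift from
```

The key design point is in the inner loop: the peers are evaluated fresh on the student's own rollouts each step, so the distillation signal can never go stale the way a one-shot pass against frozen experts does.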

One caveat is worth flagging because the vocabulary collides. CoPD's "parallel experts" are separate full models trained side by side — they are not Mixture-of-Experts layers inside one model. MoE is an inference-time architecture where one model contains many small expert sub-networks and a router activates only a few per token. CoPD is a post-training topology — it changes how training runs are wired together, not what the inference-time model looks like. Reusing the MoE vocabulary here would be a category error.
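To make the contrast concrete, here is a toy routed layer, a minimal sketch of what the MoE vocabulary actually names (invented shapes, simple top-k routing): one model, many small expert sub-networks, a router activating k of them per token. CoPD has no such layer; its experts are whole separate models.

```python
# Toy MoE layer for contrast only: an inference-time architecture inside
# ONE model, not CoPD's training topology. All shapes are invented.
import torch
import torch.nn.functional as F

class TinyMoELayer(torch.nn.Module):
    def __init__(self, d=16, n_experts=4, k=2):
        super().__init__()
        self.router = torch.nn.Linear(d, n_experts)
        self.experts = torch.nn.ModuleList(
            [torch.nn.Linear(d, d) for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                              # x: (tokens, d)
        weights = F.softmax(self.router(x), dim=-1)    # (tokens, n_experts)
        topw, topi = weights.topk(self.k, dim=-1)      # activate k per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e              # tokens routed to e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```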

Goes deeper in: LLM Internals → Transformer Block → The Feed-Forward Network
