CoPD paper — Co-evolving Policy Distillation between parallel experts

CoPD animation — three post-training recipes side by side: RLVR alone suffers gradient collision, RLVR plus frozen OPD drifts off-policy, and CoPD keeps every peer on-policy via mutual distillation.

The news. On April 29, 2026, a research paper introduced CoPD — co-evolving policy distillation — a post-training topology that trains N specialist LLMs in parallel as mutual teachers, each one simultaneously teacher and student via on-policy distillation at every step. The paper reports that CoPD beats the standard pipeline of RLVR plus a one-shot expert-to-student OPD pass on mixed math, code, and reasoning tasks at the same compute and the same base model, by fixing two named failure modes: inter-capability divergence and behavioural-gap loss. Read the paper →

Picture three specialists in a study group — one focused on math, one on code, one on reasoning — and they all start the week from the same shared textbook. The "lone crammer" alternative is one student trying to absorb every subject at once: when the math chapter rewrites the same page that the code chapter just rewrote, gains on one subject erase gains on the other. The CoPD group avoids that by giving each subject its own specialist. But here's the second twist: every morning the specialists swap their current notes and quiz each other. Nobody works from frozen lecture notes. That daily peer-quiz is the on-policy distillation step, and "co-evolving" is the property that no specialist is ever frozen — they all move together.

The two pathologies CoPD names are easiest to see by contrast. Inter-capability divergence is what bites the lone crammer. Most of the parameters in a transformer block live in the feed-forward network — every FFN weight is touched by every training token, so when you run RLVR on mixed-skill batches the gradient for math and the gradient for code update the same FFN weights, often in conflicting directions. Specialised experts sidestep this because each expert's FFN is shaped by one skill's gradient signal. Behavioural-gap loss is the failure of the more obvious fix — train N frozen experts, then distil them all into one student. The student starts well-calibrated, but as gradient updates start moving its weights its rollout distribution drifts away from where the frozen teachers were calibrated, so the distillation signal becomes off-policy and stale. CoPD's "every peer also moves" design keeps every distillation target on-policy at every step — there is no frozen teacher to drift away from.
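To make the gradient-collision picture concrete, here is a toy, self-contained sketch. It is not from the paper: the shapes and data are invented, and a single linear layer stands in for a shared FFN block. It measures the cosine similarity between two skills' gradients on the same weights; a negative value means a mixed-skill batch averages two opposing pulls and partially cancels both.

```python
# Toy illustration of inter-capability divergence (invented data, not the
# paper's setup): two "skills" compute gradients on the SAME shared FFN
# weights, and we check whether those gradients conflict.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
ffn = torch.nn.Linear(16, 16)  # stand-in for one shared FFN block

def skill_gradient(x, y):
    """Gradient of one skill's loss w.r.t. the shared FFN weights."""
    ffn.zero_grad()
    F.mse_loss(ffn(x), y).backward()
    return ffn.weight.grad.flatten().clone()

g_math = skill_gradient(torch.randn(8, 16), torch.randn(8, 16))  # "math" batch
g_code = skill_gradient(torch.randn(8, 16), torch.randn(8, 16))  # "code" batch

# Negative cosine = the two skills push the shared weights in conflicting
# directions; per-skill experts never average these gradients together.
cos = F.cosine_similarity(g_math, g_code, dim=0)
print(f"gradient cosine similarity: {cos.item():+.3f}")
```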

Diagram: one transformer block — Input x → LayerNorm 1 → Multi-Head Attention → + Residual → LayerNorm 2 → Feed-Forward Network → + Residual → Output x.

The paper sets CoPD up against two baselines, and the shape of the table is the easiest way to see what changes:

| Recipe | Models | Distillation | Failure mode it solves |
| --- | --- | --- | --- |
| RLVR alone | 1 policy, mixed-skill batches | None | Neither |
| RLVR + OPD | N frozen experts → 1 student | One-shot, post-RLVR | Inter-capability divergence |
| CoPD | N peers, mutual | On-policy, every step | + Behavioural-gap loss |

Worked example. Suppose you train 3 experts (math, code, reasoning) in parallel for 10 000 steps each. The standard RLVR + OPD pipeline does 10 000 RLVR steps per expert — about 30 000 expert-steps total — then a separate one-shot OPD pass that runs the frozen experts against the student. CoPD instead does the same 30 000 expert-steps, but on each step every expert spends a fraction of its compute on on-policy distillation against the current peers. Total compute is roughly equal; what changes is the direction of the gradient, not its magnitude. The peers act as a regulariser against single-skill divergence and as an always-fresh teacher against behavioural-gap drift.
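A minimal sketch of what that per-step redirection could look like in code, assuming a simple KL penalty toward each live peer on the expert's own rollouts. `copd_step`, `rlvr_loss_fn`, and the 0.1 weight are illustrative names and choices, not the paper's exact objective.

```python
# Hypothetical CoPD-style step: every expert is both teacher and student,
# and the distillation targets are the CURRENT peers, never frozen copies.
import torch
import torch.nn.functional as F

def copd_step(experts, optimizers, rlvr_loss_fn, batches, distill_weight=0.1):
    """One mutual-distillation step (illustrative helper, not the paper's API).

    experts      : list of nn.Modules mapping token ids -> logits
    rlvr_loss_fn : callable(model, batch) -> scalar RLVR surrogate loss
    batches      : one batch per expert, sampled from that expert's OWN
                   current rollouts, so the distillation term stays on-policy
    """
    for i, (model, opt, batch) in enumerate(zip(experts, optimizers, batches)):
        loss = rlvr_loss_fn(model, batch)              # usual RLVR objective
        student_logp = F.log_softmax(model(batch["tokens"]), dim=-1)

        # Mutual distillation: every *current* peer teaches this expert.
        for j, peer in enumerate(experts):
            if j == i:
                continue
            with torch.no_grad():                      # peers teach without
                peer_p = F.softmax(peer(batch["tokens"]), dim=-1)  # gradient
            loss = loss + distill_weight * F.kl_div(
                student_logp, peer_p, reduction="batchmean")

        opt.zero_grad()
        loss.backward()
        opt.step()  # every expert moves each step: no frozen teacher to drift from
```

The key design point is in the inner loop: the peers are evaluated fresh on the student's own rollouts each step, so the distillation signal can never go stale the way a one-shot pass against frozen experts does.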

One caveat is worth flagging because the vocabulary collides. CoPD's "parallel experts" are separate full models trained side by side — they are not Mixture-of-Experts layers inside one model. MoE is an inference-time architecture where one model contains many small expert sub-networks and a router activates only a few per token. CoPD is a post-training topology — it changes how training runs are wired together, not what the inference-time model looks like. Reusing the MoE vocabulary here would be a category error.
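To make the contrast concrete, here is a toy routed layer, a minimal sketch of what the MoE vocabulary actually names (invented shapes, simple top-k routing): one model, many small expert sub-networks, a router activating k of them per token. CoPD has no such layer; its experts are whole separate models.

```python
# Toy MoE layer for contrast only: an inference-time architecture inside
# ONE model, not CoPD's training topology. All shapes are invented.
import torch
import torch.nn.functional as F

class TinyMoELayer(torch.nn.Module):
    def __init__(self, d=16, n_experts=4, k=2):
        super().__init__()
        self.router = torch.nn.Linear(d, n_experts)
        self.experts = torch.nn.ModuleList(
            [torch.nn.Linear(d, d) for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                              # x: (tokens, d)
        weights = F.softmax(self.router(x), dim=-1)    # (tokens, n_experts)
        topw, topi = weights.topk(self.k, dim=-1)      # activate k per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e              # tokens routed to e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```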

Goes deeper in: LLM Internals → Transformer Block → The Feed-Forward Network
