The news. On June 29, 2026, the paper "DOPD: Dual On-policy Distillation" (arXiv 2606.30626) proposed an advantage-aware distillation scheme that routes token-level supervision between a privileged teacher and a privileged student, chosen by their advantage gap. It targets a failure mode the authors name the “privilege illusion,” and reports that DOPD consistently beats vanilla on-policy distillation across stability, robustness, continual learning, and out-of-distribution tasks, for both language and vision-language models.

Picture a tutor who is not just smart but is also holding the answer key — the one you will never get on exam day. If you simply copy every answer the tutor gives, some of what you copy is real reasoning you could learn, and some of it is only there because they can see the key. Copy both blindly and you train yourself to imitate someone with information you will never have. That is the privilege illusion, and it is exactly what plain on-policy distillation walks into.

On-policy distillation is a workhorse of model compression: let a small student generate its own attempts, then have a stronger teacher correct them token by token. The trouble is that the teacher often runs with privileged context — a gold answer, an extra document, a hint — that the deployed student will not have. Where the teacher’s edge comes from that context rather than from real skill, the correction pushes the student toward an answer it structurally cannot reproduce. The gradient looks like teaching; it is really chasing a ghost.
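To make "correct them token by token" concrete, here is a minimal sketch of a vanilla on-policy distillation loss. It assumes one common formulation, a per-token reverse KL divergence between the student's and the teacher's next-token distributions, averaged with equal weight on every token. The function names and the toy logits are hypothetical, and the paper's exact loss may differ.

```python
import math

def softmax(logits):
    """Turn a list of raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at a single token position."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def vanilla_opd_loss(student_logits, teacher_logits):
    """Average the per-token divergence, weighting every token equally.

    Assumption: this equal weighting is what "vanilla" means here; it is
    a sketch of the idea, not the paper's implementation.
    """
    per_token = [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)

# Hypothetical logits over a 3-word vocabulary at two token positions.
# Position 0: student and teacher agree. Position 1: they disagree.
student_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]]
teacher_logits = [[2.0, 0.5, 0.1], [3.0, 0.1, 0.2]]
loss = vanilla_opd_loss(student_logits, teacher_logits)
```

Nothing in this loss asks why the teacher disagrees at position 1. If the disagreement comes from privileged context, the student is still pushed toward it with full weight.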

Here is DOPD’s move, and it is the whole idea: add a second reference — the same student, but allowed one peek at the privileged context. Now, for each token, you can ask a sharp question instead of a blurry one. If the peek-enabled student closes most of the gap on its own, the teacher’s advantage there was just the answer key, so DOPD steps back. If the gap survives even with the peek, that is real capability the student is missing, so DOPD drills it hard. The routing reads each token’s probabilities and advantages — how much better a policy’s choice is on that token — and sends the supervision to whichever privileged policy is trustworthy, which matters because only some tokens carry the real skill signal at all.
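The routing idea can be sketched as a per-token decision on the advantage gap. Everything specific below is an assumption for illustration: the `TokenSignal` container, the `route` function, the hard threshold of 0.3, and the choice to fall back to the privileged student when the gap is small are all hypothetical, and the paper's actual rule may be soft, learned, or shaped differently.

```python
from dataclasses import dataclass

@dataclass
class TokenSignal:
    teacher_adv: float  # advantage of the privileged teacher on this token
    peek_adv: float     # advantage of the privileged (peeking) student

def route(signal, gap_threshold=0.3):
    """Pick which privileged policy supervises this token.

    Hypothetical rule: if a large gap survives the peek, the teacher
    holds real skill the student lacks, so supervise from the teacher.
    Otherwise the teacher's edge was mostly the privileged context, so
    fall back to the privileged student.
    """
    surviving_gap = signal.teacher_adv - signal.peek_adv
    if surviving_gap >= gap_threshold:
        return "teacher"
    return "privileged_student"

# Two illustrative tokens with the same teacher advantage.
tokens = {
    "A": TokenSignal(teacher_adv=0.9, peek_adv=0.85),  # peek closes the gap
    "B": TokenSignal(teacher_adv=0.9, peek_adv=0.30),  # gap survives the peek
}
decisions = {name: route(sig) for name, sig in tokens.items()}
```

The point of the sketch is the shape of the decision: it needs both privileged references, because the teacher's advantage alone cannot say whether it comes from skill or from information.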

Walk two tokens through it (illustrative numbers). Say the teacher’s advantage is 0.9 on each of two tokens, A and B. You let the privileged student peek. On token A it reaches 0.85, so the gap that survives the peek is only 0.05: roughly 94% of the teacher’s edge there was the answer key, and DOPD down-weights it. On token B the peeking student reaches only 0.3, leaving a 0.6 gap that the key did not explain. That is real capability, and DOPD trains hard on exactly that token. Same teacher advantage, opposite treatment, because the peek separated skill from information.
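The arithmetic in that walk-through is small enough to check directly. The helper below is a hypothetical illustration, not code from the paper: it splits a teacher advantage into the gap that survives the peek and the share that the privileged context alone explains.

```python
def split_advantage(teacher_adv, peek_adv):
    """Return (surviving_gap, key_share) for one token.

    surviving_gap: what the teacher still has over the peeking student.
    key_share: fraction of the teacher's advantage that the peek alone
    recovers, i.e. the part attributable to the answer key.
    """
    surviving_gap = teacher_adv - peek_adv
    key_share = peek_adv / teacher_adv
    return surviving_gap, key_share

# Illustrative numbers from the walk-through: same teacher advantage,
# different privileged-student advantage.
for name, peek_adv in [("A", 0.85), ("B", 0.30)]:
    gap, share = split_advantage(0.9, peek_adv)
    print(f"token {name}: surviving gap = {gap:.2f}, key share = {share:.0%}")

# prints:
# token A: surviving gap = 0.05, key share = 94%
# token B: surviving gap = 0.60, key share = 33%
```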

| Method | Reference it copies | Per-token decision | What the student ends up learning |
| --- | --- | --- | --- |
| Vanilla OPD | one privileged teacher | none: imitate every token equally | real skill and un-reproducible answer-key habits (the privilege illusion) |
| DOPD (arXiv 2606.30626) | a privileged teacher and a privileged student | route by advantage gap: drill real gaps, ease off information gaps | transferable capability, not an answer key it can never hold |

Because the student stops chasing signals it can never reproduce, it comes out steadier — the paper reports DOPD beating vanilla OPD across stability, robustness, continual learning, and out-of-distribution tasks, and it holds for vision-language models as well as text. The headline is not a single benchmark number; it is that telling a skill gap apart from an information gap, token by token, is what makes the distillation transfer.
