What is dual on-policy distillation (DOPD)?

DOPD (arXiv 2606.30626) is an advantage-aware distillation method that trains a small student to imitate a stronger teacher token by token, but decides per token how much to trust the teacher. It compares a privileged teacher and a privileged student — the same student allowed to see the extra context — and routes each token's supervision toward whichever carries the more trustworthy advantage signal, so the student learns transferable skill rather than an answer key it can never reproduce.

What is the 'privilege illusion' it fixes?

The privilege illusion is when privileged context — a gold answer, a hint, an extra document that only the teacher sees — hides which gap the student is really facing. From the outside a genuine capability gap (real skill to learn) and an information gap (an edge the student can never reproduce) look identical. Plain on-policy distillation copies both equally, so the student wastes training signal learning to mimic information it will never have at deployment.

How does DOPD relate to plain on-policy distillation?

Vanilla on-policy distillation (OPD) has the student generate its own attempts and imitate one privileged teacher's corrections on every token. DOPD keeps the on-policy setup but adds a second reference and a per-token router: where the teacher's edge survives a peek at the privileged context it is treated as real capability and drilled hard, and where the edge vanishes with the peek it is treated as pure information asymmetry and eased off. The paper reports it consistently beats vanilla OPD on stability, robustness, continual learning, and out-of-distribution tasks, for both LLMs and VLMs.

DOPD dodges the 'privilege illusion' — Dual on-policy distillation

Jargon

On-policy distillation (OPD): The student learns from a teacher’s corrections on the student’s own attempts, not on the teacher’s finished answers. Training on its own rollouts keeps the lessons relevant to what the student actually does.
Privileged information: Extra context handed to a model that it will not have when it is actually deployed — a hint, a gold answer, a reference document. It makes the model look stronger without teaching it anything reusable.
Privilege illusion: The paper’s name for the trap where privileged context hides which gap the student is really facing — a genuine skill gap it should close, or a pure information gap it can only mimic.
Capability gap vs information asymmetry: Two reasons a teacher can beat the student on a token. A capability gap is transferable skill the student can learn; an information asymmetry is an edge the student can never reproduce because it lacks the privileged context.
Advantage-aware routing: DOPD compares each token’s probabilities and advantages, then sends its supervision toward whichever privileged policy carries the more trustworthy signal — strong where the gap is real, gentle where it is just the answer key.
VLM (vision-language model): A model that takes images plus text. DOPD is tested on both plain LLMs and VLMs, so the routing idea is not tied to one modality.

The news. On June 29, 2026, the DOPD: Dual On-policy Distillation paper (arXiv 2606.30626) proposed an advantage-aware distillation scheme that routes token-level supervision between a privileged teacher and a privileged student, chosen by their advantage gap. It targets a failure mode the authors name the “privilege illusion,” and reports that DOPD consistently beats vanilla on-policy distillation across stability, robustness, continual learning, and out-of-distribution tasks, for both language and vision-language models.

Picture a tutor who is not just smart but is also holding the answer key — the one you will never get on exam day. If you simply copy every answer the tutor gives, some of what you copy is real reasoning you could learn, and some of it is only there because they can see the key. Copy both blindly and you train yourself to imitate someone with information you will never have. That is the privilege illusion, and it is exactly what plain on-policy distillation walks into.

On-policy distillation is a workhorse of model compression: let a small student generate its own attempts, then have a stronger teacher correct them token by token. The trouble is that the teacher often runs with privileged context — a gold answer, an extra document, a hint — that the deployed student will not have. Where the teacher’s edge comes from that context rather than from real skill, the correction pushes the student toward an answer it structurally cannot reproduce. The gradient looks like teaching; it is really chasing a ghost.
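To make the "imitate every token equally" behaviour concrete, here is a minimal sketch of a vanilla OPD loss on a single student-generated rollout. It assumes a per-token reverse KL divergence, KL(student || teacher), which is one common way to write on-policy distillation; the article does not state which divergence the DOPD paper uses. The function names and the toy logits are made up for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position (assumed objective)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )

def vanilla_opd_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL over a student-generated rollout.

    Every position gets the same weight, whether the teacher's edge
    comes from skill or from privileged context.
    """
    per_token = [
        reverse_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token), per_token

# Toy rollout: 2 positions, vocabulary of 3 tokens (hypothetical logits).
student_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 0.3]]
# At position 1 the teacher is far more confident, e.g. because it saw a hint.
teacher_logits = [[2.0, 0.5, 0.1], [3.0, 0.1, 0.3]]

loss, per_token = vanilla_opd_loss(student_logits, teacher_logits)
print(per_token, loss)
```

Position 0 contributes nothing because the two distributions agree, while position 1 dominates the loss. Nothing in this objective asks why the teacher is so confident there, which is the gap DOPD's routing is meant to fill.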

Here is DOPD’s move, and it is the whole idea: add a second reference — the same student, but allowed one peek at the privileged context. Now, for each token, you can ask a sharp question instead of a blurry one. If the peek-enabled student closes most of the gap on its own, the teacher’s advantage there was just the answer key, so DOPD steps back. If the gap survives even with the peek, that is real capability the student is missing, so DOPD drills it hard. The routing reads each token’s probabilities and advantages — how much better a policy’s choice is on that token — and sends the supervision to whichever privileged policy is trustworthy, which matters because only some tokens carry the real skill signal at all.

Walk two tokens through it (illustrative numbers). Say the teacher’s advantage on each token is 0.9, and you let the privileged student peek. On token A the peeking student reaches 0.85, so the gap that survives the peek is only about 0.05: roughly 94% of the teacher’s edge here was the answer key, and DOPD down-weights it. On token B the peeking student only reaches 0.3, leaving a 0.6 gap that the key did not explain. That is real capability, and DOPD trains hard on exactly that token. Same teacher advantage, opposite treatment, because the peek separated skill from information.
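The arithmetic in this walk-through can be reproduced in a few lines. This is a sketch under stated assumptions, not the paper's routing rule: the decomposition into a residual gap and an information share follows the illustrative numbers in the text, while the helper names and the `threshold` of 0.3 are hypothetical choices made only so the two tokens land on opposite sides.

```python
def split_teacher_edge(teacher_adv, privileged_student_adv):
    """Split the teacher's advantage on one token into two parts.

    residual_gap: what the teacher still has over the student after the peek.
    info_share:   fraction of the teacher's edge the peek alone explains.
    Assumes teacher_adv is positive.
    """
    residual_gap = teacher_adv - privileged_student_adv
    info_share = privileged_student_adv / teacher_adv
    return residual_gap, info_share

def route(residual_gap, threshold=0.3):
    """Hypothetical hard router: drill large residual gaps, ease off small ones."""
    return "drill" if residual_gap > threshold else "ease off"

# (teacher advantage, privileged-student advantage) from the walk-through.
tokens = {"A": (0.9, 0.85), "B": (0.9, 0.3)}

for name, (teacher_adv, peek_adv) in tokens.items():
    gap, share = split_teacher_edge(teacher_adv, peek_adv)
    print(f"token {name}: residual gap {gap:.2f}, "
          f"explained by the peek {share:.0%}, decision: {route(gap)}")
```

A real router would more plausibly produce a continuous per-token weight than a hard switch; the hard threshold is used here only to keep the two cases easy to read.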

Method	Reference it copies	Per-token decision	What the student ends up learning
Vanilla OPD	one privileged teacher	none — imitate every token equally	real skill and un-reproducible answer-key habits (the privilege illusion)
DOPD (arXiv 2606.30626)	a privileged teacher and a privileged student	route by advantage gap: drill real gaps, ease off information gaps	transferable capability, not an answer key it can never hold

Because the student stops chasing signals it can never reproduce, it comes out steadier — the paper reports DOPD beating vanilla OPD across stability, robustness, continual learning, and out-of-distribution tasks, and it holds for vision-language models as well as text. The headline is not a single benchmark number; it is that telling a skill gap apart from an information gap, token by token, is what makes the distillation transfer.


Related explainers

OPID — On-policy skill distillation — the same on-policy distillation family, here mining skills from an agent’s own runs
CoPD — Co-evolving policy distillation — another fix for on-policy distillation, keeping every teacher on-policy instead of frozen
Agents-A1 — Scaling the horizon — a small model that matches giants, trained with multi-teacher on-policy distillation
Distillation in pre-training — non-monotonic teacher strength — when a stronger teacher can actually hurt the student


