What is non-monotonic teacher strength in pre-training distillation?

It is the finding that increasing the teacher's size and training budget does not monotonically increase the student's gain over a no-distillation baseline. In a 2026 sweep by Lu and Liu, gains rise as the teacher gets stronger, peak at a small undertrained teacher, then fall — eventually crossing into negative territory where distillation actively hurts. The same student gets a bigger boost from a weak teacher than from a strong one, contradicting the standard 'pick the strongest teacher you can afford' playbook.

Why does a stronger teacher sometimes hurt the student?

A stronger teacher's logits encode representations the student cannot match without distorting structures that were useful on novel data. Under a mixed loss the student is pulled toward both the corpus targets and the teacher's distribution; when the teacher is far ahead, that pull warps the student away from generalizable features. Gains flatten or reverse, especially in-domain. The student's capacity sets the ceiling on useful distillation — the teacher cannot lift the student past representations the student can still re-derive on novel inputs.

What does it mean that distillation gains land out-of-domain rather than in-domain?

In-domain evals score the student on held-out data drawn from the pre-training distribution; out-of-domain evals use different formats and tasks. The paper sees small in-domain movement and substantially larger out-of-domain gains — the student generalizes better but does not necessarily fit its own training distribution any better. Teams steering only by in-domain validation loss will miss the benefit entirely, which explains why the field defaulted to a 'bigger teacher' heuristic that the joint sweep contradicts.

Distillation in LLM pre-training — Non-monotonic teacher strength

learnaivisually.com/ai-explained/non-monotonic-pretrain-distillation

TL;DR

What is it: A new pre-training distillation paper sweeps teacher size, teacher training budget, student size, and the loss-mix coefficient, and finds that the relationship between teacher strength and student gain is non-monotonic — small undertrained teachers still help large students, while strong teachers can reduce the same student's gain over baseline.
Why it’s needed: Pre-training distillation is the cheapest knob the field has for moving the loss curve at fixed compute — used to make small open models punch above their weight and to seed billion-parameter runs from larger checkpoints. If the assumption "find the strongest teacher you can afford" is wrong, every team running these recipes has been spending compute on the wrong end of a curve.
vs previous: Earlier distillation work optimized for teacher quality on the same in-domain evals the student would see — and most production playbooks ("use the biggest teacher that fits") were a direct consequence. This paper varies four axes jointly and shows the safer policy is a small, slightly undertrained teacher with a mixed loss, and that the gains concentrate on generalization rather than in-domain accuracy.

Jargon

Distillation (pre-training): Training a student network against a teacher network's predictions during the regular language-model pre-training run — instead of waiting until after pre-training and doing a separate fine-tuning pass. The student sees the teacher's full output distribution (the logits) on the same tokens as the LM data, not just the one-hot target.
Mixed loss: The training objective is a weighted sum of two losses: the ordinary language-model loss against the corpus targets, and a distillation loss against the teacher's logits. A mix coefficient α decides how much of each. Pure LM (α=0) is normal pre-training; pure distillation (α=1) ignores the corpus targets entirely.
Teacher strength: Shorthand for two correlated quantities, the teacher model's parameter count and the number of tokens it was itself trained on. A "stronger" teacher has more of both. The paper varies them independently and finds neither alone behaves monotonically.
Non-monotonic: A relationship is non-monotonic when increasing the input does not always increase the output. Here, student gain over the no-distillation baseline rises, peaks at a small/undertrained teacher, and then falls as the teacher gets stronger — eventually crossing zero (i.e. distillation actively hurts).
In-domain vs out-of-domain: In-domain evals score the student on data drawn from the same distribution it was trained on (e.g. validation perplexity on the pre-training corpus). Out-of-domain evals score it on held-out distributions (different formats, different tasks). The paper reports that distillation gains land mostly out-of-domain.
Generalization gap: The mismatch between how a model performs on its training distribution and on a held-out distribution. A model with a small generalization gap "generalizes well." This paper shows distillation reduces the gap — gains accumulate on the held-out side.

The news. On May 22, 2026, Lu and Liu released a sweep of pre-training distillation varying teacher size, teacher training budget, student size, and the mixed-loss coefficient. The headline finding contradicts the standard "pick the strongest teacher you can afford" playbook: small undertrained teachers still help larger students when distillation and language-model losses are mixed, and stronger teachers can diminish or reverse the gains. The benefit, where it shows up, lands disproportionately on generalization — out-of-domain evals — rather than on the in-domain validation loss most teams steer by.

Picture an apprentice chess player about to spend a season playing against a sparring partner. The standing advice, the kind every coach gives, is "play someone stronger than you": find the highest-rated partner who will spend time with you. The new paper is the careful version of that experiment, in which the apprentice plays a full season against each of three partners: a club player, a master, and a grandmaster. The apprentice's rating gain over each season is measured. It is not the grandmaster who produces the biggest improvement. It is the club player: the partner the apprentice can actually read mid-game, whose moves they can argue with, whose mistakes they can spot. The grandmaster wins every game, and the apprentice walks away from the season slightly weaker than baseline, having absorbed patterns they cannot yet put to use.

The mechanism is straightforward once the metaphor sits still. In pre-training distillation, the student trains on a mixed loss: part standard language-model loss against the corpus tokens, part distillation loss against the teacher's logits on the same tokens. The mix coefficient α blends the two. With α=0 the student is just doing normal pre-training; with α=1 it ignores the corpus targets and only chases the teacher. The paper independently sweeps four axes: teacher size, teacher training budget, student size, and α. The joint surface is what reveals the non-monotonicity. A teacher that is too far ahead of the student emits logits the student cannot match, while also fitting the corpus targets, without distorting representations that would have served it on novel data. The mixed loss is not a free signal: it pulls the student toward the teacher's geometry, and when that geometry belongs to a far more capable model, the pull is destructive. The ceiling on useful distillation is set by the student, not the teacher.
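A minimal sketch of the mixed loss for a single token, in plain Python. The weighting `(1 - alpha) * LM + alpha * distillation` follows the α=0 and α=1 endpoints described here; the use of KL(teacher || student) as the distillation term, the `temperature` parameter, the function names, and the toy logits are all assumptions made for illustration, not details taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def lm_loss(student_logits, target_id):
    """Cross-entropy of the student against the one-hot corpus target."""
    probs = softmax(student_logits)
    return -math.log(probs[target_id])

def distill_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the full vocabulary (assumed form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mixed_loss(student_logits, teacher_logits, target_id, alpha):
    """(1 - alpha) * LM loss + alpha * distillation loss for one token."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return ((1.0 - alpha) * lm_loss(student_logits, target_id)
            + alpha * distill_loss(student_logits, teacher_logits))

# Toy 4-token vocabulary; all numbers are made up for illustration.
student = [2.0, 0.5, 0.1, -1.0]
teacher = [0.2, 3.0, 0.1, -2.0]
target = 0

for alpha in (0.0, 0.5, 1.0):
    loss = mixed_loss(student, teacher, target, alpha)
    print(f"alpha={alpha:.1f}  loss={loss:.4f}")
```

In this toy case the teacher puts most of its mass on a different token than the corpus target, so raising α moves the objective away from the corpus label and toward the teacher's distribution. That is the "pull" the paragraph describes, reduced to one token.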

The secondary finding is the one that explains why the field missed this. Teams evaluate distillation by checking in-domain validation loss — perplexity on a slice of the pre-training corpus. The paper reports that under a mixed loss, gains concentrate on out-of-domain evals: held-out tasks, new formats, prompts the corpus never saw. In-domain perplexity barely moves. So a team picking the strongest teacher they can afford, eyeballing in-domain val loss, sees no help from the weaker teacher and concludes correctly (within that frame) that the stronger teacher is the right one. The frame is what was wrong. Once you measure out-of-domain, the picture inverts.

Where the gap actually comes from

Hold three variables fixed: one 1.4B-parameter student, the same 30B-token pre-training corpus, and the same mixed-loss coefficient α = 0.5. Then compare two contrasting teachers: a 0.3B teacher trained on 6B tokens (small and undertrained) and a 7B teacher trained on 300B tokens (strong, well-trained). The student trained against the weak teacher ends the run with roughly +3.4% out-of-domain accuracy over the no-distillation baseline and about +0.5% in-domain. The student trained against the strong teacher ends with +0.3% out-of-domain and a small negative in-domain delta. These are illustrative numbers calibrated to the paper's directional finding: sweeping the four axes produces a continuous surface, and these are two corner points that match its shape, not specific reported values. The arithmetic is the surprise: the weaker teacher delivered roughly 10× more out-of-domain gain (+3.4% versus +0.3%), and it did so at far lower cost, because both the teacher's own pre-training and the teacher forward passes during the student's distillation run were cheaper.
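The cost side of that comparison can be made concrete with a back-of-the-envelope calculation. The sketch below uses the common 6 × parameters × tokens approximation for training compute, which is an assumption brought in here rather than something the paper is stated to use; the gains are the illustrative figures from the paragraph above, and `train_flops` is a hypothetical helper name.

```python
def train_flops(params, tokens):
    """Approximate training compute with the common 6 * N * D rule of thumb."""
    return 6.0 * params * tokens

# Teacher configurations from the paragraph above.
weak_flops = train_flops(0.3e9, 6e9)      # 0.3B params, 6B tokens
strong_flops = train_flops(7e9, 300e9)    # 7B params, 300B tokens

# Illustrative out-of-domain gains (percent over the LM-only baseline).
weak_gain, strong_gain = 3.4, 0.3

print(f"weak teacher:   {weak_flops:.2e} FLOPs")
print(f"strong teacher: {strong_flops:.2e} FLOPs")
print(f"teacher compute ratio (strong / weak): {strong_flops / weak_flops:.0f}x")
print(f"out-of-domain gain ratio (weak / strong): {weak_gain / strong_gain:.1f}x")
```

Under this approximation the strong teacher costs on the order of a thousand times more to train than the weak one, while the weak teacher's illustrative out-of-domain gain is about eleven times larger. The two ratios point in opposite directions.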

Teacher	Student	Mix α	Out-of-domain Δ	In-domain Δ
0.3B / 6B tokens (weak)	1.4B	0.5	~+3.4% (illustrative)	~+0.5% (illustrative)
1.4B / 30B tokens (match)	1.4B	0.5	~+2.1% (illustrative)	~+0.4% (illustrative)
7B / 300B tokens (strong)	1.4B	0.5	~+0.3% (illustrative)	~–0.1% (illustrative)
none (LM-only baseline)	1.4B	0	0% (reference)	0% (reference)
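To make "non-monotonic" checkable rather than rhetorical, the table rows can be ordered by teacher strength and tested directly. This is a small sketch over the illustrative deltas in the table, not over reported results. Treating the LM-only baseline as the zero-strength starting point is an assumption made here, and `is_monotonic` is a hypothetical helper.

```python
# Rows ordered by increasing teacher strength; "none" is the LM-only baseline.
# Values are the illustrative deltas from the table (percent), not reported results.
rows = [
    ("none",        0.0,  0.0),
    ("0.3B / 6B",   3.4,  0.5),
    ("1.4B / 30B",  2.1,  0.4),
    ("7B / 300B",   0.3, -0.1),
]

def is_monotonic(values):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

ood = [r[1] for r in rows]   # out-of-domain deltas
ind = [r[2] for r in rows]   # in-domain deltas

peak_teacher = rows[ood.index(max(ood))][0]
hurts_in_domain = [name for name, _, delta in rows if delta < 0]

print("out-of-domain monotonic:", is_monotonic(ood))
print("peak out-of-domain gain at teacher:", peak_teacher)
print("teachers with negative in-domain delta:", hurts_in_domain)
```

With these values the out-of-domain series rises from the baseline, peaks at the weakest teacher, and then falls, and only the strongest teacher pushes the in-domain delta below zero.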

The deeper lesson is what this rearranges for practitioners. Distillation has been treated like a quality dial — bigger teacher, better student, marginal cost the only constraint. The sweep reframes it as an impedance-matching problem: the teacher's representations have to be close enough to the student's to be transferable. That mirrors a thread running through several other recent training-stack explainers — the training-inference mismatch in RL is the same flavor of problem (one distribution drifting too far from another), and TFGN's subspace-preserving updates attack a closely related failure mode in continual pre-training. The unifying claim across all of these: when two signals are too far apart, mixing them does not interpolate — it distorts. And the easiest way to make the signals close is to keep the teacher modest. The student's embedding space cannot absorb structure it has no room for; the teacher's logits are only useful insofar as the student can re-derive them on novel data.

A caveat the paper is careful about: this is pre-training distillation, not the post-training knowledge-distillation recipes used to compress a finished large model into a smaller deployable one. Post-training distillation operates on a teacher whose representations have already converged, and it usually does want a strong teacher. The non-monotonic finding here applies specifically when distillation runs as a side-channel during the student's normal pre-training pass.

Related explainers

CoPD — Co-evolving Policy Distillation between parallel experts — distillation as RL between peers, not from a stronger teacher
ZEDA — Zero-output expert self-distillation — distilling within a single MoE rather than across model sizes
TIM — Training-Inference Mismatch in RL — the closest sibling failure mode: two distributions diverging during training
