Distillation in pre-training — non-monotonic teacher strength

[Figure: "Does a stronger teacher always mean a better student?" Left panel: student gain over the no-distillation baseline (y-axis, 0 to +max) against teacher strength (x-axis, from weak and undertrained to strong and well-trained). Right panel, "Where the gains land": out-of-domain gain (generalization) is much larger than in-domain gain (small).]
learnaivisually.com/ai-explained/non-monotonic-pretrain-distillation

The news. On May 22, 2026, Lu and Liu released a sweep of pre-training distillation varying teacher size, teacher training budget, student size, and the mixed-loss coefficient. The headline finding contradicts the standard "pick the strongest teacher you can afford" playbook: small undertrained teachers still help larger students when distillation and language-model losses are mixed, and stronger teachers can diminish or reverse the gains. The benefit, where it shows up, lands disproportionately on generalization — out-of-domain evals — rather than on the in-domain validation loss most teams steer by.

Picture an apprentice chess player about to spend a season playing against a sparring partner. The standing advice, the one every coach gives, is "play someone stronger than you": find the highest-rated partner who will spend time with you. The new paper is the careful version of that experiment, in which the apprentice plays a full season against partners at three different ratings (a club player, a master, and a grandmaster) and the rating gain over each season is measured. It is not the grandmaster who produces the biggest improvement. It is the club player: the partner the apprentice can actually read mid-game, whose moves they can argue with, whose mistakes they can spot. The grandmaster wins every game, and the apprentice walks away from the season slightly weaker than baseline, having absorbed patterns they cannot yet put to use.

The mechanism is straightforward once the metaphor sits still. In pre-training distillation, the student trains on a mixed loss: part standard language-model loss against the corpus tokens, part distillation loss against the teacher's logits on the same tokens. The mix coefficient α blends the two. With α = 0 the student is just doing normal pre-training; with α = 1 it ignores the corpus targets and only chases the teacher. The paper independently sweeps four axes: teacher size, teacher training budget, student size, and α. The joint surface is what reveals the non-monotonicity. A teacher that is too far ahead of the student emits logits the student cannot match without distorting representations that would have served it on novel data. The mixed loss is not a free signal: it pulls the student toward the teacher's geometry, and when that geometry is wildly more capable than the student's, the pull is destructive. The ceiling on useful distillation is set by the student, not the teacher.
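
The blend described above can be sketched for a single token position. This is a minimal illustration, not the paper's implementation: the linear form (1 − α)·LM + α·KD, the forward KL direction, the absence of a temperature, and all function names are assumptions chosen to match the α = 0 and α = 1 endpoints in the text.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def lm_loss(student_logits, target_id):
    """Cross-entropy against the corpus token (the standard LM loss)."""
    return -log_softmax(student_logits)[target_id]

def distill_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the vocabulary (assumed direction)."""
    log_p_t = log_softmax(teacher_logits)
    log_p_s = log_softmax(student_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))

def mixed_loss(student_logits, teacher_logits, target_id, alpha):
    """Assumed blend: (1 - alpha) * LM loss + alpha * distillation loss."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return ((1.0 - alpha) * lm_loss(student_logits, target_id)
            + alpha * distill_loss(student_logits, teacher_logits))

# Toy 4-token vocabulary; the logits are made up for illustration.
student = [2.0, 0.5, -1.0, 0.0]
teacher = [0.1, 3.0, -2.0, 0.3]
target = 0  # index of the corpus token at this position

for alpha in (0.0, 0.5, 1.0):
    print(alpha, round(mixed_loss(student, teacher, target, alpha), 4))
```

At α = 0 the function reduces to the plain LM loss, and at α = 1 the corpus target drops out entirely, which is the behaviour the paragraph describes. A real training loop would average this over every position in a batch.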

The secondary finding is the one that explains why the field missed this. Teams evaluate distillation by checking in-domain validation loss — perplexity on a slice of the pre-training corpus. The paper reports that under a mixed loss, gains concentrate on out-of-domain evals: held-out tasks, new formats, prompts the corpus never saw. In-domain perplexity barely moves. So a team picking the strongest teacher they can afford, eyeballing in-domain val loss, sees no help from the weaker teacher and concludes correctly (within that frame) that the stronger teacher is the right one. The frame is what was wrong. Once you measure out-of-domain, the picture inverts.
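
The effect of the evaluation frame can be sketched with the illustrative deltas from the table in the next section. The noise floor below is a hypothetical value picked only to make the point, and the helper name is invented for this example; neither comes from the paper.

```python
# Illustrative deltas (percentage points vs. the LM-only baseline), copied from
# this article's table. They are not values reported by the paper.
DELTAS = {
    "weak (0.3B / 6B tokens)":   {"in_domain": 0.5,  "out_of_domain": 3.4},
    "match (1.4B / 30B tokens)": {"in_domain": 0.4,  "out_of_domain": 2.1},
    "strong (7B / 300B tokens)": {"in_domain": -0.1, "out_of_domain": 0.3},
}

# Hypothetical run-to-run noise floor, chosen only for illustration.
NOISE_FLOOR = 0.6

def distinguishable(deltas, metric, noise_floor=NOISE_FLOOR):
    """Return (teacher, delta) pairs whose delta exceeds the noise floor, best first."""
    hits = [(name, d[metric]) for name, d in deltas.items()
            if abs(d[metric]) > noise_floor]
    return sorted(hits, key=lambda item: item[1], reverse=True)

in_domain_view = distinguishable(DELTAS, "in_domain")
ood_view = distinguishable(DELTAS, "out_of_domain")

print("in-domain:", in_domain_view)
print("out-of-domain:", ood_view)
```

Under these assumptions the in-domain view returns nothing at all, so no teacher looks better than any other, while the out-of-domain view separates the weak and matched teachers from the strong one.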

Where the gap actually comes from

Hold three variables fixed: one 1.4B-parameter student, the same 30B-token pre-training corpus, and the same α = 0.5 mixed-loss coefficient. Take two contrasting teachers: a 0.3B teacher trained on 6B tokens (small and undertrained), and a 7B teacher trained on 300B tokens (strong, well-trained). The student trained against the weak teacher ends the run with roughly a +3.4% out-of-domain accuracy gain over the no-distillation baseline, and approximately +0.5% in-domain. The student trained against the strong teacher ends with +0.3% out-of-domain and a small negative in-domain delta. These are illustrative numbers calibrated to the paper's directional finding: sweeping the four axes produces a continuous surface, and these are two corner points that match its shape, not specific reported values. The arithmetic is the surprise. The weaker teacher delivered roughly 10× the out-of-domain gain (+3.4% versus +0.3%), and it did so at a fraction of the compute, because both the teacher's own pre-training and its forward passes during the student's distillation run were cheaper.

| Teacher | Student | Mix α | Out-of-domain Δ | In-domain Δ |
|---|---|---|---|---|
| 0.3B / 6B tokens (weak) | 1.4B | 0.5 | ~+3.4% (illustrative) | ~+0.5% (illustrative) |
| 1.4B / 30B tokens (match) | 1.4B | 0.5 | ~+2.1% (illustrative) | ~+0.4% (illustrative) |
| 7B / 300B tokens (strong) | 1.4B | 0.5 | ~+0.3% (illustrative) | ~−0.1% (illustrative) |
| none (LM-only baseline) | 1.4B | 0 | 0% (reference) | 0% (reference) |
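
The compute side of the comparison can be worked through with common rules of thumb. The approximations below (about 6·N·D FLOPs to train a dense transformer with N parameters on D tokens, and about 2·N FLOPs per token for a forward pass) are assumptions brought in for illustration, not figures from the paper, and the gains are the illustrative table values.

```python
def train_flops(params, tokens):
    """Rule-of-thumb training cost: about 6 * N * D FLOPs (assumption)."""
    return 6.0 * params * tokens

def forward_flops(params, tokens):
    """Rule-of-thumb forward-pass cost: about 2 * N FLOPs per token (assumption)."""
    return 2.0 * params * tokens

STUDENT_TOKENS = 30e9  # the 30B-token corpus the student is distilled on

# Teacher sizes and budgets from the table; gains are illustrative values.
weak = {"params": 0.3e9, "tokens": 6e9, "ood_gain": 3.4}
strong = {"params": 7e9, "tokens": 300e9, "ood_gain": 0.3}

# Raw out-of-domain gain ratio, weak teacher over strong teacher.
gain_ratio = weak["ood_gain"] / strong["ood_gain"]

# Cost of producing each teacher.
pretrain_ratio = (train_flops(strong["params"], strong["tokens"])
                  / train_flops(weak["params"], weak["tokens"]))

# Cost of running each teacher over the student's corpus to get logits.
logit_ratio = (forward_flops(strong["params"], STUDENT_TOKENS)
               / forward_flops(weak["params"], STUDENT_TOKENS))

print(f"out-of-domain gain, weak vs strong: {gain_ratio:.1f}x")
print(f"teacher pre-training cost, strong vs weak: {pretrain_ratio:.0f}x")
print(f"teacher logit cost during distillation, strong vs strong-free weak run: {logit_ratio:.1f}x")
```

Under these approximations the weak teacher gives about 11× the out-of-domain gain while the strong teacher costs on the order of a thousand times more to pre-train and about 23× more to query during distillation, so the gain per unit of teacher compute differs by far more than the raw gain ratio alone suggests.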

The deeper lesson is what this rearranges for practitioners. Distillation has been treated like a quality dial — bigger teacher, better student, marginal cost the only constraint. The sweep reframes it as an impedance-matching problem: the teacher's representations have to be close enough to the student's to be transferable. That mirrors a thread running through several other recent training-stack explainers — the training-inference mismatch in RL is the same flavor of problem (one distribution drifting too far from another), and TFGN's subspace-preserving updates attack a closely related failure mode in continual pre-training. The unifying claim across all of these: when two signals are too far apart, mixing them does not interpolate — it distorts. And the easiest way to make the signals close is to keep the teacher modest. The student's embedding space cannot absorb structure it has no room for; the teacher's logits are only useful insofar as the student can re-derive them on novel data.

A caveat the paper is careful about: this is pre-training distillation, not the post-training knowledge-distillation recipes used to compress a finished large model into a smaller deployable one. Post-training distillation operates on a teacher whose representations have already converged, and it usually does want a strong teacher. The non-monotonic finding here applies specifically when distillation runs as a side-channel during the student's normal pre-training pass.
