The news. On June 15, 2026, Together AI released Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation. The problem it tackles: converting a pretrained softmax Transformer into a hybrid linear-attention model (a Gated DeltaNet student) is brittle, because naively copying the teacher's attention projections leaves the student's recurrent decay, write, and output dynamics unspecified, so the student starts in a poor regime and burns distillation tokens just fixing its initialization. Taylor-Calibrate instead uses Taylor-guided teacher-attention statistics to set those gates, then aligns each layer to the teacher. Across four teacher settings and three retained-layer policies, it reports up to an 88× stronger zero-shot student and 4.9–9.2× fewer training tokens to matched recovery.

Picture a long-running play whose lead is precise but exhausting to keep on stage every night. The theatre wants a cheaper understudy who can carry the same show. Hand the understudy only the script — the lines — and they'll stumble, because the script never says how long to hold a pause, when to step forward, or how hard to react. That unwritten part is the blocking and timing, and it turns out a softmax-attention Transformer — the lead — has no explicit knobs for it at all. Its replacement does: a Gated DeltaNet student is a linear-attention layer whose memory is run by three gates — a decay (how fast old memory fades), a write gate (how much each token records), and an output gate (how much it reads back). Those gates are the blocking and timing.
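To make the three gates concrete, here is a minimal sketch of one recurrent step of a simplified gated delta rule in NumPy. It is an illustration under assumptions, not the paper's or any library's implementation: the function name `gated_delta_step`, the scalar `alpha` and `beta`, the unit-norm key, and the elementwise output gate `g` are all choices made for this example.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta, g):
    """One step of a simplified gated delta rule (illustrative only).

    S     : (d_v, d_k) memory matrix, the fixed-size recurrent state
    q, k  : (d_k,) query and key; k is assumed to be unit-norm
    v     : (d_v,) value
    alpha : decay in (0, 1], how much old memory survives this step
    beta  : write strength in [0, 1], how much this token records
    g     : (d_v,) output gate in [0, 1], how much is read back
    """
    S = alpha * S                           # decay: fade old memory
    v_old = S @ k                           # what memory currently returns for k
    S = S + beta * np.outer(v - v_old, k)   # write: delta-rule correction
    o = g * (S @ q)                         # output gate scales the read
    return S, o

d_k, d_v = 4, 3
S = np.zeros((d_v, d_k))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([1.0, 2.0, 3.0])
gate = np.ones(d_v)

# Token 1: full write into empty memory, then read back with the same key.
S, o1 = gated_delta_step(S, q=k, k=k, v=v, alpha=1.0, beta=1.0, g=gate)
# Token 2: write gate closed and decay 0.5, so the read is a faded copy of v.
S, o2 = gated_delta_step(S, q=k, k=k, v=v, alpha=0.5, beta=0.0, g=gate)
```

None of `alpha`, `beta`, or `g` has a counterpart in a softmax attention layer, which is why copying Q, K, V alone says nothing about what values they should start at.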

So when you convert the Transformer into the student, the obvious move — copy the Q, K, V projections straight across — hands over the lines but leaves the blocking blank. The student opens in a terrible state and has to claw its way back to where the teacher already was, paying for the climb in distillation tokens.

Taylor-Calibrate sets the blocking by watching the lead perform. It runs a Taylor expansion of the teacher's attention map and uses those statistics to set the student's value projection, memory timescale (the decay bias), write gate, and output gate — instead of copying projections blindly. Then a short per-layer alignment matches each converted layer's output to the teacher's on the same inputs — a scene-by-scene dress rehearsal — before any global distillation begins. In the authors' framing:

"Taylor-Calibrate uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then performs a short per-layer alignment step to match each converted layer to the teacher output."

Softmax teacher: pretrained Transformer
Taylor read: expand the attention map
Set gates: decay · write · output
Per-layer align: match the teacher
Distill → GDN: 4.9–9.2× fewer tokens
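The per-layer alignment step can be pictured with a toy example. The sketch below is not the paper's procedure, which presumably tunes the converted layer's own parameters; it swaps in a closed-form least-squares fit of a single hypothetical output matrix `W`, on synthetic stand-in activations, just to show what "match each converted layer to the teacher output" means as an objective.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 512, 16  # hypothetical sizes for the illustration

# Synthetic stand-ins for one teacher attention layer's output and the
# converted student layer's output on the same inputs.
teacher_out = rng.normal(size=(n_tokens, d_model))
mix = np.eye(d_model) + 0.3 * rng.normal(size=(d_model, d_model))
student_out = teacher_out @ mix + 0.01 * rng.normal(size=(n_tokens, d_model))

def layer_mse(a, b):
    """Mean squared error between two layers' outputs."""
    return float(np.mean((a - b) ** 2))

# Assumed stand-in for alignment: choose W to minimize
# ||student_out @ W - teacher_out||^2 in closed form.
W, *_ = np.linalg.lstsq(student_out, teacher_out, rcond=None)
aligned_out = student_out @ W

before = layer_mse(student_out, teacher_out)
after = layer_mse(aligned_out, teacher_out)
print(f"layer MSE before alignment: {before:.4f}, after: {after:.6f}")
```

Because each layer is matched against the teacher's output on the same inputs, the fit is local and cheap compared with distilling the whole model end to end.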

Why bother converting at all? Because a linear-attention layer keeps a fixed-size recurrent state instead of a KV cache that grows with every token — so long-context inference gets cheaper and far lighter on memory bandwidth (the same pressure that pushes serving toward tricks like grouped-query attention). Training such a hybrid from scratch is costly, so the live trend is to convert an existing Transformer — which is exactly where the token bill was landing, and exactly what a better starting point shrinks.
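A back-of-the-envelope comparison shows the memory argument. The model dimensions below are hypothetical, chosen only for illustration, and the state formula assumes one d_v × d_k matrix per head per layer.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values: 2 tensors per layer, each seq_len x n_kv_heads x head_dim.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def recurrent_state_bytes(n_layers, n_heads, d_k, d_v, bytes_per_elem=2):
    # One d_v x d_k memory matrix per head per layer; no seq_len term.
    return n_layers * n_heads * d_k * d_v * bytes_per_elem

# Hypothetical model: 32 layers, 8 KV heads (grouped-query), 32 heads, head_dim 128.
state = recurrent_state_bytes(n_layers=32, n_heads=32, d_k=128, d_v=128)
for seq_len in (1_000, 100_000):
    kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:.0f} MiB, "
          f"recurrent state {state / 2**20:.0f} MiB")
```

In a hybrid, the layers that keep softmax attention still carry a KV cache, so the saving scales with how many layers are converted.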

How much does the head start save? Hold the recovery target fixed: the quality bar the converted student must clear. Naive conversion needs some budget of distillation tokens to get there; call it B. Taylor-Calibrate reaches the same bar with B ÷ 4.9 to B ÷ 9.2, that is, between ~20% and ~11% of the tokens. If B were 10 billion tokens (illustrative; the paper reports the ratio, not an absolute count), that's ≈1.1–2.0 billion for the same quality. And measured at the very start, before a single distillation step, the calibrated student is up to 88× better zero-shot than the cold-copied one. That head start is what buys the token savings.
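The arithmetic can be checked in a few lines. The 10-billion-token baseline is the same illustrative figure used above, not a number from the paper; only the 4.9× and 9.2× ratios are reported.

```python
def tokens_needed(baseline_tokens, reduction_factor):
    """Tokens to reach the same recovery target, given an N-times reduction."""
    return baseline_tokens / reduction_factor

B = 10e9  # illustrative baseline budget, not a figure from the paper

low = tokens_needed(B, 9.2)   # best reported case
high = tokens_needed(B, 4.9)  # worst reported case

print(f"{low / 1e9:.2f}B to {high / 1e9:.2f}B tokens")
print(f"{low / B:.1%} to {high / B:.1%} of the baseline budget")
```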

| Conversion step | Naive (copy projections) | Taylor-Calibrate |
| --- | --- | --- |
| Value projection & gates | Copy attention projections; decay / write / output gates left at defaults | Taylor-guided teacher statistics set the value projection and the decay, write, and output gates |
| Before distillation (zero-shot) | Poor starting regime | ~88× better zero-shot (representative ablation) |
| Pre-distillation step | None | Short per-layer alignment to the teacher's outputs |
| Tokens to matched target | Baseline budget B | ~4.9–9.2× fewer |

Validation breadth: 4 teacher settings, 3 retained-layer policies



Frequently Asked Questions