Hybrid linear-attention models swap the ever-growing KV cache for a fixed-size recurrent memory, making long-context inference cheaper and lighter on memory bandwidth — but converting a pretrained Transformer into one normally burns a huge pile of distillation tokens. By starting the student in the right regime, Taylor-Calibrate reaches the same recovery target with 4.9–9.2× fewer training tokens, making the conversion far cheaper across four teacher settings and three retained-layer policies.

How does Taylor-Calibrate relate to the KV cache and linear attention?

Standard softmax attention relies on a KV cache that grows with every token, so long context is expensive. Linear-attention layers like Gated DeltaNet replace that cache with a fixed-size recurrent state governed by decay, write, and output gates. Those gates have no direct equivalent in softmax attention, which is why a naive copy of the projections leaves them unset, and why deriving them from the teacher (the Taylor-Calibrate step) is what makes the converted model usable quickly.

Taylor-Calibrate cuts hybrid-attention distillation tokens 4.9–9.2× — Taylor-guided gate initialization

Q: What is Taylor-guided gate initialization?

It's a principled way to initialize a linear-attention student when converting a softmax Transformer into a hybrid model. Instead of copying the teacher's attention projections blindly, Taylor-Calibrate (arXiv 2606.16429) runs a Taylor expansion of the teacher's attention map and uses those statistics to set the Gated DeltaNet student's value projection and its decay, write, and output gates, then aligns each converted layer to the teacher before full distillation. The calibrated student starts out up to 88× better zero-shot than a naively copied one.

TL;DR

What is it: Taylor-Calibrate (Together AI, arXiv 2606.16429) is a new paper; the focus here is its method, Taylor-guided gate initialization: a principled way to start a Gated DeltaNet linear-attention student when you convert a softmax Transformer into a hybrid model.
Why it’s needed: Hybrid linear-attention models trade the ever-growing KV cache for a fixed-size recurrent memory — cheaper, faster long-context inference — but converting a pretrained Transformer into one normally burns a huge pile of distillation tokens; this makes that conversion 4.9–9.2× cheaper by starting the student in the right regime.
vs previous: The previous approach copies the teacher's attention projections cold, leaving the student's recurrent gates (decay, write, output) unset, so it wastes tokens repairing a bad start; Taylor-Calibrate derives those gates from the teacher's attention statistics first, so the student starts up to 88× better zero-shot.

Jargon

Hybrid linear attention: A model where some attention layers are replaced by linear-attention layers that keep a fixed-size recurrent memory instead of a KV cache that grows with sequence length — so long context costs far less, at some quality risk.
Gated DeltaNet (GDN): The student architecture here: a gated linear-attention/recurrent layer whose memory is governed by learned decay, write, and output gates. It is what the softmax teacher gets converted into.
Softmax attention: The standard Transformer attention (the "teacher"). It compares every token against every other through a softmax over scores — accurate, but quadratic in sequence length and backed by a growing KV cache.
Distillation (conversion): Training a cheaper student to reproduce a stronger teacher's behavior. Here it converts a pretrained softmax Transformer into a hybrid linear-attention model rather than training one from scratch.
Recurrent gates: The student's hidden control knobs: decay (how fast old memory fades), the write gate (how much each token records), and the output gate (how much it reads back). Softmax attention has no explicit equivalents.
Taylor expansion: A way to locally approximate a function by its slope and curvature. Applied to the teacher's attention map, it yields statistics that predict the matching gate values — a principled starting point, not a random one.
Per-layer alignment: A short pre-training step that matches each converted layer's output to the teacher's on the same inputs, layer by layer, before global distillation begins.
Zero-shot: Performance measured straight after initialization, with no further training. A better zero-shot student means a better starting point — fewer tokens needed to recover full quality.

The news. On June 15, 2026, Together AI released Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation. The problem it tackles: converting a pretrained softmax Transformer into a hybrid linear-attention model (a Gated DeltaNet student) is brittle, because naively copying the teacher's attention projections leaves the student's recurrent decay, write, and output dynamics unspecified, so it starts in a poor regime and burns distillation tokens just fixing its initialization. Taylor-Calibrate instead uses Taylor-guided teacher-attention statistics to set those gates, then aligns each layer to the teacher. Across four teacher settings and three retained-layer policies, it reports up to an 88× stronger zero-shot student and 4.9–9.2× fewer training tokens to matched recovery.

Picture a long-running play whose lead is precise but exhausting to keep on stage every night. The theatre wants a cheaper understudy who can carry the same show. Hand the understudy only the script — the lines — and they'll stumble, because the script never says how long to hold a pause, when to step forward, or how hard to react. That unwritten part is the blocking and timing, and it turns out a softmax-attention Transformer — the lead — has no explicit knobs for it at all. Its replacement does: a Gated DeltaNet student is a linear-attention layer whose memory is run by three gates — a decay (how fast old memory fades), a write gate (how much each token records), and an output gate (how much it reads back). Those gates are the blocking and timing.
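
To make the three gates concrete, here is a minimal single-head NumPy sketch of a gated delta-rule recurrence, written in the style of the published Gated DeltaNet formulation. It is an illustration only, not code from the Taylor-Calibrate paper: the function name `gated_delta_step`, the tensor shapes, the scalar gates, and the elementwise output gate are all assumptions made for this sketch.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta, g_out):
    """One token of a gated delta-rule recurrence (single head, illustrative).

    S     : (d_v, d_k) fixed-size recurrent memory
    q, k  : (d_k,) query and key (k assumed to have unit L2 norm)
    v     : (d_v,) value
    alpha : scalar in [0, 1], decay gate (how fast old memory fades)
    beta  : scalar in [0, 1], write gate (how much this token records)
    g_out : (d_v,) values in [0, 1], output gate (how much is read back)
    """
    S = alpha * S                          # decay the old memory
    pred = S @ k                           # what memory currently returns for key k
    S = S + beta * np.outer(v - pred, k)   # delta-rule write: correct the error for k
    o = g_out * (S @ q)                    # gated read with the query
    return S, o

# Hypothetical toy run: the state keeps the same shape however many tokens arrive.
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = np.zeros((d_v, d_k))
for _ in range(1000):
    k = rng.standard_normal(d_k)
    k /= np.linalg.norm(k)
    q = rng.standard_normal(d_k)
    v = rng.standard_normal(d_v)
    S, o = gated_delta_step(S, q, k, v, alpha=0.95, beta=0.5,
                            g_out=np.full(d_v, 0.5))
print(S.shape)  # (3, 4)
```

Softmax attention has no `alpha`, `beta`, or `g_out` to copy from, which is the gap the rest of this piece is about.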

So when you convert the Transformer into the student, the obvious move — copy the Q, K, V projections straight across — hands over the lines but leaves the blocking blank. The student opens in a terrible state and has to claw its way back to where the teacher already was, paying for the climb in distillation tokens.

Taylor-Calibrate sets the blocking by watching the lead perform. It runs a Taylor expansion of the teacher's attention map and uses those statistics to set the student's value projection, memory timescale (the decay bias), write gate, and output gate — instead of copying projections blindly. Then a short per-layer alignment matches each converted layer's output to the teacher's on the same inputs — a scene-by-scene dress rehearsal — before any global distillation begins. In the authors' framing:

"Taylor-Calibrate uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then performs a short per-layer alignment step to match each converted layer to the teacher output."

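A toy sketch of what a per-layer alignment step could look like: one student layer is trained to reproduce one teacher layer's output on the same inputs, before any end-to-end distillation. Everything here is a stand-in. Both layers are plain linear maps, the loss is mean squared error, and `align_layer` is a hypothetical helper; the paper's actual objective and layer types are not given in this article.

```python
import numpy as np

def align_layer(W_student, W_teacher, X, lr=1.0, steps=200):
    """Fit one student layer to one teacher layer's outputs on inputs X (MSE)."""
    W = W_student.copy()
    Y_teacher = X @ W_teacher.T                # teacher output on the shared inputs
    losses = []
    for _ in range(steps):
        err = X @ W.T - Y_teacher              # (n, d_out) student minus teacher
        losses.append(float(np.mean(err ** 2)))
        grad = 2.0 * err.T @ X / err.size      # gradient of the mean squared error
        W -= lr * grad
    return W, losses

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))              # hypothetical calibration inputs
W_teacher = rng.standard_normal((8, 8))
W_student = np.zeros((8, 8))                   # deliberately poor starting point
W_aligned, losses = align_layer(W_student, W_teacher, X)
print(losses[0], losses[-1])
```

The point of the sketch is the shape of the step: it is local to one layer, needs only the teacher's activations, and runs before the global distillation loss is ever computed.
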
Why bother converting at all? Because a linear-attention layer keeps a fixed-size recurrent state instead of a KV cache that grows with every token — so long-context inference gets cheaper and far lighter on memory bandwidth (the same pressure that pushes serving toward tricks like grouped-query attention). Training such a hybrid from scratch is costly, so the live trend is to convert an existing Transformer — which is exactly where the token bill was landing, and exactly what a better starting point shrinks.
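
A back-of-envelope sketch of that memory difference, per sequence. The configuration below (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example rather than any model from the paper, and the one-matrix-per-head state layout is an assumption about how a linear-attention layer stores its memory.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values: 2 tensors per layer, each seq_len x n_kv_heads x head_dim.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

def recurrent_state_bytes(n_layers, n_heads, d_k, d_v, bytes_per_elem=2):
    # One d_v x d_k state matrix per head per layer, independent of seq_len.
    return n_layers * n_heads * d_k * d_v * bytes_per_elem

state = recurrent_state_bytes(n_layers=32, n_heads=8, d_k=128, d_v=128)
for seq_len in (4_096, 131_072):
    cache = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"{seq_len:>7} tokens: KV cache {cache / 2**20:.0f} MiB, "
          f"recurrent state {state / 2**20:.0f} MiB")
```

The cache term scales linearly with `seq_len` while the state term does not. A hybrid keeps some softmax layers, so in practice it pays the cache cost only for the retained layers.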

How much does the head start save? Hold the recovery target fixed: the quality bar the converted student must clear. Naive conversion needs some budget of distillation tokens to get there; call it B. Taylor-Calibrate reaches the same bar with B ÷ 4.9 to B ÷ 9.2 tokens, that is, between ~20% and ~11% of the baseline. If B were 10 billion tokens (illustrative; the paper reports the ratio, not an absolute count), that's ≈1.1–2.0 billion for the same quality. And measured at the very start, before a single distillation step, the calibrated student is up to 88× better zero-shot than the cold-copied one. That head start is what buys the token savings.
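
The same arithmetic as a few lines of Python. The 10-billion-token baseline is the illustrative figure from the paragraph above, not a number reported by the paper, and `calibrated_budget` is a hypothetical helper name.

```python
def calibrated_budget(baseline_tokens, reduction_factor):
    """Tokens needed when the baseline budget shrinks by reduction_factor."""
    return baseline_tokens / reduction_factor

B = 10e9  # illustrative baseline budget, not from the paper
for factor in (4.9, 9.2):
    tokens = calibrated_budget(B, factor)
    print(f"{factor}x fewer: {tokens / 1e9:.2f}B tokens "
          f"({tokens / B:.1%} of baseline)")
# 4.9x fewer: 2.04B tokens (20.4% of baseline)
# 9.2x fewer: 1.09B tokens (10.9% of baseline)
```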

Conversion step	Naive (copy projections)	Taylor-Calibrate
Value projection & gates	Copy attention projections; decay / write / output gates left at defaults	Taylor-guided teacher statistics set the value projection and the decay, write, and output gates
Before distillation (zero-shot)	Poor starting regime	~88× better zero-shot (representative ablation)
Pre-distillation step	None	Short per-layer alignment to the teacher's outputs
Tokens to matched target	Baseline budget `B`	~4.9–9.2× fewer
Validation breadth	—	4 teacher settings, 3 retained-layer policies

Related explainers

HydraHead — full + linear attention fused per head — the hybrid architecture this kind of conversion targets, mixed at the head level rather than the layer level
SubQ-1.1 — subquadratic sparse attention — a different route out of attention's quadratic cost
MiniMax-M3 MSA — block-sparse attention — yet another way to cut attention's memory and compute

Frequently Asked Questions