The news. On June 18, 2026, the HydraHead paper (arXiv:2606.20097) proposed a hybrid-attention architecture that mixes Full Attention and Linear Attention within each layer along the head axis, rather than choosing per layer. An interpretability-driven rule keeps Full Attention only for retrieval-critical heads and converts the rest to Linear Attention, with a scale-normalized fusion module reconciling the two. It reaches a 7:1 linear-to-full ratio that matches the quality of a coarser 3:1 layer-wise hybrid and, trained on just 15B tokens, reports a 69% improvement over baseline at 512K context, approaching Qwen3.5, which has a native 256K context.

Picture a newsroom. Every desk is an attention head, and each desk has a choice about how to keep up with the story so far. A full-attention desk re-reads the entire archive before writing each new line — perfect recall of any earlier fact, but the reading piles up as the archive grows. A linear-attention desk instead keeps a small running digest it updates as it goes — fast and cheap, but it can't reach back and quote an exact sentence from a thousand pages ago. Full attention is the one that remembers exactly; linear attention is the one that stays cheap.
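To make the two desks concrete, here is a minimal single-head sketch in plain Python. It is not taken from the HydraHead paper: the function and class names are hypothetical, and the linear desk assumes the common elu(x) + 1 feature map from the linear-attention literature. The full-attention desk re-reads every stored key and value for each query, while the linear-attention desk only updates and reads a fixed-size running digest.

```python
import math

def full_attention_step(q, keys, values):
    """One causal softmax-attention read: scans every stored key/value pair."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(d_v)]

def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

class LinearAttentionState:
    """Fixed-size running digest: S accumulates phi(k) v^T, z accumulates phi(k)."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def update(self, k, v):
        """Fold one new token into the digest; cost does not depend on history."""
        fk = phi(k)
        for i, fki in enumerate(fk):
            self.z[i] += fki
            for j, vj in enumerate(v):
                self.S[i][j] += fki * vj

    def read(self, q):
        """Answer a query from the digest alone, without touching past tokens."""
        fq = phi(q)
        denom = sum(a * b for a, b in zip(fq, self.z))
        d_v = len(self.S[0])
        return [sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
                for j in range(d_v)]
```

The work in `full_attention_step` grows with the length of `keys`, which is the archive piling up. `LinearAttentionState` stays at `d_k × d_v` numbers no matter how many tokens have been folded in, which is why it cannot quote one specific earlier sentence exactly.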

Mixing the two is old news — the question is how you assign the desks. The standard hybrid decides by floor: a whole layer is full, or a whole layer is linear, in some fixed ratio like 3 linear floors for every full one. But within any one floor, only a couple of desks are actually doing exact lookups; forcing the entire floor to re-read the archive (or denying all of them the archive) is a blunt instrument. The per-layer choice throws away the fact that heads inside the same layer do different jobs.

HydraHead makes the choice desk by desk. It uses an interpretability-driven rule to spot which heads are retrieval-critical — the ones whose lookups actually matter — and keeps full attention only on those, turning every other head linear. A scale-normalized fusion module then reconciles the two kinds of output inside the layer. Because the decision is per-head, HydraHead can run a far leaner 7:1 linear-to-full ratio and still match the quality of the coarse 3:1 layer-wise split — it simply places its few expensive heads exactly where the lookups happen.
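Why would two kinds of head need reconciling at all? Softmax attention and linear attention can produce outputs with very different magnitudes, so simply concatenating them lets one kind drown out the other. The sketch below shows one simple way to put heads on a common scale: normalize each head's output to unit RMS before concatenation. This is an assumption made for illustration, not the paper's fusion module, and `rms_normalize` and `fuse_heads` are hypothetical names.

```python
import math

def rms_normalize(vec, eps=1e-12):
    """Rescale a head's output vector to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def fuse_heads(head_outputs):
    """Concatenate per-head outputs after bringing each one to the same scale.

    head_outputs: list of per-head vectors, full and linear heads mixed.
    Assumed stand-in for a scale-normalized fusion step, not the paper's design.
    """
    fused = []
    for out in head_outputs:
        fused.extend(rms_normalize(out))
    return fused
```

With this stand-in, a head whose raw output is around 100 and a head whose raw output is around 0.01 both contribute values of magnitude near 1 to the fused vector, so neither dominates the projection that follows.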

[Figure: a grid of attention heads. Heads H1 to H4 run left to right within each layer (parallel / width); Layers 1 to 3 stack bottom to top (sequential / depth).]

The diagram makes the two axes literal: heads run in parallel across the width of a layer, and layers stack in depth. A per-layer hybrid colors whole rows; HydraHead colors individual cells.
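The rows-versus-cells distinction can be sketched as two boolean masks over a layers × heads grid, where `True` marks a full-attention head. Everything here is illustrative: the function names are hypothetical, and the per-head retrieval scores are made-up placeholders standing in for whatever the paper's interpretability-driven rule actually measures.

```python
def per_layer_mask(n_layers, n_heads, linear_to_full):
    """Color whole rows: the last layer of each (linear_to_full + 1)-layer group is full."""
    group = linear_to_full + 1
    mask = []
    for layer in range(n_layers):
        is_full = (layer % group) == group - 1
        mask.append([is_full] * n_heads)
    return mask

def per_head_mask(scores, n_full):
    """Color individual cells: the n_full highest-scoring heads, in any layer, are full.

    scores[layer][head] is a hypothetical retrieval-importance score.
    """
    flat = [(s, layer, head)
            for layer, row in enumerate(scores)
            for head, s in enumerate(row)]
    chosen = sorted(flat, reverse=True)[:n_full]
    mask = [[False] * len(row) for row in scores]
    for _, layer, head in chosen:
        mask[layer][head] = True
    return mask

# A 4-layer, 4-head toy grid with made-up scores.
toy_scores = [
    [0.1, 0.9, 0.2, 0.1],
    [0.0, 0.1, 0.1, 0.2],
    [0.3, 0.1, 0.8, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
rows = per_layer_mask(n_layers=4, n_heads=4, linear_to_full=3)
cells = per_head_mask(toy_scores, n_full=2)
```

In the toy grid, the 3:1 per-layer mask marks one entire layer as full (4 heads), while the 7:1 per-head mask marks only the two highest-scoring heads, which happen to sit in two different layers.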

Where the per-head dial earns its keep

Hold a model fixed and count heads. Take an illustrative 12-layer model with 16 heads per layer, 192 heads in total. A per-layer 3:1 hybrid makes 3 of every 4 layers linear: 9 linear layers and 3 full layers, which is 3 × 16 = 48 full heads, and a full head can never sit inside one of the otherwise-linear layers. HydraHead's per-head 7:1 split makes 7 of every 8 heads linear, which is 192 / 8 = 24 full heads, and each one is free to live in whichever layer the retrieval actually happens. The reported quality is the same, but in this illustrative count the number of expensive full-attention heads is halved (48 → 24): when placement is free, you need far fewer of them. (The 12-layer/16-head sizing is illustrative; the paper reports the 7:1 and 3:1 ratios, not a head count.)
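The head count is easy to check in a few lines. The helper names below are hypothetical, and the 12-layer, 16-head sizing is the same illustrative assumption used in the paragraph, not a configuration from the paper.

```python
def full_heads_per_layer_hybrid(n_layers, heads_per_layer, linear_to_full):
    """Full-attention heads when whole layers are assigned in a linear:full ratio."""
    full_layers = n_layers // (linear_to_full + 1)
    return full_layers * heads_per_layer

def full_heads_per_head_hybrid(n_layers, heads_per_layer, linear_to_full):
    """Full-attention heads when individual heads are assigned in a linear:full ratio."""
    total_heads = n_layers * heads_per_layer
    return total_heads // (linear_to_full + 1)

layerwise = full_heads_per_layer_hybrid(12, 16, linear_to_full=3)  # 3 full layers of 16 heads
headwise = full_heads_per_head_hybrid(12, 16, linear_to_full=7)    # 1 in 8 of 192 heads
```

Changing the sizing changes the absolute counts, but as long as both ratios divide the model evenly, the 7:1 per-head split keeps half as many full heads as the 3:1 per-layer split.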

| Approach | Mixing axis | Full-attention share | What it costs |
| --- | --- | --- | --- |
| Vanilla attention | none | 100% (every head full) | exact recall, but O(n²) per layer |
| Per-layer hybrid | layer | ~25% of layers full (3:1, per HydraHead) | coarse: can't keep a lone full head inside a linear layer |
| HydraHead (per-head) | head | ~12.5% of heads full (7:1, per HydraHead) | fine: full attention only on retrieval-critical heads |

The table isn't a claim that linear attention is second-rate — most heads do just fine on a running digest, which is exactly why a 7:1 ratio works at all. The point is that a few heads genuinely need to reach back into the archive, and the per-layer axis can't isolate them. HydraHead's contribution is moving the full/linear decision onto the head axis, where it can be cheap and surgical at once — and pairing it with a scale-normalized fusion so the two head types coexist in one layer.


Frequently Asked Questions