What is head-axis attention hybridization?

It is mixing full attention and linear attention within a single transformer layer by deciding head-by-head, instead of making a whole layer one or the other. Each attention head in a layer learns a different job, and HydraHead keeps exact full attention only on the heads that are retrieval-critical — the ones that need to look up an exact earlier token — while converting the rest to cheap linear attention. A scale-normalized fusion module reconciles the two outputs so they can coexist in one layer.

Why does mixing attention per head beat per layer?

Heads inside the same layer do different jobs, so forcing a whole layer to be all-full or all-linear is a blunt choice. Deciding per head lets the model place its few expensive full-attention heads exactly where lookups happen and make everything else linear. HydraHead reports that a lean 7:1 linear-to-full ratio chosen per head matches the quality of a coarser 3:1 ratio chosen per layer, which is roughly half as many expensive heads at matched quality. Trained on only 15B tokens, it also posts a 69% improvement over baseline at 512K context.

How is HydraHead different from sparse attention like SubQ or MiniMax MSA?

Sparse-attention methods keep one kind of attention but skip most token-to-token comparisons — SubQ bends the scaling toward linear, and MiniMax MSA attends to only a few KV blocks per query. HydraHead instead runs two different attention mechanisms side by side inside a layer: exact full attention on some heads and linear attention on the rest. The lever is the head axis — which heads are full versus linear — not which token pairs are computed, so it is closer to a per-head version of a full/linear hybrid than to block sparsity.

HydraHead fuses full and linear attention per head, not per layer — Head-axis attention hybridization

TL;DR

What is it: The HydraHead paper proposes a hybrid-attention architecture that mixes full attention and linear attention inside a single transformer layer along the head axis, keeping exact full attention only for the heads that matter for retrieval.
Why it’s needed: Attention cost is what caps how long a context can grow, so mixing cheap linear attention with exact full attention is how long-context models stay affordable. Choosing the mix per head rather than per layer is a finer dial — it lets the model spend exact attention only where a head actually needs to look something up.
vs previous: The usual hybrid decides per layer: a whole layer is full or linear, a coarse 3:1 split. HydraHead decides per head, reaching a 7:1 linear-to-full ratio that matches the 3:1 layer-wise hybrid's quality because it can keep a full head exactly where one is needed.

Jargon

Full attention: The original softmax attention: every token compares itself to every other token, so it can recall any earlier token exactly, but its cost grows with the square of the sequence length.
Linear attention: Attention that drops the softmax so past tokens can be summarized in a fixed-size running state, which makes cost grow about linearly with length. Cheap and constant-memory, but lossy: it can't pull back an exact far-away token the way full attention can.
Attention head: One of the parallel lanes inside a layer, each with its own query/key/value projection. A layer runs many heads at once, and they specialize in different jobs.
Retrieval-critical head: A head whose job is finding an exact earlier token. Convert it to linear attention and that lookup degrades, so HydraHead uses an interpretability-driven rule to keep these heads full.
Hybrid attention (layer-wise vs head-wise): Mixing full and linear attention in one model. The knob HydraHead changes is the axis of the mix — which layers are full (the old way) versus which heads within a layer are full (the new way).
Scale-normalized fusion: The small module that reconciles full-attention and linear-attention outputs — they have different scales and distributions — so the two kinds of head can be combined cleanly inside one layer.
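
To make the full-versus-linear contrast in these definitions concrete, here is a minimal single-head sketch in plain Python. It is not HydraHead's implementation: the function names are illustrative, and the elu(x) + 1 feature map is an assumed, commonly used choice for linear attention. The point to notice is that full_attention re-reads every earlier key at each step, while linear_attention only updates a state whose size does not depend on sequence length.

```python
import math


def full_attention(queries, keys, values):
    """Causal softmax attention for one head: token t scores every token <= t."""
    dim_k, dim_v = len(keys[0]), len(values[0])
    scale = 1.0 / math.sqrt(dim_k)
    outputs = []
    for t, q in enumerate(queries):
        # Work grows with t: one score per earlier token.
        scores = [scale * sum(a * b for a, b in zip(q, keys[s])) for s in range(t + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights)) / total
            for j in range(dim_v)
        ])
    return outputs


def feature_map(x):
    """elu(x) + 1, a positive feature map (an assumed choice, not from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Causal linear attention: history lives in a fixed-size state (S, z)."""
    dim_k, dim_v = len(keys[0]), len(values[0])
    S = [[0.0] * dim_v for _ in range(dim_k)]  # running sum of phi(k) * v^T
    z = [0.0] * dim_k                          # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for i in range(dim_k):
            z[i] += phi_k[i]
            for j in range(dim_v):
                S[i][j] += phi_k[i] * v[j]
        phi_q = feature_map(q)
        denom = sum(a * b for a, b in zip(phi_q, z))
        # Work per token is constant: only the state is read, never old tokens.
        outputs.append([
            sum(phi_q[i] * S[i][j] for i in range(dim_k)) / denom
            for j in range(dim_v)
        ])
    return outputs
```

Both functions return one output vector per token. The linear version never stores past keys or values, which is why it is cheap and also why it cannot recover one exact earlier token once it has been blended into the state.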

The news. On June 18, 2026, the HydraHead paper (arXiv:2606.20097) proposed a hybrid-attention architecture that mixes Full Attention and Linear Attention within each layer along the head axis, rather than choosing per layer. An interpretability-driven rule keeps Full Attention only for retrieval-critical heads and converts the rest to Linear Attention, with a scale-normalized fusion module reconciling the two. It reaches a 7:1 linear-to-full ratio that matches the quality of a coarser 3:1 layer-wise hybrid and, trained on just 15B tokens, reports a 69% improvement over baseline at 512K context, approaching Qwen3.5, which has a native 256K context.

Picture a newsroom. Every desk is an attention head, and each desk has a choice about how to keep up with the story so far. A full-attention desk re-reads the entire archive before writing each new line — perfect recall of any earlier fact, but the reading piles up as the archive grows. A linear-attention desk instead keeps a small running digest it updates as it goes — fast and cheap, but it can't reach back and quote an exact sentence from a thousand pages ago. Full attention is the one that remembers exactly; linear attention is the one that stays cheap.

Mixing the two is old news — the question is how you assign the desks. The standard hybrid decides by floor: a whole layer is full, or a whole layer is linear, in some fixed ratio like 3 linear floors for every full one. But within any one floor, only a couple of desks are actually doing exact lookups; forcing the entire floor to re-read the archive (or denying all of them the archive) is a blunt instrument. The per-layer choice throws away the fact that heads inside the same layer do different jobs.

HydraHead makes the choice desk by desk. It uses an interpretability-driven rule to spot which heads are retrieval-critical — the ones whose lookups actually matter — and keeps full attention only on those, turning every other head linear. A scale-normalized fusion module then reconciles the two kinds of output inside the layer. Because the decision is per-head, HydraHead can run a far leaner 7:1 linear-to-full ratio and still match the quality of the coarse 3:1 layer-wise split — it simply places its few expensive heads exactly where the lookups happen.
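
The desk-by-desk flow can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the text above does not spell out the paper's selection rule or its fusion module, so this sketch takes a hypothetical per-head retrieval score as given, keeps the top-scoring heads full, and stands in for scale-normalized fusion with a simple per-head RMS rescale before concatenation. The attention functions are passed in as callables so the sketch stays independent of how each kind of head is computed.

```python
import math


def pick_full_heads(retrieval_scores, num_full):
    """Indices of the num_full highest-scoring heads (hypothetical selection rule)."""
    ranked = sorted(range(len(retrieval_scores)),
                    key=lambda h: retrieval_scores[h], reverse=True)
    return set(ranked[:num_full])


def rms_normalize(vector, eps=1e-6):
    """Rescale one head's output vector to roughly unit root-mean-square."""
    rms = math.sqrt(sum(x * x for x in vector) / len(vector) + eps)
    return [x / rms for x in vector]


def hybrid_layer(head_inputs, retrieval_scores, num_full, full_fn, linear_fn):
    """Run each head as full or linear, normalize its output, and concatenate.

    head_inputs: one (queries, keys, values) tuple per head.
    full_fn / linear_fn: callables mapping (queries, keys, values) to a list
    of per-token output vectors.
    """
    full_heads = pick_full_heads(retrieval_scores, num_full)
    per_head = []
    for h, (q, k, v) in enumerate(head_inputs):
        attend = full_fn if h in full_heads else linear_fn
        # Normalizing per head puts both head types on a comparable scale.
        per_head.append([rms_normalize(out) for out in attend(q, k, v)])
    num_tokens = len(per_head[0])
    fused = [
        [x for head in per_head for x in head[t]]
        for t in range(num_tokens)
    ]
    return fused, full_heads
```

With 8 heads and num_full=1 this reproduces the 7:1 linear-to-full ratio inside a single layer. The real fusion module may be learned and more involved; the rescale here only illustrates why some reconciliation of scales is needed before the two head types share one output.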

The diagram makes the two axes literal: heads run in parallel across the width of a layer, and layers stack in depth. A per-layer hybrid colors whole rows; HydraHead colors individual cells.
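
The rows-versus-cells picture can also be drawn as a small text grid. This is a toy sketch: the 4-layer, 8-head sizing, the placement of the full layer at every fourth position, and the scattered full-head positions are all hypothetical choices for illustration, not a layout reported by the paper.

```python
def per_layer_mask(num_layers, num_heads, linear_per_full):
    """Every (linear_per_full + 1)-th layer is all 'F'; the others are all 'l'."""
    period = linear_per_full + 1
    return [
        ["F" if layer % period == period - 1 else "l"] * num_heads
        for layer in range(num_layers)
    ]


def per_head_mask(num_layers, num_heads, full_positions):
    """Only the listed (layer, head) cells are 'F'; every other cell is 'l'."""
    return [
        ["F" if (layer, head) in full_positions else "l" for head in range(num_heads)]
        for layer in range(num_layers)
    ]


def render(mask):
    """One text row per layer, one character per head."""
    return "\n".join("".join(row) for row in mask)


# Hypothetical sizing: 4 layers of 8 heads each.
layer_mask = per_layer_mask(4, 8, linear_per_full=3)
head_mask = per_head_mask(4, 8, {(0, 2), (1, 5), (2, 0), (3, 7)})

print(render(layer_mask))
print()
print(render(head_mask))
```

The first grid has one solid row of F (8 full heads in a single layer); the second has 4 full heads, one per layer, each placed independently.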

Where the per-head dial earns its keep

Hold a model fixed and count heads. Take an (illustrative) 12-layer model with 16 heads per layer, 192 heads in total. A per-layer 3:1 hybrid makes 3 of every 4 layers linear: 9 linear layers and 3 full layers, which is 48 full heads, and a full head can never sit inside one of the otherwise-linear layers. HydraHead's per-head 7:1 split makes 7 of every 8 heads linear, which is just 24 full heads, and each one is free to live in any layer, wherever a lookup actually happens. The reported quality is the same but, in this illustrative count, the number of expensive full-attention heads is halved (48 → 24), because when placement is free you need far fewer of them. (The 12-layer/16-head sizing is illustrative; the paper reports the 7:1 and 3:1 ratios, not a head count.)
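
The head count above is simple enough to check in a few lines. The function names are made up for this sketch, and the 12-layer, 16-head model is the same illustrative sizing used in the paragraph, not a configuration from the paper.

```python
def full_heads_per_layer_hybrid(num_layers, heads_per_layer, linear_per_full):
    """Per-layer hybrid: one layer in every (linear_per_full + 1) is entirely full."""
    full_layers = num_layers // (linear_per_full + 1)
    return full_layers * heads_per_layer


def full_heads_per_head_hybrid(num_layers, heads_per_layer, linear_per_full):
    """Per-head hybrid: one head in every (linear_per_full + 1) is full, in any layer."""
    total_heads = num_layers * heads_per_layer
    return total_heads // (linear_per_full + 1)


per_layer = full_heads_per_layer_hybrid(12, 16, linear_per_full=3)  # 3 layers * 16 heads
per_head = full_heads_per_head_hybrid(12, 16, linear_per_full=7)    # 192 heads / 8
print(per_layer, per_head)  # 48 24
```

Changing the sizing changes the absolute numbers, but whenever both divisions come out even the 7:1 per-head split uses half as many full heads as the 3:1 per-layer split.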

Approach	Mixing axis	Full-attention share	Trade-off
Vanilla attention	—	100% (every head full)	exact recall, but O(n²) per layer
Per-layer hybrid	layer	~25% of layers full (3:1, per HydraHead)	coarse: can't keep a lone full head inside a linear layer
HydraHead (per-head)	head	~12.5% of heads full (7:1, per HydraHead)	fine: full attention only on retrieval-critical heads

The table isn't a claim that linear attention is second-rate — most heads do just fine on a running digest, which is exactly why a 7:1 ratio works at all. The point is that a few heads genuinely need to reach back into the archive, and the per-layer axis can't isolate them. HydraHead's contribution is moving the full/linear decision onto the head axis, where it can be cheap and surgical at once — and pairing it with a scale-normalized fusion so the two head types coexist in one layer.


Related explainers

Gated DeltaNet-2 — Decoupled erase/write gates — the linear-attention side: how the cheap, fixed-state heads actually work
Parallax — Local-linear attention vs FlashAttention 2/3 — full vs linear attention framed as estimators, a complementary lens on the same trade-off
SubQ 1.1 — Subquadratic sparse attention — a different way to cheapen attention: bend the scaling itself instead of splitting heads
MiniMax M3 — block-sparse top-k attention — sparsity per KV block rather than a per-head full/linear mix
