The news. On June 16, 2026, the ConSA paper (Controllable Sparse Attention, arXiv 2606.18056) proposed learning the assignment of full attention vs sliding-window attention across a model under a user-specified sparsity budget, rather than hand-picking which layers are sparse. Using L0 regularization with binary masks and an augmented Lagrangian constraint, it optimizes the allocation at layer or KV-head granularity. On 0.6B and 1.7B LLMs the authors report that learned allocations beat rule-based baselines, that KV-head-wise beats layer-wise, and that the learned patterns cluster sliding windows in the lower layers and full attention in a contiguous middle block.

Picture a team of proofreaders racing a deadline. Each one can either reread the whole draft from page one — catching anything that depends on something said much earlier — or just glance at the last page and move on. The full rereads are thorough but slow, so the editor can only afford a handful of them. The lazy fix is to make everyone glance, but then the cross-references that span the whole document slip through. The usual real fix is a seating rule — "desks one through four reread everything, the rest glance" — picked once and frozen. ConSA's move is to stop guessing the seating chart and instead learn, from the work itself, which desks actually need the full reread.

In a transformer that team is the set of attention heads, and the choice is full attention versus a sliding window. Full attention lets a head look back over every prior token, which is what captures long-range structure — but its KV cache and compute grow with the context length, so at long context the full heads are exactly what blows the memory budget. Sliding-window attention caps each head at the last W tokens via a banded attention mask, so its KV stays bounded by W however long the context grows — at the price of never seeing far back. Hybrid models mix the two, but which heads get which mode has been a hand-coded heuristic, the same frozen seating chart for every model.
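To make the two modes concrete, here is a minimal sketch of the masks involved, in plain Python. The function names are illustrative, and the window is assumed to include the current token (conventions differ between implementations).

```python
def causal_mask(seq_len):
    """Full causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Banded causal mask: query i attends only to keys i-window+1 .. i.

    Assumption: the window counts the current token as one of its W slots.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def render(mask):
    """Draw a mask as text: 'x' = allowed, '.' = masked out."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print(render(causal_mask(6)))
    print()
    print(render(sliding_window_mask(6, 3)))
```

In the full mask, the number of visible keys per row keeps growing with position. In the banded mask it stops at `window`, which is why a windowed head only ever needs the last W keys and values in its cache.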

ConSA turns that allocation into something the model learns under a budget you set. It hangs a trainable binary switch on each attention unit, full (1) or windowed (0), and learns those switches jointly with the weights. A hard on/off switch has no useful gradient, which is where L0 regularization comes in: it penalizes the count of "full" units, and in its standard formulation each switch is relaxed into a continuous gate during training so that gradients can push the network toward fewer of them. An augmented Lagrangian term then enforces your exact sparsity target instead of merely preferring it. Crucially, ConSA can make this choice per KV-head, not just per layer, and the per-head version wins. Left to learn, the model doesn't reproduce the textbook rule: it parks windows in the lower layers and a contiguous block of full attention in the middle, a structure the authors found rather than assumed.
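The mechanics can be sketched on a toy problem. The code below is not the paper's implementation. It assumes the widely used hard-concrete form of L0 gates, replaces the real task loss with made-up per-head `utility` scores, and uses a simple primal-dual update (a small multiplier step every iteration) in place of whatever schedule ConSA uses. All names and constants are hypothetical.

```python
import math

# Hard-concrete constants from the standard L0 formulation (an assumption
# here, not values taken from the ConSA paper).
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1
SHIFT = BETA * math.log(-GAMMA / ZETA)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_full(log_alpha):
    """Probability that a relaxed gate is non-zero, i.e. the unit stays full."""
    return sigmoid(log_alpha - SHIFT)


def train_gates(utility, budget, steps=2000, lr=0.5, mu=2.0, dual_lr=0.05):
    """Gradient descent on gate logits, gradient ascent on the multiplier."""
    log_alpha = [0.0] * len(utility)
    lam = 0.0
    for _ in range(steps):
        probs = [prob_full(a) for a in log_alpha]
        g = sum(probs) - budget  # violation: expected full units minus budget
        # Toy loss = -sum(u_i * p_i) + lam * g + (mu / 2) * g**2, so
        # dLoss/dp_i = -u_i + lam + mu * g and dp_i/dlog_alpha_i = p_i * (1 - p_i).
        for i, p in enumerate(probs):
            log_alpha[i] -= lr * (-utility[i] + lam + mu * g) * p * (1.0 - p)
        lam += dual_lr * g  # multiplier grows while the budget is exceeded
    return log_alpha, lam


def hard_allocation(log_alpha, budget):
    """Round to a binary layout: the `budget` most-open gates become full (1)."""
    ranked = sorted(range(len(log_alpha)), key=lambda i: log_alpha[i], reverse=True)
    full = set(ranked[:budget])
    return [1 if i in full else 0 for i in range(len(log_alpha))]


if __name__ == "__main__":
    # Made-up scores standing in for "how much the task loss wants this head full".
    utility = [0.9, 0.1, 0.8, 0.2, 0.7, 0.1, 0.3, 0.2]
    log_alpha, lam = train_gates(utility, budget=3)
    print(hard_allocation(log_alpha, 3))
    print(sum(prob_full(a) for a in log_alpha))
```

The point of the sketch is the division of labour. The utility term wants every gate open, the quadratic penalty pulls the expected count back toward the budget, and the multiplier `lam` keeps adjusting until the constraint is met rather than merely encouraged.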

What makes this notable is that it reframes "where should attention be sparse?" from a design-time guess into a trained, budgeted decision — and shows a hand-coded heuristic was leaving quality on the table. The serving world is already shipping hybrid full/windowed attention to fit long context in memory; ConSA says the layout of that hybrid is itself a learnable knob, and hands you the dial.

Because a windowed head's KV is capped at its window, the sparsity budget becomes a direct dial on memory — every head you flip from full to windowed buys back cache. Walk one layer to see it. Say it has 8 KV-heads over a 32,768-token context, and your budget allows 3 full / 5 windowed with a window of W = 512. Each windowed head caps its KV at the last 512 tokens instead of all 32,768 — about a 64× smaller KV footprint per head (32768 / 512). So the layer's KV cost falls from 8 full heads to the equivalent of 3 + 5 × (512 / 32768) ≈ 3.08 full heads — roughly 2.6× less KV for that layer (illustrative numbers; the paper sets the budget per model, not these exact values).
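The same arithmetic as a small sketch, using the illustrative numbers above. The function name is made up for this example, and the KV cost is measured in units of "one full head at this context length".

```python
def kv_in_full_head_units(n_full, n_windowed, window, context_len):
    """KV footprint of one layer, in units of 'one full head at this context'."""
    # A windowed head stores at most `window` tokens, never more than the context.
    per_windowed = min(window, context_len) / context_len
    return n_full + n_windowed * per_windowed


if __name__ == "__main__":
    context_len, window = 32_768, 512
    hybrid = kv_in_full_head_units(3, 5, window, context_len)
    all_full = kv_in_full_head_units(8, 0, window, context_len)
    print(context_len // window)        # 64   (per-head shrink factor)
    print(round(hybrid, 2))             # 3.08 (hybrid layer, in full-head units)
    print(round(all_full / hybrid, 1))  # 2.6  (layer-level KV reduction)
```

Because the windowed term shrinks as the context grows, the hybrid layer's cost approaches that of its full heads alone at long context, which is why the count of full heads is the number the budget really controls.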

| Approach | Each head looks back over… | KV cost per head | Who decides full vs windowed |
| --- | --- | --- | --- |
| Full attention everywhere | Every earlier token | Grows with context length | Nobody: all heads are full |
| Sliding-window everywhere | Only the last W | Capped at W | Nobody: all heads are windowed |
| Hand-coded hybrid (e.g. "every 4th layer global") | Mixed, by a fixed rule | Mixed | A frozen heuristic, same for every model |
| ConSA (this paper) | Mixed, by a learned allocation | Mixed, to hit your budget exactly | Learned per KV-head under an L0 + augmented-Lagrangian budget (reported: beats rule-based on 0.6B & 1.7B) |

The honest caveats: the published results are on small models (0.6B and 1.7B), so whether the learned layout transfers to frontier-scale models is open, and the win over heuristics is reported qualitatively in the summary — check the paper for the exact perplexity-vs-budget curves. And learning the masks adds training-time machinery (the L0 penalty and the Lagrangian schedule) that a fixed rule doesn't need; the payoff has to clear that cost. But the core lesson is durable: once "which heads are sparse" is a number you can train against a budget, it stops being folklore — and the model's own preferred layout, windows-low and full-attention-mid, is a hint worth keeping even if you never run the optimizer.
