The news. On June 16, 2026, the ConSA paper (Controllable Sparse Attention, arXiv 2606.18056) proposed learning the assignment of full attention vs sliding-window attention across a model under a user-specified sparsity budget, rather than hand-picking which layers are sparse. Using L0 regularization with binary masks and an augmented Lagrangian constraint, it optimizes the allocation at layer or KV-head granularity. On 0.6B and 1.7B LLMs the authors report that learned allocations beat rule-based baselines, that KV-head-wise beats layer-wise, and that the learned patterns cluster sliding windows in the lower layers and full attention in a contiguous middle block.
Picture a team of proofreaders racing a deadline. Each one can either reread the whole draft from page one — catching anything that depends on something said much earlier — or just glance at the last page and move on. The full rereads are thorough but slow, so the editor can only afford a handful of them. The lazy fix is to make everyone glance, but then the cross-references that span the whole document slip through. The usual real fix is a seating rule — "desks one through four reread everything, the rest glance" — picked once and frozen. ConSA's move is to stop guessing the seating chart and instead learn, from the work itself, which desks actually need the full reread.
In a transformer that team is the set of attention heads, and the choice is full attention versus a sliding window. Full attention lets a head look back over every prior token, which is what captures long-range structure — but its KV cache and compute grow with the context length, so at long context the full heads are exactly what blows the memory budget. Sliding-window attention caps each head at the last W tokens via a banded attention mask, so its KV stays bounded by W however long the context grows — at the price of never seeing far back. Hybrid models mix the two, but which heads get which mode has been a hand-coded heuristic, the same frozen seating chart for every model.
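To make the two modes concrete, here is a minimal sketch of the masks involved, in plain Python with no framework. The function names are hypothetical, and the convention that a window of W includes the current token is an assumption; real implementations differ on that detail.

```python
def causal_mask(n):
    """Full causal attention: query i may attend to every key j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Banded causal mask: query i may attend only to keys j with i - window < j <= i."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def keys_visible(mask):
    """Number of keys each query position is allowed to attend to."""
    return [sum(row) for row in mask]


full = causal_mask(6)
banded = sliding_window_mask(6, window=3)
print(keys_visible(full))    # [1, 2, 3, 4, 5, 6]
print(keys_visible(banded))  # [1, 2, 3, 3, 3, 3]
```

The full mask's row sums keep growing with position, while the banded mask's row sums flatten out at the window size. That flat line is why a windowed head's KV stays bounded.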
ConSA turns that allocation into something the model learns under a budget you set. It hangs a trainable binary switch on each attention unit, full (1) or windowed (0), and learns those switches jointly with the weights. A hard on/off switch has no useful gradient, so ConSA leans on L0 regularization: the binary masks are trained so that a penalty on the count of "full" units can push the network toward fewer of them, and an augmented Lagrangian term then enforces your exact sparsity target instead of merely preferring it. Crucially, ConSA can make this choice per KV-head, not just per layer, and the per-head version wins. Left to learn, the model doesn't reproduce the textbook rule: it parks windows in the lower layers and a contiguous block of full attention in the middle, a structure the authors found rather than assumed.
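The two ingredients can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: it borrows the hard-concrete gate from the standard L0-regularization recipe (Louizos, Welling and Kingma, 2018), and whether ConSA uses that exact relaxation or these constants is an assumption. All function names and the example values are hypothetical.

```python
import math
import random

# Stretch limits and temperature from the hard-concrete recipe;
# assumed here, not taken from the ConSA paper.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Draw one relaxed gate in [0, 1]; 1 means full attention, 0 means windowed."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep both logs finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))  # stretch, then clip


def prob_full(log_alpha):
    """Probability that the gate is nonzero: the differentiable stand-in for L0."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def budget_penalty(log_alphas, target_full, lam, mu):
    """Augmented-Lagrangian term for the constraint E[#full units] == target_full."""
    violation = sum(prob_full(a) for a in log_alphas) - target_full
    return lam * violation + 0.5 * mu * violation ** 2, violation


def update_multiplier(lam, mu, violation):
    """Dual ascent: raise lam while the budget is exceeded, lower it when under."""
    return lam + mu * violation


# Hypothetical gate logits for 4 KV-heads and a budget of 2 full heads.
log_alphas = [2.0, 0.0, -2.0, 0.5]
penalty, violation = budget_penalty(log_alphas, target_full=2.0, lam=0.0, mu=1.0)
lam = update_multiplier(0.0, 1.0, violation)
gates = [sample_gate(a, random.Random(0)) for a in log_alphas]
```

In a training loop under this sketch, `penalty` would be added to the task loss and minimized with respect to the logits, while `lam` is updated between steps. The linear term alone only expresses a preference for sparsity. The quadratic term plus the multiplier update is what drives the violation toward zero, which is the difference between preferring a budget and enforcing it.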
What makes this notable is that it reframes "where should attention be sparse?" from a design-time guess into a trained, budgeted decision — and shows a hand-coded heuristic was leaving quality on the table. The serving world is already shipping hybrid full/windowed attention to fit long context in memory; ConSA says the layout of that hybrid is itself a learnable knob, and hands you the dial.
Because a windowed head's KV is capped at its window, the sparsity budget becomes a direct dial on memory: every head you flip from full to windowed buys back cache. Walk one layer to see it. Say it has 8 KV-heads over a 32,768-token context, and your budget allows 3 full / 5 windowed with a window of W = 512. Each windowed head caps its KV at the last 512 tokens instead of all 32,768, a 64× smaller KV footprint per head at full context (32768 / 512 = 64). So the layer's KV cost falls from 8 full heads to the equivalent of 3 + 5 × (512 / 32768) ≈ 3.08 full heads, roughly 2.6× less KV for that layer (illustrative numbers; the paper sets the budget per model, not these exact values).
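The same arithmetic as a small helper, so other budgets can be tried. The function name and parameters are hypothetical, and the inputs are the illustrative numbers from the walk-through, not values from the paper.

```python
def effective_full_heads(n_full, n_windowed, window, context_len):
    """KV cost of one layer, measured in units of one full-attention head."""
    # A windowed head keeps at most `window` tokens, and never more than the context.
    windowed_fraction = min(window, context_len) / context_len
    return n_full + n_windowed * windowed_fraction


cost = effective_full_heads(n_full=3, n_windowed=5, window=512, context_len=32768)
print(round(cost, 2))      # 3.08
print(round(8 / cost, 1))  # 2.6
```

Because the windowed fraction shrinks as the context grows, the saving from each flipped head gets larger at longer context lengths, while at contexts no longer than the window it disappears.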
| Approach | Each head looks back over… | KV cost per head | Who decides full vs windowed |
|---|---|---|---|
| Full attention everywhere | Every earlier token | Grows with context length | Nobody — all heads are full |
| Sliding-window everywhere | Only the last W | Capped at W | Nobody — all heads are windowed |
| Hand-coded hybrid (e.g. "every 4th layer global") | Mixed, by a fixed rule | Mixed | A frozen heuristic, same for every model |
| ConSA (this paper) | Mixed, by a learned allocation | Mixed, to hit your budget exactly | Learned per KV-head under an L0 + augmented-Lagrangian budget (reported: beats rule-based on 0.6B & 1.7B) |
The honest caveats: the published results are on small models (0.6B and 1.7B), so whether the learned layout transfers to frontier-scale models is open, and the win over heuristics is reported qualitatively in the summary — check the paper for the exact perplexity-vs-budget curves. And learning the masks adds training-time machinery (the L0 penalty and the Lagrangian schedule) that a fixed rule doesn't need; the payoff has to clear that cost. But the core lesson is durable: once "which heads are sparse" is a number you can train against a budget, it stops being folklore — and the model's own preferred layout, windows-low and full-attention-mid, is a hint worth keeping even if you never run the optimizer.
Related explainers
- SubQ 1.1 — subquadratic sparse attention — makes every head sparse uniformly to reach a 12M-token context; ConSA instead chooses full vs windowed per head under a budget.
- MiniMax-M3 MSA — block-sparse attention — a fixed block-sparse pattern; ConSA learns the allocation rather than fixing the pattern up front.
- Tangram — per-head KV budgets — allocates how much KV each head keeps; ConSA allocates the attention type (full vs window) per head — a complementary axis.
- Keye-VL 2.0 — DeepSeek sparse attention for video — applies a sparse-attention scheme to long video; the same "don't attend to everything" pressure ConSA budgets across heads.