What is controllable attention sparsity (ConSA)?

ConSA is a 2026 method (arXiv 2606.18056) that learns, under a user-set sparsity budget, which of a model's attention units use full attention and which use a cheaper sliding window. Instead of a hand-coded rule for which layers are sparse, it hangs a trainable binary mask on each unit and learns the allocation jointly with the model — using L0 regularization to make the on/off choice differentiable and an augmented Lagrangian term to hit the exact budget. It can decide at layer or per-KV-head granularity.

Why allocate full vs sliding-window attention per head?

Full attention captures long-range dependencies but its KV cache and compute grow with context length, so at long context the full heads dominate cost. Sliding-window heads stay cheap (KV capped at the window) but can't see far back. Mixing the two lets a model spend its memory budget where long range actually matters — and ConSA finds that deciding per KV-head, rather than per whole layer, gives a better allocation for the same budget.

How is ConSA different from hand-coded hybrid-attention rules?

Hybrid models like sliding-window-plus-global pick which layers are full by a fixed heuristic chosen at design time. ConSA replaces the heuristic with a learned, budgeted optimization: the full-vs-windowed choice is a trainable mask, an L0 penalty counts the full units, and an augmented Lagrangian enforces the exact sparsity target. The authors report learned allocations beat rule-based baselines on 0.6B and 1.7B models, and that the model prefers sliding windows in lower layers with full attention concentrated in a contiguous middle block.

ConSA learns where to put full vs sliding-window attention per head — Controllable attention sparsity

Jargon

Full attention (FA): A head where each token attends to every earlier token. It captures long-range dependencies, but its KV cache and compute grow with the sequence length — the cost that hurts at long context.
Sliding-window attention (SWA): A head where each token attends only to the last W tokens, applied as a banded attention mask. Its KV stays capped at the window size W as the context grows, so it can't blow up the way a full head does — but it can't see far back either.
Sparsity budget / target: A number you set, say "30% of heads may be full." ConSA treats it as a hard constraint and allocates full vs windowed heads to hit it exactly, so you dial the memory/quality trade-off rather than discovering it after the fact.
L0 regularization (binary masks): A way to learn discrete on/off choices with gradients. Each unit gets a binary mask (full = 1, windowed = 0), and an L0 penalty counts the "full" units and pushes that count down. Because a hard on/off switch has no useful gradient, L0 methods typically train a continuous relaxation of the mask, so the model can learn which sparse set of units stays full.
Augmented Lagrangian: An optimization trick that turns "hit exactly this budget" into a trainable penalty. Unlike a soft preference that merely nudges toward fewer full heads, it enforces the precise sparsity target as the model trains.
KV-head-wise vs layer-wise: The granularity of the choice. Layer-wise decides full/windowed for a whole layer at once; KV-head-wise decides it per KV-head — finer-grained, and ConSA reports it works better.
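
To make the granularity distinction concrete, here is a minimal sketch in plain Python. The 4-layer, 4-KV-head shape, the helper names, and both example allocations are illustrative assumptions, not layouts taken from the paper. The point is only that the same budget can be met by a whole-layer choice or by a finer per-head choice.

```python
# Illustrative sketch: layer-wise vs KV-head-wise allocation masks.
# 1 = full attention, 0 = sliding window. Shapes and values are hypothetical.

def expand_layer_mask(layer_mask, heads_per_layer):
    """Layer-wise choice: every KV-head in a layer inherits the layer's bit."""
    return [[bit] * heads_per_layer for bit in layer_mask]

def full_fraction(head_mask):
    """Fraction of KV-heads (across all layers) that use full attention."""
    total = sum(len(row) for row in head_mask)
    return sum(sum(row) for row in head_mask) / total

# Layer-wise: one of four layers is entirely full attention.
layer_wise = expand_layer_mask([0, 1, 0, 0], heads_per_layer=4)

# KV-head-wise: the same number of full heads, spread over the middle layers.
head_wise = [
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

print(full_fraction(layer_wise), full_fraction(head_wise))
```

Both allocations spend the same 25% budget. The head-wise mask simply has more places to put it, which is the extra freedom the KV-head-wise variant optimizes over.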

The news. On June 16, 2026, the ConSA paper (Controllable Sparse Attention, arXiv 2606.18056) proposed learning the assignment of full attention vs sliding-window attention across a model under a user-specified sparsity budget, rather than hand-picking which layers are sparse. Using L0 regularization with binary masks and an augmented Lagrangian constraint, it optimizes the allocation at layer or KV-head granularity. On 0.6B and 1.7B LLMs the authors report that learned allocations beat rule-based baselines, that KV-head-wise beats layer-wise, and that the learned patterns cluster sliding windows in the lower layers and full attention in a contiguous middle block.

Picture a team of proofreaders racing a deadline. Each one can either reread the whole draft from page one — catching anything that depends on something said much earlier — or just glance at the last page and move on. The full rereads are thorough but slow, so the editor can only afford a handful of them. The lazy fix is to make everyone glance, but then the cross-references that span the whole document slip through. The usual real fix is a seating rule — "desks one through four reread everything, the rest glance" — picked once and frozen. ConSA's move is to stop guessing the seating chart and instead learn, from the work itself, which desks actually need the full reread.

In a transformer that team is the set of attention heads, and the choice is full attention versus a sliding window. Full attention lets a head look back over every prior token, which is what captures long-range structure — but its KV cache and compute grow with the context length, so at long context the full heads are exactly what blows the memory budget. Sliding-window attention caps each head at the last W tokens via a banded attention mask, so its KV stays bounded by W however long the context grows — at the price of never seeing far back. Hybrid models mix the two, but which heads get which mode has been a hand-coded heuristic, the same frozen seating chart for every model.
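
The two attention modes differ only in which key positions each query may see. The sketch below builds both masks in plain Python for a tiny sequence. The function names are made up for illustration, and it assumes the window of size W counts the current token itself, a convention that varies between implementations.

```python
# Sketch: causal full-attention mask vs causal sliding-window (banded) mask.
# mask[i][j] is True when query position i may attend to key position j.
# Assumption: a window of size w includes the current token.

def full_causal_mask(n):
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, w):
    return [[(j <= i) and (i - j < w) for j in range(n)] for i in range(n)]

def keys_visible(mask):
    """Number of keys each query can see, i.e. the KV entries it needs."""
    return [sum(row) for row in mask]

n, w = 6, 2
print(keys_visible(full_causal_mask(n)))        # grows with position
print(keys_visible(sliding_window_mask(n, w)))  # capped at w
```

The first list keeps growing with position while the second flattens at `w`, which is the bounded-KV property in miniature.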

ConSA turns that allocation into something the model learns under a budget you set. It hangs a trainable binary switch on each attention unit — full (1) or windowed (0) — and learns those switches jointly with the weights. The trick to making an on/off switch differentiable is L0 regularization, which penalizes the count of "full" units so gradients can push the network toward fewer of them; an augmented Lagrangian term then enforces your exact sparsity target instead of merely preferring it. Crucially, ConSA can make this choice per KV-head, not just per layer — and the per-head version wins. Left to learn, the model doesn't reproduce the textbook rule: it parks windows in the lower layers and a contiguous block of full attention in the middle, a structure the authors found rather than assumed.
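
The two ingredients named above can be sketched in a few lines of plain Python. This is not the paper's implementation. It assumes the hard-concrete relaxation that is common in the L0 regularization literature, with its usual constants, and a textbook augmented Lagrangian with a dual-ascent multiplier update. All function names, the gate parameters, and the 25% budget are hypothetical.

```python
import math
import random

# Hard-concrete gate constants (assumed; common defaults in L0 work).
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_gate(log_alpha, rng):
    """Sample a relaxed mask value in [0, 1]; exact 0s and 1s are possible."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep log() finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))

def prob_full(log_alpha):
    """Probability the gate is non-zero: the differentiable L0 'count'."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))

def budget_violation(log_alphas, budget):
    """Expected fraction of full units minus the target fraction."""
    return sum(prob_full(a) for a in log_alphas) / len(log_alphas) - budget

def augmented_lagrangian(task_loss, violation, lam, rho):
    """Task loss + multiplier term + quadratic penalty on the violation."""
    return task_loss + lam * violation + 0.5 * rho * violation ** 2

def update_multiplier(lam, violation, rho):
    """Dual ascent: raise lam when over budget, lower it when under."""
    return lam + rho * violation

rng = random.Random(0)
log_alphas = [2.0, -1.0, 0.5, -3.0]  # hypothetical per-unit gate parameters
v = budget_violation(log_alphas, budget=0.25)
print(sample_gate(log_alphas[0], rng), v)
print(augmented_lagrangian(1.0, v, lam=0.0, rho=10.0),
      update_multiplier(0.0, v, rho=10.0))
```

In a real training loop the gate parameters would be updated by gradient descent on the augmented loss while the multiplier is updated in the opposite direction. The quadratic term is what distinguishes this from a plain soft penalty: any persistent gap from the budget keeps growing the multiplier until the gap closes.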

What makes this notable is that it reframes "where should attention be sparse?" from a design-time guess into a trained, budgeted decision — and shows a hand-coded heuristic was leaving quality on the table. The serving world is already shipping hybrid full/windowed attention to fit long context in memory; ConSA says the layout of that hybrid is itself a learnable knob, and hands you the dial.

Because a windowed head's KV is capped at its window, the sparsity budget becomes a direct dial on memory: every head you flip from full to windowed buys back cache. Walk through one layer to see it. Say it has 8 KV-heads over a 32,768-token context, and your budget allows 3 full / 5 windowed with a window of W = 512. Each windowed head caps its KV at the last 512 tokens instead of all 32,768, a 64× smaller KV footprint per head (32768 / 512). So the layer's KV cost falls from 8 full heads to the equivalent of 3 + 5 × (512 / 32768) ≈ 3.08 full heads, roughly 2.6× less KV for that layer (illustrative numbers; the paper sets the budget per model, not these exact values).
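
The same arithmetic as a small, reusable calculation. The function name is hypothetical and the inputs are the illustrative numbers from the paragraph above, not values from the paper. It counts cached token positions per layer and ignores head dimension and dtype, which cancel out of the ratio.

```python
def layer_kv_tokens(n_full, n_windowed, context_len, window):
    """Token positions cached in one layer, summed over its KV-heads."""
    per_windowed_head = min(window, context_len)  # a window never exceeds the context
    return n_full * context_len + n_windowed * per_windowed_head

context_len, window = 32_768, 512
all_full = layer_kv_tokens(8, 0, context_len, window)
mixed = layer_kv_tokens(3, 5, context_len, window)

print(mixed / context_len)  # 3.078125 equivalent full heads
print(all_full / mixed)     # about 2.6x less KV for the layer
```

Changing `n_full` shows how directly the budget maps to memory: at long context the windowed heads contribute almost nothing, so the layer's KV cost is close to the number of full heads times the context length.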

Approach	Each head looks back over…	KV cost per head	Who decides full vs windowed
Full attention everywhere	Every earlier token	Grows with context length	Nobody — all heads are full
Sliding-window everywhere	Only the last `W`	Capped at `W`	Nobody — all heads are windowed
Hand-coded hybrid (e.g. "every 4th layer global")	Mixed, by a fixed rule	Mixed	A frozen heuristic, same for every model
ConSA (this paper)	Mixed, by a learned allocation	Mixed, to hit your budget exactly	Learned per KV-head under an L0 + augmented-Lagrangian budget (reported: beats rule-based on 0.6B & 1.7B)

The honest caveats: the published results are on small models (0.6B and 1.7B), so whether the learned layout transfers to frontier-scale models is open. The win over heuristics is reported only qualitatively in the summary, so check the paper for the exact perplexity-vs-budget curves. Learning the masks also adds training-time machinery (the L0 penalty and the Lagrangian schedule) that a fixed rule doesn't need, and the payoff has to clear that cost. But the core lesson is durable: once "which heads are sparse" is a number you can train against a budget, it stops being folklore, and the model's own preferred layout, windows-low and full-attention-mid, is a hint worth keeping even if you never run the optimizer.


Related explainers

SubQ 1.1 — subquadratic sparse attention — makes every head sparse uniformly to reach a 12M-token context; ConSA instead chooses full vs windowed per head under a budget.
MiniMax-M3 MSA — block-sparse attention — a fixed block-sparse pattern; ConSA learns the allocation rather than fixing the pattern up front.
Tangram — per-head KV budgets — allocates how much KV each head keeps; ConSA allocates the attention type (full vs window) per head — a complementary axis.
Keye-VL 2.0 — DeepSeek sparse attention for video — applies a sparse-attention scheme to long video; the same "don't attend to everything" pressure ConSA budgets across heads.

