What is subquadratic sparse attention?

It is attention whose compute grows slower than the square of the sequence length — close to linear — by having each token attend to only a small, learned subset of the other tokens instead of all of them. Standard dense attention is quadratic: every token compares itself to every other, so doubling the context roughly quadruples the work. SubQ 1.1 Small uses a learned sparse pattern (Subquadratic Sparse Attention) so the cost curve bends to near-linear, which is what lets it support a 12-million-token context. The exact learned-sparsity mechanism is not fully documented in the model card, so the scaling claim is the load-bearing part.

Why does it matter for long context?

Because the cost of attention is what caps how long a context window can be. Dense attention's quadratic scaling means a million tokens is already expensive and twelve million is impractical; that is the n² wall. If attention scales near-linearly instead, doubling the context only about doubles the cost, so the window can grow an order of magnitude without the compute exploding. SubQ reports a 12M-token window with near-perfect needle-in-a-haystack recall at 1M, 2M, 6M, and 12M, along with 64.5× less compute than dense attention and a 56× speedup over FlashAttention-2 at 1M tokens.

How is it different from FlashAttention or block-sparse attention?

FlashAttention keeps attention exact — every token still attends to every other — and wins by reorganizing memory traffic; its compute is still quadratic in the sequence length. Block-sparse top-k methods like MiniMax's MSA cut compute by scoring blocks of the cache and attending to only a few per query, but they are usually demonstrated at a fixed ~1M context. Subquadratic Sparse Attention goes after the scaling exponent itself: a learned per-token sparse reach whose total cost grows about linearly, which is what lets the context window stretch to 12M rather than just making 1M cheaper. The trade-off is that any sparse method can miss a token that mattered, so its long-range recall numbers are the ones to scrutinize.

SubQ 1.1 Small hits a 12M-token context — Subquadratic sparse attention

TL;DR

What is it: The news is SubQ 1.1 Small, a model from the startup Subquadratic Inc., built on Subquadratic Sparse Attention — a learned sparse attention whose compute grows near-linearly with context length instead of with its square.
Why it’s needed: Attention is the part of a transformer that lets every token look at the others, and its cost is what caps how long a context window can grow. Bending that cost from quadratic to near-linear is what lets the window stretch to 12 million tokens — whole codebases or books at once.
vs previous: Standard dense attention compares every token to every other token, so doubling the context roughly quadruples the work — the n² wall that strands most models near a million tokens. SubQ has each token attend to only a small learned subset, so the work scales about linearly instead.

Jargon

Quadratic (n²) scaling: Cost that grows with the square of the input length. Double the tokens, and the work roughly quadruples. Dense attention is quadratic because every token compares itself to every other one.
Subquadratic scaling: Any cost that grows slower than n² — here, close to linear (n), so doubling the context only about doubles the work. The whole point of SubQ's attention.
Dense attention: The original transformer attention: an n × n grid where every token attends to every other token. Exact, but its cost is quadratic in the sequence length.
Sparse attention: Attention where each token looks at only a chosen subset of the others, leaving most of the n × n grid empty. SubQ's subset is learned rather than fixed by a hand-written pattern.
Context window: The number of tokens a model can read at once. SubQ 1.1's is 12 million, large enough to hold whole repositories.
RULER: A long-context benchmark that mixes retrieval and reasoning across the whole window. SubQ reports 99.12% on RULER multi-task retrieval at 128K tokens.
Needle-in-a-haystack: A long-context test that hides one fact in a huge document and asks the model to find it. SubQ reports near-perfect recall at 1M, 2M, 6M, and 12M tokens.
FlashAttention-2: The widely deployed exact-attention kernel that is IO-aware but still does quadratic work. SubQ reports being 56× faster than it at 1M tokens.

The news. On June 16, 2026, the startup Subquadratic Inc. published the model card for SubQ 1.1 Small, a language model built on Subquadratic Sparse Attention (SSA), a learned sparse attention whose compute scales near-linearly with sequence length. It supports a 12-million-token context with near-perfect needle-in-a-haystack recall at 1M/2M/6M/12M tokens, and posts 99.12% on RULER at 128K, 85.4% on GPQA Diamond, and 89.7% on LiveCodeBench (pass@4). The company reports 64.5× less compute than dense attention and a 56× speedup over FlashAttention-2 at 1M tokens.

Picture a party where every guest is expected to shake hands with every other guest. In a small room that's easy — but the number of handshakes grows with the square of the crowd, so each time the party doubles, the handshaking roughly quadruples. Pack a stadium and the handshakes alone would take longer than the party. That is exactly the bind a transformer is in. Each token is a guest, each handshake is one attention comparison between two tokens, and "everyone greets everyone" is dense attention — the original, exact mechanism. Its cost grows with n², so a context of a million tokens is already expensive and twelve million is a wall.

SubQ's move is to stop making every guest greet everyone. In Subquadratic Sparse Attention, each token attends to only a small, learned set of other tokens — a chosen few, not the whole crowd. Most of the n × n grid of possible handshakes is simply never computed. Because each token's reach stays roughly bounded as the crowd grows, the total work grows about linearly with the sequence length rather than with its square. That is what "subquadratic" means here: the cost curve bends down from a parabola to nearly a straight line, and a straight line is something you can extend to a stadium.
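To make "each token attends to only a chosen few" concrete, here is a minimal sketch in plain Python. It is not SubQ's mechanism, which the model card does not fully document. The function name `sparse_attention`, the `neighbors` lists, and the `budget` value are all hypothetical, and the selector is a random placeholder standing in for whatever a learned selector would produce.

```python
import math
import random

def sparse_attention(q, k, v, neighbors):
    """Attention where query i only looks at the keys listed in neighbors[i].

    q, k, v: lists of n vectors (lists of floats), all of dimension d.
    neighbors: list of n index lists; a stand-in for a learned selector
    (hypothetical, not SubQ's published mechanism).
    Returns (outputs, number_of_query_key_comparisons).
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    outputs, comparisons = [], 0
    for i, idx in enumerate(neighbors):
        # Score only the selected keys; the rest of row i is never computed.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in idx]
        comparisons += len(idx)
        # Numerically stable softmax over the selected scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out = [0.0] * d
        for w, j in zip(weights, idx):
            for c in range(d):
                out[c] += (w / total) * v[j][c]
        outputs.append(out)
    return outputs, comparisons

def random_vectors(n, d, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

rng = random.Random(0)
n, d, budget = 64, 8, 4  # toy sizes, chosen only for illustration
q, k, v = (random_vectors(n, d, rng) for _ in range(3))

# Placeholder selector: each token keeps itself plus up to budget-1
# earlier tokens picked at random (an assumption, not a learned pattern).
neighbors = [[i] + rng.sample(range(i), min(budget - 1, i)) for i in range(n)]

out, comparisons = sparse_attention(q, k, v, neighbors)
dense_comparisons = n * n
print(comparisons, "sparse comparisons vs", dense_comparisons, "dense")
```

With n = 64 and a budget of 4, the sketch performs at most 64 × 4 = 256 comparisons, against 4,096 for the full grid. The sketch takes the neighbor lists as given; choosing them cheaply is the hard part, since a selector that scores every key for every query would itself be quadratic.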

The diagram above is the quadratic cost made literal: with each new token, dense attention adds a full row of comparisons against everything before it, and the shaded triangle (the total work) fills in as roughly n²/2. Subquadratic attention keeps only a thin, learned slice of that triangle. The exact learned-sparsity recipe is not fully spelled out in SubQ's model card, so the precise pattern is best read as "reported, not yet reproducible". The scaling claim is the load-bearing one, though, and it targets the same long-context cost problem that motivates FlashAttention's IO-aware design and the whole memory-bandwidth-bound story of long-context inference.
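The n²/2 figure for the triangle can be checked with a few lines of Python. This is only a counting sketch under the assumption that token i attends to itself and every earlier token; `causal_comparisons` is a hypothetical helper name.

```python
def causal_comparisons(n):
    """Count query-key comparisons when token i attends to tokens 1..i."""
    return sum(range(1, n + 1))  # row i contributes i comparisons

for n in (1_000, 2_000, 4_000):
    total = causal_comparisons(n)
    # The exact count is n(n+1)/2, so the ratio to n²/2 approaches 1 as n grows.
    print(n, total, round(total / (n * n / 2), 4))
```

The exact total is n(n+1)/2, which differs from n²/2 only by the n/2 diagonal term, so each doubling of n multiplies the triangle by just under 4.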

Where the n² wall actually bites

Walk the numbers. Dense attention compares every token to all n tokens, so one layer does on the order of n² comparisons. At n = 1M that is about 10¹², a trillion pairwise comparisons; at n = 12M it is about 1.44 × 10¹⁴. You added 12× the tokens but paid roughly 144× the work, because cost scales with the square. Now hold each token's reach fixed at a budget of, say, k = 4,096 neighbors (illustrative; SubQ does not publish its sparsity width). The layer now does about n · k comparisons: 1M tokens → ~4 × 10⁹, and 12M tokens → ~4.9 × 10¹⁰. Twelve times the tokens, about 12× the work: the square is gone. That linear slope is the shape behind SubQ's reported 64.5× compute cut versus dense attention and 56× speedup over FlashAttention-2 at 1M, and it is why the window can stretch to 12M instead of stalling near 1M.
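The arithmetic above can be reproduced in a short script. The budget `K = 4096` is the illustrative value from the text, not a published SubQ parameter, and the two helper functions are hypothetical names used only for this sketch.

```python
def dense_comparisons(n):
    # Every token is compared against every token: n * n.
    return n * n

def sparse_comparisons(n, k):
    # Each token is compared against at most k others (never more than n).
    return n * min(k, n)

K = 4096  # illustrative budget from the text, not a published SubQ value

for n in (1_000_000, 12_000_000):
    dense = dense_comparisons(n)
    sparse = sparse_comparisons(n, K)
    print(f"n={n:>10,}  dense={dense:.2e}  sparse={sparse:.2e}  "
          f"dense/sparse={dense / sparse:,.0f}x")
```

Going from 1M to 12M tokens multiplies the dense count by 144 and the sparse count by 12. Note that with this illustrative k the dense-to-sparse ratio at 1M is n / k ≈ 244, which does not match SubQ's reported 64.5×, so the budget here should be read as an example of the scaling shape and not as an estimate of SubQ's actual sparsity width.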

Approach	Compute scaling	What it changes	Context reached
Dense attention	Θ(n²) (textbook)	Exact: every token attends to every other	Practical limit varies; cost climbs steeply with length (setup-dependent)
FlashAttention-2 (exact)	Θ(n²) compute, IO-aware (per FA2 paper)	Reorders the schedule to cut memory traffic; math unchanged	Same n² wall, pushed back by constant factors
Block-sparse top-k (e.g. MiniMax MSA)	Sub-dense at fixed n (cuts per-token attention compute, per MSA)	Scores KV blocks, attends to a chosen few per query	~1M class
Subquadratic Sparse Attention (SubQ)	Near-linear in n (per SubQ report)	Learned sparse reach per token; scaling itself bends to ~linear	12M, near-perfect needle recall (per SubQ report)

This is a different lever from the recent I/O-optimal approximate attention result, which attacks the bytes moved between SRAM and HBM rather than the compute exponent, and from block-sparse top-k attention, which cuts compute at a fixed context length but is still usually demonstrated around 1M tokens. SubQ's claim is the more aggressive one: bend the scaling itself so the window can grow an order of magnitude. The catch is the usual one for sparse attention — a learned subset can miss a token that mattered, which is why the near-perfect needle-in-a-haystack recall at 12M is the number to watch, not just the speedup.


Related explainers

I/O-optimal approximate attention — a complementary lever: cut the bytes moved (near-linear I/O), not the compute exponent
MiniMax M3 — block-sparse top-k attention — sparse attention that cuts compute at a fixed 1M context rather than bending the scaling
FlashMemory — lookahead sparse attention — sparsity aimed at the KV cache footprint instead of the compute
Keye-VL 2.0 — DeepSeek sparse attention for video — the same sparse-attention idea applied to long video


