The news. On June 16, 2026, the startup Subquadratic Inc. published the model card for SubQ 1.1 Small, a language model built on Subquadratic Sparse Attention (SSA), a learned sparse attention mechanism whose compute scales near-linearly with sequence length. The model supports a 12-million-token context with near-perfect needle-in-a-haystack recall at 1M/2M/6M/12M tokens, and posts 99.12% on RULER at 128K, 85.4% on GPQA Diamond, and 89.7% on LiveCodeBench (pass@4). The company reports 64.5× less compute than dense attention and a 56× speedup over FlashAttention-2 at 1M tokens.
Picture a party where every guest is expected to shake hands with every other guest. In a small room that's easy — but the number of handshakes grows with the square of the crowd, so each time the party doubles, the handshaking roughly quadruples. Pack a stadium and the handshakes alone would take longer than the party. That is exactly the bind a transformer is in. Each token is a guest, each handshake is one attention comparison between two tokens, and "everyone greets everyone" is dense attention — the original, exact mechanism. Its cost grows with n², so a context of a million tokens is already expensive and twelve million is a wall.
SubQ's move is to stop making every guest greet everyone. In Subquadratic Sparse Attention, each token attends to only a small, learned set of other tokens — a chosen few, not the whole crowd. Most of the n × n grid of possible handshakes is simply never computed. Because each token's reach stays roughly bounded as the crowd grows, the total work grows about linearly with the sequence length rather than with its square. That is what "subquadratic" means here: the cost curve bends down from a parabola to nearly a straight line, and a straight line is something you can extend to a stadium.
The diagram above is the quadratic cost made literal: with each new token, dense attention adds a full row of comparisons against everything before it, and the shaded triangle (the total work) fills in as roughly n²/2. Subquadratic attention keeps only a thin, learned slice of that triangle. The exact learned-sparsity recipe is not fully spelled out in SubQ's model card, so the precise pattern is best read as "reported, not yet reproducible". The scaling claim is the load-bearing one, though, and it answers the same pressure that motivates FlashAttention's IO-aware design and the whole memory-bandwidth-bound story of long-context inference, even though FlashAttention leaves the n² compute itself unchanged.
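The thin-slice idea can be sketched as attention in which each token scores only a chosen set of positions. This is a minimal illustration, not SubQ's method: the model card does not say how the sparse set is learned, so the hypothetical `recent_neighbors` selector below simply takes the last `k` causal positions as a stand-in. The function names and toy sizes are assumptions made for the example.

```python
import math


def recent_neighbors(i, k):
    """Hypothetical selector: the last k causal positions up to and including i.

    A stand-in only. SubQ's learned selection rule is not published.
    """
    return list(range(max(0, i - k + 1), i + 1))


def sparse_attention(queries, keys, values, select):
    """Single-head attention where token i scores only the keys in select(i).

    Returns (outputs, comparisons), where comparisons counts the
    query-key dot products actually computed.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, comparisons = [], 0
    for i, q in enumerate(queries):
        idx = select(i)
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in idx]
        comparisons += len(idx)
        # Numerically stable softmax over the selected positions only.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out = [0.0] * len(values[0])
        for w, j in zip(weights, idx):
            for d, v in enumerate(values[j]):
                out[d] += (w / total) * v
        outputs.append(out)
    return outputs, comparisons


if __name__ == "__main__":
    n, dim, k = 64, 4, 8
    # Deterministic toy vectors so the example needs no random seed.
    vecs = [[math.sin(0.1 * (i + 1) * (d + 1)) for d in range(dim)] for i in range(n)]
    _, dense_cost = sparse_attention(vecs, vecs, vecs, lambda i: list(range(i + 1)))
    _, sparse_cost = sparse_attention(vecs, vecs, vecs, lambda i: recent_neighbors(i, k))
    print("dense causal comparisons:", dense_cost)
    print("sparse comparisons:", sparse_cost)
```

Passing a selector that returns every causal position recovers dense attention, whose comparison count is the full triangle n(n+1)/2. With a fixed budget `k`, the count is at most n · k, which is the near-linear slope the text describes. A real system would also need the selection step itself to be cheaper than scoring every pair, which this sketch does not address.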
Where the n² wall actually bites
Walk the numbers. Dense attention compares every token against all n tokens, so one layer does on the order of n² comparisons. At n = 1M that is about 10¹², a trillion pairwise comparisons; at n = 12M it is about 1.44 × 10¹⁴. You added 12× the tokens but paid roughly 144× the work, because cost scales with the square. Now hold the per-token reach fixed: give each token a budget of, say, k = 4,096 neighbors (illustrative; SubQ does not publish its sparsity width). The layer now does about n · k comparisons: 1M tokens → ~4.1 × 10⁹, and 12M tokens → ~4.9 × 10¹⁰. Twelve times the tokens, about 12× the work: the square is gone. That linear slope is the shape behind SubQ's reported 64.5× compute cut versus dense attention and 56× speedup over FlashAttention-2 at 1M, and it is why the window can stretch to 12M instead of stalling near 1M.
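The arithmetic above is easy to check directly. The short script below is only a back-of-the-envelope counter: `dense_comparisons` and `sparse_comparisons` are hypothetical helper names, and `K = 4096` is the illustrative budget from the text, not a value published by SubQ.

```python
def dense_comparisons(n):
    """Order-of-magnitude count for one dense layer: every token against all n."""
    return n * n


def sparse_comparisons(n, k):
    """Count when each token attends to a fixed budget of k tokens."""
    return n * k


K = 4096  # illustrative budget from the text, not a published SubQ value

for n in (1_000_000, 12_000_000):
    dense = dense_comparisons(n)
    sparse = sparse_comparisons(n, K)
    print(f"n={n:>10,}  dense={dense:.2e}  sparse={sparse:.2e}")

# How much more work does 12x the tokens cost under each scheme?
dense_growth = dense_comparisons(12_000_000) / dense_comparisons(1_000_000)
sparse_growth = sparse_comparisons(12_000_000, K) / sparse_comparisons(1_000_000, K)
print(f"dense grows {dense_growth:.0f}x, sparse grows {sparse_growth:.0f}x")
```

The growth factors come out as 144× for dense and 12× for the fixed budget. The ratio between the two counts at any single length depends entirely on the assumed `K`, so it should not be read as a reproduction of SubQ's reported 64.5× figure.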
| Approach | Compute scaling | What it changes | Context reached |
|---|---|---|---|
| Dense attention | Θ(n²) (textbook) | Exact: every token attends to every other | Practical limit varies; cost climbs steeply with length (setup-dependent) |
| FlashAttention-2 (exact) | Θ(n²) compute, IO-aware (per FA2 paper) | Reorders the schedule to cut memory traffic; math unchanged | Same n² wall, pushed by constants |
| Block-sparse top-k (e.g. MiniMax MSA) | Sub-dense at fixed n (cuts per-token attention compute, per MSA) | Scores KV blocks, attends to a chosen few per query | ~1M class |
| Subquadratic Sparse Attention (SubQ) | Near-linear in n (per SubQ report) | Learned sparse reach per token; scaling itself bends to ~linear | 12M, near-perfect needle recall (per SubQ report) |
This is a different lever from the recent I/O-optimal approximate attention result, which attacks the bytes moved between SRAM and HBM rather than the compute exponent, and from block-sparse top-k attention, which cuts compute at a fixed context length but is still usually demonstrated around 1M tokens. SubQ's claim is the more aggressive one: bend the scaling itself so the window can grow an order of magnitude. The catch is the usual one for sparse attention — a learned subset can miss a token that mattered, which is why the near-perfect needle-in-a-haystack recall at 12M is the number to watch, not just the speedup.
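The risk of missing a token is easy to see with a deliberately naive selector. The toy below assumes a fixed "most recent k positions" rule (the hypothetical `recent_window`) and asks how many planted needles the final token can still see. It does not model SubQ's learned selection; it only shows the failure mode that a learned selector has to avoid.

```python
def recent_window(query_pos, k):
    """Hypothetical fixed selector: the k most recent positions (inclusive)."""
    return set(range(max(0, query_pos - k + 1), query_pos + 1))


def needle_recall(n, k, needle_positions):
    """Fraction of needles that the final token's selected set still contains."""
    selected = recent_window(n - 1, k)
    hits = sum(1 for pos in needle_positions if pos in selected)
    return hits / len(needle_positions)


if __name__ == "__main__":
    n, k = 100_000, 4_096
    needles = range(0, n, 1_000)  # one needle every 1,000 tokens
    recall = needle_recall(n, k, needles)
    print(f"recall with a fixed recent window: {recall:.2%}")
```

With these toy numbers only the needles inside the last 4,096 positions survive, so recall is a few percent. A content-dependent selector has to keep the budget small while still finding the one distant token that matters, which is what the needle-in-a-haystack results are meant to test.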
Related explainers
- I/O-optimal approximate attention — a complementary lever: cut the bytes moved (near-linear I/O), not the compute exponent
- MiniMax M3 — block-sparse top-k attention — sparse attention that cuts compute at a fixed 1M context rather than bending the scaling
- FlashMemory — lookahead sparse attention — sparsity aimed at the KV cache footprint instead of the compute
- Keye-VL 2.0 — DeepSeek sparse attention for video — the same sparse-attention idea applied to long video