I/O-optimal approximate attention — near-linear vs n²
GPUThe news. On May 22, 2026, a new research paper derived I/O-aware variants of approximate attention whose data movement between SRAM and HBM scales almost-linearly in sequence length n in most realistic parameter regimes. The work builds on the 2023 Alman–Song approximate-attention framework, which gave controlled-error approximations of the attention matrix but did not analyze the I/O cost on hierarchical memory. The new contribution is twofold: an algorithm whose I/O is near-linear in n, and matching lower bounds proving no algorithm in the same regime can do strictly better.
Picture a librarian comparing books. Two shelves, hundreds of titles each. The librarian's desk is small — three books fit on it at once. To produce an exact pairwise similarity matrix, the librarian must pull every pair to the desk, compare them, and put them back: every trip is a tiny piece of work, but the number of trips grows with the square of the number of books. That is the loop FlashAttention runs on a GPU. The "shelves" are HBM — large, slow, where the full Q, K, V tensors live. The "desk" is SRAM — tiny, fast, where the actual softmax happens. FlashAttention is already IO-aware: it tiles the n × n matrix into SRAM-sized blocks and processes each block exactly once, which is the best you can do if you insist on reading the entire matrix. The librarian still has to fetch every pair.
The new paper asks: what if the librarian doesn't have to read every page? Approximate attention swaps the exact softmax for a controlled-error stand-in — bounded to within an explicit error parameter ε in the analysis, cheap enough that the librarian only needs to inspect a structured sample of pairs. The paper makes a precise complexity claim; what an ε-close attention matrix means for the loss curve of any specific model is an empirical question this work doesn't open. The compute story for approximate attention was already known: Alman and Song's 2023 framework showed you can drive flop count below n² with bounded error. What was missing was the I/O story — bytes moved between SRAM and HBM is what wall-clock long-context attention actually pays for, and a flop-cheap algorithm that shuffles bytes the same way as the exact one saves nothing in practice.
This paper closes that gap. The authors derive algorithms whose I/O between SRAM and HBM scales as roughly n · log n instead of n², in most realistic parameter regimes — and then prove matching lower bounds, showing no algorithm in the same regime can shave more without strengthening the problem assumptions. That second half is what makes the result load-bearing: it is not "we found a faster algorithm and there might be a better one." It is "this is near-optimal, and the constant factors are all that's left."
Where the I/O ratio actually shows up
Hold three variables fixed. Head dimension d = 128. Per-element size 4 bytes (fp32 partial sums during the softmax accumulation). SRAM size M = 220 KB (an H100 SM's shared-memory budget). Now sweep sequence length n. FlashAttention's I/O scales as roughly 4 · n² · d bytes — every pair of tokens contributes a tile read. The new algorithm's I/O scales as roughly 4 · n · log₂(n) · d bytes — only a sampled set of tile reads. At a small n = 8K the ratio is already ~600× (illustrative — actual ratios depend on the specific approximation parameters); at n = 128K it crosses ~8,000×; at n = 2M it is on the order of ~98,000× (illustrative). The compute story scales the same way — quadratic gives way to near-linear — but the I/O story is the one that hits wall-clock first, because long-context attention is memory-bandwidth-bound, not compute-bound. The hero animation above plots both curves on a log-log axis as n sweeps; the dot on the pink curve climbs above the chart frame around n = 128K, while the indigo dot barely moves.
| Property | FlashAttention (exact) | This paper (approximate) |
|---|---|---|
| Output | Exact softmax-attention matrix | Approximate matrix within controlled error ε |
| Compute | Θ(n² · d) ~setup-dependent, illustrative | Roughly Õ(n · d), per the Alman–Song line ~setup-dependent, illustrative |
| I/O between SRAM and HBM | Θ(n² · d / √M) ~from FlashAttention paper | Near-linear in n in most realistic parameter regimes ~per this paper |
| Lower bound | Matches the upper bound — proven IO-optimal for exact attention | Matching I/O lower bound proven — near-optimal among approximate algorithms in the same regime |
| When it pays off | Any sequence length — the canonical kernel ~setup-dependent | Long contexts where bandwidth dominates and approximation error is acceptable ~setup-dependent, illustrative |
| Drop-in for production? | Yes — widely deployed since 2022 | Not yet — theoretical algorithm; production kernels are the next step ~per this paper |
This is the natural extension of the I/O-complexity story that motivates FlashAttention in the first place. FlashAttention's contribution was to make exact attention IO-optimal — every byte that has to move actually moves only once. This paper steps up one rung: if you are willing to accept approximation, the byte budget itself shrinks from n² to near-linear, and that shrinkage is provably the best you can do.
It is a different lever from PagedAttention, which manages where the K/V cache lives across requests but does not change the per-step attention I/O. It is also different from KV-cache quantization and prefix caching, which shrink the bytes that get moved per token but leave the n² scaling intact. This paper attacks the exponent of n directly — a structurally different win that compounds with all of the above at long context.
Goes deeper in: GPU & CUDA → Operator Fusion & FlashAttention → I/O Complexity
Related explainers
- vLLM v0.20 — FlashAttention 4 packing — exact attention with packing to remove padding waste; still n² in I/O
- vLLM v0.20 — TurboQuant 2-bit KV — orthogonal lever: shrink per-token KV bytes, not the matrix size
- AOIAYN — persistent KV cache across queries — yet another orthogonal lever: amortize the prefill so attention only runs once per token
- SGLang v0.5.12 — MLA and TokenSpeed — production-side throughput wins on the exact-attention side
- PSD — parallel speculative decoding for diffusion LLMs — different inference-shape win on a different model class