What is I/O-optimal approximate attention?

A class of approximate-attention algorithms whose data movement between fast on-chip SRAM and slow off-chip HBM scales near-linearly in sequence length n, instead of FlashAttention's quadratic n². The paper builds on Alman and Song's 2023 controlled-error approximation framework and adds the IO-aware analysis on top of it. It also proves a matching I/O lower bound — no algorithm in the same regime can shave more without changing the problem assumptions — which makes the result near-optimal up to constants. The work is theoretical for now; production kernels in the FlashAttention spirit are the natural next step.

How does this compare to FlashAttention 4 or other recent attention work?

FlashAttention 4 and the FlashAttention family in general keep the algorithm exact — every tile of the n by n attention matrix is read at least once — and squeeze constants by reorganizing the schedule (packing, masking, hardware-specific tricks). Its I/O is still quadratic in n. Approximate attention attacks the exponent directly: instead of reading every tile, sample a structured n log n tiles, accept a controlled error, and reduce the I/O budget itself. The two approaches are complementary — an IO-optimal approximate kernel implemented in the FlashAttention style would inherit the constant-factor wins on top of the asymptotic wins. PagedAttention, KV quantization, and prefix caching change different parts of the equation and combine with this paper rather than competing with it.

I/O-optimal approximate attention — Near-linear I/O vs FlashAttention

Q: Why does I/O matter more than flop count for long-context attention?

Because long-context attention on modern GPUs is bandwidth-bound, not compute-bound. The tensor cores can multiply matrices fast enough that the bottleneck shifts to how fast the kernel can stream Q, K, V tensors between HBM and SRAM. A flop-cheap algorithm that still moves bytes the same way as the exact one saves nothing in wall-clock terms. This is exactly why FlashAttention itself was a watershed — it did not change the asymptotic compute, it changed the I/O. The new paper is the next rung on the same ladder: trade exactness for an even cheaper I/O budget, and prove the new budget is near-optimal.

I/O-optimal approximate attention — near-linear vs n²

GPU

learnaivisually.com/ai-explained/io-optimal-approx-attention-near-linear-io

TL;DR

What is it: A new paper derives approximate-attention algorithms whose I/O — the bytes moved between fast on-chip SRAM and slow HBM — scales near-linearly in sequence length n, with matching I/O lower bounds proving the result is near-optimal up to constants.
Why it’s needed: Long-context inference and training spend the bulk of attention wall-clock time moving bytes, not computing flops — the I/O complexity of an attention kernel is what determines its cost at large n. Cutting I/O from n² to near-linear changes which sequence lengths are economically reachable.
vs previous: FlashAttention reads every tile of the n × n attention matrix between SRAM and HBM — quadratic data movement baked into its IO-aware design. The new algorithm sacrifices exact softmax for a controlled-error approximation and reads roughly n · log n tiles instead, and the paper proves no algorithm in the same regime can do strictly better.

Jargon

I/O complexity: The total bytes moved between fast on-chip memory and slow off-chip memory while running an algorithm. On a GPU this is SRAM ↔ HBM traffic. At long sequence lengths attention is bandwidth-bound, so I/O complexity — not flop count — determines wall-clock cost. Full explainer →
HBM: High-Bandwidth Memory — the large off-chip GPU memory (≈80 GB on H100/B200) where Q, K, V tensors live. Big, but slow relative to on-chip SRAM. Full explainer →
SRAM: The tiny on-chip cache (≈220 KB per SM on H100) where a kernel actually does its arithmetic. Roughly 6× the bandwidth of HBM but 360,000× smaller — every byte that goes through the kernel must first land here. Full explainer →
Approximate attention: A family of attention algorithms that computes the attention matrix up to a controlled error ε instead of exactly. Sidesteps the strict softmax by using a polynomial or sketch-based approximation that admits much cheaper data layouts.
Alman–Song framework: The 2023 theory paper that established the algorithmic toolkit for approximating attention with controlled error and analyzed its compute complexity. This paper inherits that toolkit and adds the missing I/O-aware analysis on top of it.
I/O lower bound: A mathematical proof that no algorithm in a given class can move fewer bytes than a stated bound. A matching upper-and-lower bound means the algorithm is optimal up to constants — you cannot shave more without changing the problem.
FlashAttention: The canonical exact-attention kernel that pioneered IO-aware design — tiles the n × n matrix into SRAM-sized blocks and processes each block once. Quadratic I/O is baked in because every tile of the matrix is read. Full explainer →

The news. On May 22, 2026, a new research paper derived I/O-aware variants of approximate attention whose data movement between SRAM and HBM scales almost-linearly in sequence length n in most realistic parameter regimes. The work builds on the 2023 Alman–Song approximate-attention framework, which gave controlled-error approximations of the attention matrix but did not analyze the I/O cost on hierarchical memory. The new contribution is twofold: an algorithm whose I/O is near-linear in n, and matching lower bounds proving no algorithm in the same regime can do strictly better.

Picture a librarian comparing books. Two shelves, hundreds of titles each. The librarian's desk is small — three books fit on it at once. To produce an exact pairwise similarity matrix, the librarian must pull every pair to the desk, compare them, and put them back: every trip is a tiny piece of work, but the number of trips grows with the square of the number of books. That is the loop FlashAttention runs on a GPU. The "shelves" are HBM — large, slow, where the full Q, K, V tensors live. The "desk" is SRAM — tiny, fast, where the actual softmax happens. FlashAttention is already IO-aware: it tiles the n × n matrix into SRAM-sized blocks and processes each block exactly once, which is the best you can do if you insist on reading the entire matrix. The librarian still has to fetch every pair.

The new paper asks: what if the librarian doesn't have to read every page? Approximate attention swaps the exact softmax for a controlled-error stand-in — bounded to within an explicit error parameter ε in the analysis, cheap enough that the librarian only needs to inspect a structured sample of pairs. The paper makes a precise complexity claim; what an ε-close attention matrix means for the loss curve of any specific model is an empirical question this work doesn't open. The compute story for approximate attention was already known: Alman and Song's 2023 framework showed you can drive flop count below n² with bounded error. What was missing was the I/O story — bytes moved between SRAM and HBM is what wall-clock long-context attention actually pays for, and a flop-cheap algorithm that shuffles bytes the same way as the exact one saves nothing in practice.

This paper closes that gap. The authors derive algorithms whose I/O between SRAM and HBM scales as roughly n · log n instead of n², in most realistic parameter regimes — and then prove matching lower bounds, showing no algorithm in the same regime can shave more without strengthening the problem assumptions. That second half is what makes the result load-bearing: it is not "we found a faster algorithm and there might be a better one." It is "this is near-optimal, and the constant factors are all that's left."

Where the I/O ratio actually shows up

Hold three variables fixed. Head dimension d = 128. Per-element size 4 bytes (fp32 partial sums during the softmax accumulation). SRAM size M = 220 KB (an H100 SM's shared-memory budget). Now sweep sequence length n. FlashAttention's I/O scales as roughly 4 · n² · d bytes — every pair of tokens contributes a tile read. The new algorithm's I/O scales as roughly 4 · n · log₂(n) · d bytes — only a sampled set of tile reads. At a small n = 8K the ratio is already ~600× (illustrative — actual ratios depend on the specific approximation parameters); at n = 128K it crosses ~8,000×; at n = 2M it is on the order of ~98,000× (illustrative). The compute story scales the same way — quadratic gives way to near-linear — but the I/O story is the one that hits wall-clock first, because long-context attention is memory-bandwidth-bound, not compute-bound. The hero animation above plots both curves on a log-log axis as n sweeps; the dot on the pink curve climbs above the chart frame around n = 128K, while the indigo dot barely moves.

Property	FlashAttention (exact)	This paper (approximate)
Output	Exact softmax-attention matrix	Approximate matrix within controlled error ε
Compute	Θ(n² · d) ~setup-dependent, illustrative	Roughly Õ(n · d), per the Alman–Song line ~setup-dependent, illustrative
I/O between SRAM and HBM	Θ(n² · d / √M) ~from FlashAttention paper	Near-linear in n in most realistic parameter regimes ~per this paper
Lower bound	Matches the upper bound — proven IO-optimal for exact attention	Matching I/O lower bound proven — near-optimal among approximate algorithms in the same regime
When it pays off	Any sequence length — the canonical kernel ~setup-dependent	Long contexts where bandwidth dominates and approximation error is acceptable ~setup-dependent, illustrative
Drop-in for production?	Yes — widely deployed since 2022	Not yet — theoretical algorithm; production kernels are the next step ~per this paper

This is the natural extension of the I/O-complexity story that motivates FlashAttention in the first place. FlashAttention's contribution was to make exact attention IO-optimal — every byte that has to move actually moves only once. This paper steps up one rung: if you are willing to accept approximation, the byte budget itself shrinks from n² to near-linear, and that shrinkage is provably the best you can do.

It is a different lever from PagedAttention, which manages where the K/V cache lives across requests but does not change the per-step attention I/O. It is also different from KV-cache quantization and prefix caching, which shrink the bytes that get moved per token but leave the n² scaling intact. This paper attacks the exponent of n directly — a structurally different win that compounds with all of the above at long context.

Goes deeper in: GPU & CUDA → Operator Fusion & FlashAttention → I/O Complexity

Related explainers

vLLM v0.20 — FlashAttention 4 packing — exact attention with packing to remove padding waste; still n² in I/O
vLLM v0.20 — TurboQuant 2-bit KV — orthogonal lever: shrink per-token KV bytes, not the matrix size
AOIAYN — persistent KV cache across queries — yet another orthogonal lever: amortize the prefill so attention only runs once per token
SGLang v0.5.12 — MLA and TokenSpeed — production-side throughput wins on the exact-attention side
PSD — parallel speculative decoding for diffusion LLMs — different inference-shape win on a different model class

Continue in trackOperator Fusion: I/O complexity

Frequently Asked Questions

Check what you knowMap your AI & GPU knowledge across every track — free, role-based