I/O-optimal approximate attention — near-linear vs n²

GPU
L
I/O complexity comparison — FlashAttention reads every tile of the n by n attention matrix between SRAM and HBM, producing quadratic data movement. The approximate algorithm samples roughly n log n tiles, producing near-linear data movement, with matching I/O lower bounds proven.
learnaivisually.com/ai-explained/io-optimal-approx-attention-near-linear-io

The news. On May 22, 2026, a new research paper derived I/O-aware variants of approximate attention whose data movement between SRAM and HBM scales almost-linearly in sequence length n in most realistic parameter regimes. The work builds on the 2023 Alman–Song approximate-attention framework, which gave controlled-error approximations of the attention matrix but did not analyze the I/O cost on hierarchical memory. The new contribution is twofold: an algorithm whose I/O is near-linear in n, and matching lower bounds proving no algorithm in the same regime can do strictly better.

Picture a librarian comparing books. Two shelves, hundreds of titles each. The librarian's desk is small — three books fit on it at once. To produce an exact pairwise similarity matrix, the librarian must pull every pair to the desk, compare them, and put them back: every trip is a tiny piece of work, but the number of trips grows with the square of the number of books. That is the loop FlashAttention runs on a GPU. The "shelves" are HBM — large, slow, where the full Q, K, V tensors live. The "desk" is SRAM — tiny, fast, where the actual softmax happens. FlashAttention is already IO-aware: it tiles the n × n matrix into SRAM-sized blocks and processes each block exactly once, which is the best you can do if you insist on reading the entire matrix. The librarian still has to fetch every pair.

The new paper asks: what if the librarian doesn't have to read every page? Approximate attention swaps the exact softmax for a controlled-error stand-in — bounded to within an explicit error parameter ε in the analysis, cheap enough that the librarian only needs to inspect a structured sample of pairs. The paper makes a precise complexity claim; what an ε-close attention matrix means for the loss curve of any specific model is an empirical question this work doesn't open. The compute story for approximate attention was already known: Alman and Song's 2023 framework showed you can drive flop count below with bounded error. What was missing was the I/O story — bytes moved between SRAM and HBM is what wall-clock long-context attention actually pays for, and a flop-cheap algorithm that shuffles bytes the same way as the exact one saves nothing in practice.

This paper closes that gap. The authors derive algorithms whose I/O between SRAM and HBM scales as roughly n · log n instead of , in most realistic parameter regimes — and then prove matching lower bounds, showing no algorithm in the same regime can shave more without strengthening the problem assumptions. That second half is what makes the result load-bearing: it is not "we found a faster algorithm and there might be a better one." It is "this is near-optimal, and the constant factors are all that's left."

Where the I/O ratio actually shows up

Hold three variables fixed. Head dimension d = 128. Per-element size 4 bytes (fp32 partial sums during the softmax accumulation). SRAM size M = 220 KB (an H100 SM's shared-memory budget). Now sweep sequence length n. FlashAttention's I/O scales as roughly 4 · n² · d bytes — every pair of tokens contributes a tile read. The new algorithm's I/O scales as roughly 4 · n · log₂(n) · d bytes — only a sampled set of tile reads. At a small n = 8K the ratio is already ~600× (illustrative — actual ratios depend on the specific approximation parameters); at n = 128K it crosses ~8,000×; at n = 2M it is on the order of ~98,000× (illustrative). The compute story scales the same way — quadratic gives way to near-linear — but the I/O story is the one that hits wall-clock first, because long-context attention is memory-bandwidth-bound, not compute-bound. The hero animation above plots both curves on a log-log axis as n sweeps; the dot on the pink curve climbs above the chart frame around n = 128K, while the indigo dot barely moves.

PropertyFlashAttention (exact)This paper (approximate)
OutputExact softmax-attention matrixApproximate matrix within controlled error ε
ComputeΘ(n² · d) ~setup-dependent, illustrativeRoughly Õ(n · d), per the Alman–Song line ~setup-dependent, illustrative
I/O between SRAM and HBMΘ(n² · d / √M) ~from FlashAttention paperNear-linear in n in most realistic parameter regimes ~per this paper
Lower boundMatches the upper bound — proven IO-optimal for exact attentionMatching I/O lower bound proven — near-optimal among approximate algorithms in the same regime
When it pays offAny sequence length — the canonical kernel ~setup-dependentLong contexts where bandwidth dominates and approximation error is acceptable ~setup-dependent, illustrative
Drop-in for production?Yes — widely deployed since 2022Not yet — theoretical algorithm; production kernels are the next step ~per this paper

This is the natural extension of the I/O-complexity story that motivates FlashAttention in the first place. FlashAttention's contribution was to make exact attention IO-optimal — every byte that has to move actually moves only once. This paper steps up one rung: if you are willing to accept approximation, the byte budget itself shrinks from to near-linear, and that shrinkage is provably the best you can do.

It is a different lever from PagedAttention, which manages where the K/V cache lives across requests but does not change the per-step attention I/O. It is also different from KV-cache quantization and prefix caching, which shrink the bytes that get moved per token but leave the scaling intact. This paper attacks the exponent of n directly — a structurally different win that compounds with all of the above at long context.

Goes deeper in: GPU & CUDA → Operator Fusion & FlashAttention → I/O Complexity

Related explainers

Frequently Asked Questions