The news. On June 1, 2026, MiniMax released M3, an open-weight model that pairs frontier-level coding (59% on SWE-Bench Pro), a 1M-token context window, and native multimodality. The headline architecture change is MiniMax Sparse Attention (MSA), a block-sparse attention scheme that the team reports cuts per-token compute by about 20× at one million tokens.
Picture the metaphor for a moment. A reader walks into a vast library — every note the model has ever taken sits on the shelves — and asks one question. The lazy approach is to re-read every book in the building before answering. That always works, but the effort grows brutally: double the library and you roughly quadruple the reading, because each new question also has to consider every new book. A good librarian doesn't do that. The notes are filed onto labeled shelves, the librarian glances at the labels, and pulls only the handful of shelves that actually bear on the question. Same answer, a fraction of the walking.
That is exactly the trade dense attention makes, and exactly the one MSA refuses. In a standard transformer, every token has to compare itself against every earlier token, so the attention work scales with the square of the sequence length. At a few thousand tokens nobody notices; at a million it dominates everything else the model does. Laid out as a grid of queries against keys, that cost fills a triangle: each new query row adds another full row of comparisons.
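To put numbers on that quadratic growth, here is a minimal counting sketch. It assumes causal attention in which each token is compared with itself and with every earlier token; the function name `dense_causal_comparisons` is made up for this illustration.

```python
def dense_causal_comparisons(n: int) -> int:
    """Count query-key comparisons for dense causal attention over n tokens."""
    # Assumption: token i (1-indexed) attends to itself and the i - 1 tokens
    # before it, so the total is 1 + 2 + ... + n.
    return n * (n + 1) // 2


for n in (1_000, 2_000, 1_000_000):
    print(f"{n:>9,} tokens -> {dense_causal_comparisons(n):,} comparisons")

# Doubling the sequence length roughly quadruples the work.
ratio = dense_causal_comparisons(2_000) / dense_causal_comparisons(1_000)
print(f"2,000 vs 1,000 tokens: {ratio:.2f}x the comparisons")
```

The count covers only the attention-score comparisons, not the rest of the model's work.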
MiniMax Sparse Attention replaces the full sweep with a gather. It cuts the cached past into blocks — think of each block as one labeled shelf — scores which blocks are relevant to the current query, and computes attention over only the selected blocks. MiniMax calls the resulting memory pattern a "KV-outer gather Q": for each query, the engine gathers the chosen KV blocks instead of streaming the whole cache. The team reports this partitions the cache more precisely than earlier block-sparse schemes like DSA or MoBA, which is why M3 holds quality — it matches full attention on the vast majority of capabilities while skipping most of the comparisons.
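The score-select-gather idea can be sketched for a single query in NumPy. This is an illustration of generic block selection, not MiniMax's implementation: the source does not say how MSA scores blocks, so the mean-key "shelf label" used here is an assumption, and the names `block_sparse_attention`, `block_size`, and `keep_blocks` are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()


def dense_attention(q, K, V):
    """Attention for one query vector against every cached key."""
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V


def block_sparse_attention(q, K, V, block_size, keep_blocks):
    """Attend only to the top-scoring blocks of the KV cache."""
    n = K.shape[0]
    starts = np.arange(0, n, block_size)
    # Assumption: summarize each block by the mean of its keys (the "label").
    labels = np.stack([K[s:s + block_size].mean(axis=0) for s in starts])
    block_scores = labels @ q
    keep = max(1, min(keep_blocks, len(starts)))
    chosen = np.sort(np.argsort(block_scores)[-keep:])
    # Gather only the rows of K and V that belong to the chosen blocks.
    idx = np.concatenate(
        [np.arange(s, min(s + block_size, n)) for s in starts[chosen]]
    )
    return dense_attention(q, K[idx], V[idx]), idx


rng = np.random.default_rng(0)
K = rng.standard_normal((1024, 16))
V = rng.standard_normal((1024, 16))
q = rng.standard_normal(16)

out, idx = block_sparse_attention(q, K, V, block_size=128, keep_blocks=2)
print(f"keys touched: {idx.size} of {K.shape[0]}")
```

Keeping every block reproduces dense attention exactly, so in this sketch the quality gap depends entirely on how well the block scores track where the attention mass actually falls.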
Where the ~20× actually comes from
Hold the setup fixed and walk the arithmetic. Picture a query at the one-millionth token. Dense attention compares it against all 1,000,000 cached keys. MSA first groups those keys into blocks — say 128 keys each, so roughly 7,800 blocks (illustrative) — scores them, and keeps only the ones that matter. If the selector keeps about 5% of blocks (illustrative), the query now touches ~50,000 keys instead of 1,000,000 — a 20× drop in per-token comparisons, which lines up with the ~20× per-token compute cut MiniMax reports at 1M context. The savings show up twice in serving: >9× faster prefill (the prompt is read in one parallel pass) and >15× faster decode (each new token now gathers a few blocks instead of the whole cache).
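The arithmetic above can be checked in a few lines. The block size of 128 and the 5% keep rate are the same illustrative values used in the text, not published MSA settings, and `keys_touched_per_query` is a hypothetical helper name.

```python
import math


def keys_touched_per_query(context_len, block_size, keep_fraction):
    """Return (total blocks, kept blocks, keys touched) for one query."""
    n_blocks = math.ceil(context_len / block_size)
    kept_blocks = round(n_blocks * keep_fraction)
    return n_blocks, kept_blocks, kept_blocks * block_size


context_len = 1_000_000
n_blocks, kept_blocks, keys_touched = keys_touched_per_query(
    context_len, block_size=128, keep_fraction=0.05  # illustrative values
)
reduction = context_len / keys_touched

print(f"blocks: {n_blocks:,}, kept: {kept_blocks:,}, keys touched: {keys_touched:,}")
print(f"reduction vs dense: {reduction:.1f}x")
```

This counts only the keys a query attends to after selection; the cost of scoring the blocks themselves is left out here.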
How the attention variants compare
| Approach | What each query looks at | Cost vs context length | Note |
|---|---|---|---|
| Dense (full) attention | every earlier token | grows with the square (n²) | the baseline; exact but expensive |
| Sliding-window | a fixed nearby window | linear, but drops far context | cheap; loses long-range recall |
| DSA / MoBA (block selection) | top-scored blocks | sub-quadratic | prior block-sparse schemes |
| MSA (MiniMax) | top-scored KV blocks, gathered | ~20× less per-token compute at 1M (MiniMax; setup-dependent) | "partitions more precisely than DSA / MoBA" |
A caveat worth keeping: the ~20× compute, >9× prefill, and >15× decode figures are MiniMax's own numbers at the 1M-context operating point, and sparse-attention speedups are setup-dependent — block size, how many blocks the selector keeps, sequence length, and the hardware all move them. The qualitative win (gather a few shelves, not the whole library) is the durable lesson; the exact multiplier is a reported headline, not a guarantee at every length.
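A rough cost model shows how quickly the multiplier moves with the setup. It rests on two assumptions that are not from MiniMax: the selector spends one comparison per block label, and the number of kept blocks is a fixed budget rather than a fraction of the cache. The function name `estimated_reduction` is hypothetical.

```python
import math


def estimated_reduction(context_len, block_size, kept_blocks):
    """Estimate dense comparisons divided by sparse comparisons per query."""
    n_blocks = math.ceil(context_len / block_size)
    kept = min(kept_blocks, n_blocks)
    keys_gathered = min(kept * block_size, context_len)
    selector_cost = n_blocks  # assumption: one comparison per block label
    return context_len / (selector_cost + keys_gathered)


for context_len in (32_000, 100_000, 1_000_000):
    for block_size in (64, 128, 256):
        r = estimated_reduction(context_len, block_size, kept_blocks=391)
        print(f"context={context_len:>9,}  block={block_size:>3}  ~{r:.1f}x")
```

Under these assumptions the estimate at 1M tokens with 128-key blocks lands below 20× once selector work is counted, and at 32,000 tokens the same budget covers the whole cache, so nothing is saved.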
Related explainers
- IO-optimal approximate attention — near-linear IO — a different route to sub-quadratic attention: cut memory traffic rather than select blocks
- Tangram — per-head KV cache budgets — another way to shrink long-context attention cost, by sizing each head's KV budget instead of selecting blocks
- Parallax — local linear attention — the linear-attention alternative to the block-sparse approach MSA takes