The news. On June 1, 2026, MiniMax released M3, an open-weight model that pairs frontier-level coding (59% on SWE-Bench Pro), a 1M-token context window, and native multimodality. The headline architecture change is MiniMax Sparse Attention (MSA), a block-sparse attention mechanism that the team reports cuts per-token compute about 20× at one million tokens.

Picture the metaphor for a moment. A reader walks into a vast library, where every note the model has ever taken sits on the shelves, and asks one question. The lazy approach is to re-read every book in the building before answering. That always works, but the effort grows brutally: double the library and you roughly quadruple the reading, because every new book arrives with a question of its own, so there are twice as many questions and each one has twice as many books to consider. A good librarian doesn't do that. The notes are filed onto labeled shelves, the librarian glances at the labels, and pulls only the handful of shelves that actually bear on the question. Nearly the same answer, a fraction of the walking.

That is exactly the trade dense attention makes — and exactly the one MSA refuses. In a standard transformer, every token has to compare itself against every earlier token, so the attention work scales with the square of the sequence length. At a few thousand tokens nobody notices; at a million it dominates everything else the model does. The triangle below is that cost made visible — each new query row adds another full row of comparisons:

[Figure: causal attention cost for the six tokens "The cat sat on the mat". Step 1: 1 op; Step 2: 2 ops; Step 3: 3 ops; Step 4: 4 ops; Step 5: 5 ops; Step 6: 6 ops; total = 21 = n(n+1)/2 = 6×7/2.]
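The triangle can be reproduced in a few lines. This is a minimal sketch that only counts query-key comparisons under causal attention; the function names are made up for illustration, and it says nothing about real kernel cost.

```python
def dense_causal_comparisons(n_tokens: int) -> int:
    """Count comparisons when token i attends to tokens 1..i (itself included)."""
    return sum(step for step in range(1, n_tokens + 1))


def triangular(n_tokens: int) -> int:
    """Same count via the closed form n(n+1)/2."""
    return n_tokens * (n_tokens + 1) // 2


tokens = "The cat sat on the mat".split()
n = len(tokens)

# Each row of the triangle: step i costs i comparisons.
for step, token in enumerate(tokens, start=1):
    print(f"Step {step} ({token!r}): {step} ops")

print("total:", dense_causal_comparisons(n), "closed form:", triangular(n))

# Doubling the sequence length roughly quadruples the total work.
print("2000 vs 1000 tokens:", triangular(2_000) / triangular(1_000))
```

For the six-token sentence both functions give 21, matching the figure, and the last line prints a ratio just under 4.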

MiniMax Sparse Attention replaces the full sweep with a gather. It cuts the cached past into blocks — think of each block as one labeled shelf — scores which blocks are relevant to the current query, and computes attention over only the selected blocks. MiniMax calls the resulting memory pattern a "KV-outer gather Q": for each query, the engine gathers the chosen KV blocks instead of streaming the whole cache. The team reports this partitions the cache more precisely than earlier block-sparse schemes like DSA or MoBA, which is why M3 holds quality — it matches full attention on the vast majority of capabilities while skipping most of the comparisons.
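The score-select-gather idea can be sketched for a single query in plain Python. This is not MiniMax's implementation: the selector used here (the query dotted with each block's mean key), the function names, and the tiny sizes are all assumptions chosen for illustration, since the source does not describe how MSA scores blocks.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_attention(query, keys, values):
    """Exact scaled dot-product attention for one query."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Score blocks, keep the top_k, and attend only inside the kept blocks."""
    blocks = [(start, min(start + block_size, len(keys)))
              for start in range(0, len(keys), block_size)]

    def block_score(span):
        lo, hi = span
        # ASSUMED selector: query dotted with the block's mean key.
        mean_key = [sum(k[d] for k in keys[lo:hi]) / (hi - lo)
                    for d in range(len(query))]
        return dot(query, mean_key)

    best = sorted(blocks, key=block_score, reverse=True)[:top_k]
    chosen = sorted(best)  # restore positional order before gathering
    picked = [i for lo, hi in chosen for i in range(lo, hi)]
    # The "gather": pull only the selected KV entries out of the cache.
    gathered_keys = [keys[i] for i in picked]
    gathered_values = [values[i] for i in picked]
    return softmax_attention(query, gathered_keys, gathered_values), picked


rng = random.Random(0)
dim, n_keys = 4, 32
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
query = [rng.gauss(0, 1) for _ in range(dim)]

sparse_out, picked = block_sparse_attention(query, keys, values,
                                            block_size=8, top_k=2)
print(f"attended to {len(picked)} of {n_keys} cached keys")
```

Setting `top_k` to the total number of blocks recovers dense attention exactly, which is a handy sanity check. Any quality gap at smaller `top_k` comes entirely from what the selector leaves out.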

Where the ~20× actually comes from

Hold the setup fixed and walk the arithmetic. Picture a query at the one-millionth token. Dense attention compares it against all 1,000,000 cached keys. MSA first groups those keys into blocks (say 128 keys each, so roughly 7,800 blocks; illustrative), scores them, and keeps only the ones that matter. If the selector keeps about 5% of blocks (about 390 blocks; illustrative), the query now touches ~50,000 keys instead of 1,000,000: a 20× drop in per-token comparisons, which lines up with the ~20× per-token compute cut MiniMax reports at 1M context. MiniMax reports the savings in both phases of serving: >9× faster prefill (the prompt is read in one parallel pass) and >15× faster decode (each new token now gathers a few blocks instead of the whole cache).
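The same arithmetic as a short script. The block size (128) and keep fraction (5%) are the illustrative values from the paragraph above, not published MSA settings. Charging one extra comparison per block for the scoring step is a further assumption added here, to show that selection is not free.

```python
import math


def sparse_keys_touched(context_len, block_size, keep_fraction):
    """Return (total blocks, kept blocks, keys touched) for one query."""
    n_blocks = math.ceil(context_len / block_size)
    kept_blocks = math.ceil(n_blocks * keep_fraction)
    keys_touched = min(kept_blocks * block_size, context_len)
    return n_blocks, kept_blocks, keys_touched


CONTEXT_LEN = 1_000_000   # query sits at the one-millionth token
BLOCK_SIZE = 128          # illustrative, from the text
KEEP_FRACTION = 0.05      # illustrative, from the text

n_blocks, kept_blocks, keys_touched = sparse_keys_touched(
    CONTEXT_LEN, BLOCK_SIZE, KEEP_FRACTION)
reduction = CONTEXT_LEN / keys_touched

# ASSUMPTION: scoring costs one comparison per block (not from the source).
reduction_with_scoring = CONTEXT_LEN / (keys_touched + n_blocks)

print(f"blocks: {n_blocks}, kept: {kept_blocks}, keys touched: {keys_touched}")
print(f"reduction in attended keys: {reduction:.1f}x")
print(f"including block scoring:    {reduction_with_scoring:.1f}x")
```

With these inputs the script keeps 391 of 7,813 blocks and touches 50,048 keys, about a 20× reduction. Counting the assumed scoring pass pulls the figure down somewhat, one reason the headline multiplier depends on how the selector is built.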

How the attention variants compare

| Approach | What each query looks at | Cost vs context length | Note |
| --- | --- | --- | --- |
| Dense (full) attention | every earlier token | grows with the square (n²) | the baseline; exact but expensive |
| Sliding-window | a fixed nearby window | linear, but drops far context | cheap; loses long-range recall |
| DSA / MoBA (block selection) | top-scored blocks | sub-quadratic | prior block-sparse schemes |
| MSA (MiniMax) | top-scored KV blocks, gathered | ~20× less per-token compute at 1M (MiniMax; setup-dependent) | "partitions more precisely than DSA / MoBA" |

A caveat worth keeping: the ~20× compute, >9× prefill, and >15× decode figures are MiniMax's own numbers at the 1M-context operating point, and sparse-attention speedups are setup-dependent — block size, how many blocks the selector keeps, sequence length, and the hardware all move them. The qualitative win (gather a few shelves, not the whole library) is the durable lesson; the exact multiplier is a reported headline, not a guarantee at every length.
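A quick sweep shows how much the multiplier moves with the setup. This sketch counts attended keys only. The keep fractions and the fixed 391-block budget are hypothetical values, and nothing here models hardware or how MSA's selector actually behaves at different lengths.

```python
import math


def key_reduction(context_len, block_size, kept_blocks):
    """Dense keys divided by keys touched after block selection."""
    n_blocks = math.ceil(context_len / block_size)
    kept = min(kept_blocks, n_blocks)
    touched = min(kept * block_size, context_len)
    return context_len / touched


def kept_from_fraction(context_len, block_size, keep_fraction):
    """Number of blocks kept when the selector keeps a fixed fraction."""
    n_blocks = math.ceil(context_len / block_size)
    return max(1, math.ceil(n_blocks * keep_fraction))


BLOCK_SIZE = 128  # illustrative

# Sweep 1: fixed 1M context, varying keep fraction (hypothetical values).
for fraction in (0.02, 0.05, 0.10):
    kept = kept_from_fraction(1_000_000, BLOCK_SIZE, fraction)
    ratio = key_reduction(1_000_000, BLOCK_SIZE, kept)
    print(f"keep {fraction:.0%} of blocks -> {ratio:.1f}x fewer keys")

# Sweep 2: ASSUMED fixed budget of 391 blocks, varying context length.
for context_len in (128_000, 256_000, 1_000_000):
    ratio = key_reduction(context_len, BLOCK_SIZE, 391)
    print(f"{context_len:>9,} tokens -> {ratio:.1f}x fewer keys")
```

Under a fixed block budget the reduction shrinks sharply at shorter contexts, so a 1M-token figure should not be read as applying at every length.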

Goes deeper in: LLM Internals → Self-Attention → Computing Attention Scores


Frequently Asked Questions