The news. On June 9, 2026, the Kwai Keye team (52 co-authors) published Keye-VL-2.0, a 30B-parameter Mixture-of-Experts vision-language model that activates just 3B parameters per token. Its headline trick is porting DeepSeek Sparse Attention into a multimodal architecture so it can process a lossless 256K-token context and hour-level video, reporting state-of-the-art results for its scale on Video-MME-v2, LongVideoBench, and TimeLens.
Picture a night guard in front of a wall of camera monitors: one screen per feed, an entire night of footage rolling at once. The brute-force job is to stare at every screen at every moment, which is exactly how dense attention behaves: each new moment has to be checked against every earlier one. That never scales. Double the footage and you roughly quadruple the watching, because the new frames have to be compared against all the old frames and against each other. A good guard doesn't do that. A motion detector scans every feed cheaply and flags only the handful with real action; the guard watches those and ignores the still hallways. Same coverage of what matters, a fraction of the attention.
That is the trade Keye-VL-2.0 makes. In a standard transformer, every token has to compare itself against every earlier token, so the attention work scales with the square of the sequence length. At a few thousand tokens nobody notices; at the 256,000 tokens of an hour of video it dominates everything else the model does. Drawn as a grid of queries against keys, that cost is a triangle: each new query adds a row one comparison longer than the last.
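The quadratic growth is easy to check by counting. A minimal sketch, assuming each token attends to itself and to every earlier token (the usual causal setup); the function name `causal_comparisons` is made up for this illustration:

```python
def causal_comparisons(n: int) -> int:
    """Query-key pairs that dense causal attention scores for n tokens.

    Token i (1-indexed) looks at itself and the i - 1 tokens before it,
    so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return n * (n + 1) // 2


for n in (4_000, 128_000, 256_000):
    print(f"{n:>9,} tokens -> {causal_comparisons(n):>15,} comparisons")

# Doubling the sequence length roughly quadruples the work.
ratio = causal_comparisons(256_000) / causal_comparisons(128_000)
print(f"256K vs 128K: {ratio:.3f}x")
```

The count is per attention head and per layer, so the real bill is this number multiplied by both.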
DeepSeek Sparse Attention replaces the full sweep with a flag-then-watch. A lightweight lightning indexer runs over the whole cached past and scores every earlier token; the expensive attention step then reads only the top-k the indexer flagged — the few feeds with motion. The crucial detail, and what separates DSA from block-sparse schemes, is that the selection is per token, not per block: the flagged frames can be scattered anywhere across the hour, so a moment three minutes ago and a moment fifty minutes ago can both be kept while the still stretch between them is skipped. Keye's contribution is making that indexer work across modalities — pulling the critical frames out of long video and long text alike — so the 256K window stays lossless instead of being downsampled to fit.
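The flag-then-watch pattern can be sketched for a single query in plain Python. This is an illustration of the general idea only, not Keye's or DeepSeek's implementation: the plain dot-product scorer stands in for the lightning indexer, and `sparse_attend`, `index_query`, and `index_keys` are hypothetical names introduced here.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attend(query, keys, values, index_query, index_keys, k):
    """Top-k sparse attention for one query token.

    index_query / index_keys are small, cheap vectors used only for
    scoring (a stand-in for the lightning indexer, by assumption).
    Returns the kept positions and the attention output vector.
    """
    # 1. Cheap pass: score every cached position with the indexer.
    scores = [dot(index_query, ik) for ik in index_keys]

    # 2. Keep the k best positions, wherever they sit in the sequence.
    keep = sorted(heapq.nlargest(k, range(len(keys)), key=scores.__getitem__))

    # 3. Expensive pass: softmax attention over the kept positions only.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in keep]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, keep)) / total
        for d in range(dim)
    ]
    return keep, output


# Six cached positions; the indexer scores positions 1 and 4 highest.
index_keys = [[0.1], [0.9], [0.0], [0.2], [0.8], [0.1]]
keys = [[1.0, 0.0] for _ in range(6)]
values = [[float(i), 0.0] for i in range(6)]

keep, output = sparse_attend([1.0, 0.0], keys, values, [1.0], index_keys, k=2)
print(keep, output)
```

Positions 1 and 4 are kept even though they are not adjacent, which is the per-token selection the text describes. Setting `k` to the full sequence length recovers ordinary dense attention.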
Where the savings come from
Hold the setup fixed and walk the arithmetic. Picture a query at the 256,000th token. Dense attention compares it against all 256,000 cached keys, and summed across the whole sequence that triangle of comparisons is about 256,000² / 2 ≈ 33 billion per layer (derived from the 256K length). The lightning indexer instead scores those 256,000 keys and keeps only the highest few: in DeepSeek's original DSA, roughly the top 2,048 (illustrative; Keye's exact k isn't published). Now the expensive attention step for that query reads ~2,048 keys instead of 256,000, about a 125× drop in per-query comparisons (illustrative). The indexer still scans every key, but with a much cheaper scoring pass. The gap widens as the video gets longer, because k stays fixed while the dense per-query cost grows with the context and the dense total grows with its square. The MoE routing compounds it: only 3B of the 30B weights (≈10%, from the release) fire per token, so the model is cheap on both axes at once.
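The same arithmetic as a short script, so the numbers can be rerun with other settings. It uses the figures quoted above; `TOP_K = 2_048` is the illustrative value from DeepSeek's original DSA, not a published Keye setting, and the indexer's own scoring cost is deliberately left out.

```python
CONTEXT = 256_000      # tokens in the window, from the text
TOP_K = 2_048          # illustrative; Keye's exact k is not published
TOTAL_PARAMS = 30e9    # from the release
ACTIVE_PARAMS = 3e9    # from the release


def dense_total(n):
    """Approximate dense comparisons over the whole sequence (the triangle)."""
    return n * n / 2


def per_query_reduction(n, k):
    """How many times fewer keys the attention step reads for the last query."""
    return n / min(n, k)


print(f"dense total at 256K: {dense_total(CONTEXT) / 1e9:.1f} billion per layer")

# k is fixed, so the per-query saving grows with the context length.
for n in (CONTEXT, 2 * CONTEXT, 4 * CONTEXT):
    print(f"{n:>9,} tokens: {per_query_reduction(n, TOP_K):.0f}x fewer keys read")

print(f"active weights per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

With these inputs the per-query reduction comes out at 125x, 250x, and 500x as the context doubles twice.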
How the attention variants compare
| Approach | What each query looks at | Granularity | Cost vs context length |
|---|---|---|---|
| Dense (full) attention | every earlier token | — | grows with the square (n²) |
| Sliding-window | a fixed nearby window | contiguous, recent only | linear, but drops far context |
| Block-sparse (e.g. MSA / MoBA) | top-scored contiguous blocks | per block | sub-quadratic |
| DSA (Keye-VL-2.0) | top-k individual tokens / frames | per token (fine-grained) | near-flat per query at 256K (setup-dependent, illustrative) |
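The granularity column is the one that is easiest to see on a toy example. The sketch below compares per-block and per-token selection on the same relevance scores. The scores are invented, and ranking a block by its mean score is an assumption made for illustration; real block-sparse methods have their own block-scoring rules.

```python
def select_tokens(scores, k):
    """Per-token selection: positions of the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_blocks(scores, block_size, num_blocks):
    """Per-block selection: keep the best contiguous blocks, whole."""
    def block_mean(start):
        block = scores[start:start + block_size]
        return sum(block) / len(block)

    starts = sorted(range(0, len(scores), block_size), key=block_mean, reverse=True)
    kept = sorted(starts[:num_blocks])
    return [i for s in kept for i in range(s, min(s + block_size, len(scores)))]


# Twelve positions with two relevant moments far apart (positions 1 and 10).
scores = [0.1, 0.9, 0.1, 0.1,
          0.2, 0.2, 0.2, 0.2,
          0.1, 0.1, 0.8, 0.1]

by_block = select_blocks(scores, block_size=4, num_blocks=1)
by_token = select_tokens(scores, k=2)
print("per block:", by_block)
print("per token:", by_token)
```

The block selector reads four positions and still misses the moment at position 10, while the token selector reads two positions and keeps both.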
A caveat worth keeping: the exact value of k, the indexer's overhead, and the resulting speedup are all setup-dependent — sequence length, how many tokens the selector keeps, the modality mix, and the hardware all move them, and Keye reports a lossless 256K context and SOTA-for-scale benchmarks rather than a single headline multiplier. The durable lesson is the shape: a cheap scorer plus fine-grained top-k selection turns "watch every screen" into "watch the few with motion," and that's what makes hour-level video tractable on a 30B/3B-active model.
Related explainers
- MiniMax M3 — MiniMax Sparse Attention (MSA) — the block-sparse cousin: gather contiguous KV blocks instead of selecting individual tokens
- FlashMemory — Lookahead Sparse Attention — another way to shrink DeepSeek-style long-context attention, by predicting which KV to keep
- Parallax — local linear attention — the linear-attention alternative to the top-k selection route DSA takes