What is DeepSeek Sparse Attention in Keye-VL-2.0?

DeepSeek Sparse Attention (DSA) is a sparse-attention scheme that uses a cheap 'lightning indexer' to score every earlier token, then has each query attend to only the top-k highest-scoring tokens instead of all of them. Keye-VL-2.0 adapts it to the multimodal setting, so the selection runs across video frames and text alike — letting a 30B Mixture-of-Experts model (3B active per token) keep a lossless 256K-token context and process hour-level video.

Why does DeepSeek Sparse Attention for video matter?

Long video blows up the token count, and dense attention's cost grows with the square of the sequence length, so an hour of footage is otherwise intractable. By scoring tokens cheaply and keeping only the few that matter per query, DSA holds the per-query cost roughly flat as context grows — which is what lets Keye-VL-2.0 attend over a full 256K window without down-sampling frames and still post state-of-the-art results for its scale on long-video benchmarks.

How is DSA different from block-sparse attention like MSA?

Both skip most of the attention matrix, but at different granularity. Block-sparse schemes (MiniMax's MSA, MoBA) group the KV cache into contiguous blocks and gather whole blocks; DSA selects individual top-k tokens, so the kept tokens can be scattered anywhere across the context. That fine-grained, per-token selection is what makes DSA a natural fit for video, where the frames that matter are often spread far apart across a long timeline.

Kwai Keye-VL-2.0 — DeepSeek Sparse Attention for video

TL;DR

What is it: The Keye-VL-2.0 release from Kwai — a 30B-parameter Mixture-of-Experts vision-language model with only 3B active per token — adapts DeepSeek Sparse Attention (DSA) to the multimodal setting so it can read a lossless 256K-token context and hour-long video.
Why it’s needed: Hour-level video explodes the token count, and ordinary attention's cost grows with the square of the length. DSA keeps the per-query cost roughly flat by attending to only the frames that matter — the kind of efficiency that lets a single model hold a whole video in context and post state-of-the-art results for its scale on long-video benchmarks.
vs previous: Where dense (full) attention makes every token compare against every earlier token — and where older long-video models downsample or drop frames to fit — Keye keeps every frame and instead makes the computation sparse, so the 256K context stays lossless rather than lossily compressed.

Jargon

DeepSeek Sparse Attention (DSA): A sparse-attention scheme from DeepSeek. Instead of every query attending to every earlier token, a lightweight scorer ranks the cached past and the query computes attention over only the top-scoring tokens — fine-grained, token-by-token selection rather than whole contiguous blocks.
Lightning indexer: DSA's cheap pre-scorer. It runs over all earlier tokens at a fraction of the cost of full attention and outputs a relevance score per token, so the expensive attention step only has to look at the few the indexer flags.
Top-k token selection: Keeping only the k highest-scoring tokens per query and ignoring the rest. The smaller k is relative to the context length, the cheaper each query becomes. It is the sparsity knob: with k fixed, the expensive attention step costs about the same per query however long the context grows, instead of growing with it.
Mixture-of-Experts (active vs total): The model has 30B total parameters but routes each token through only 3B of them. "Active" is what you pay to run a token; "total" is what you store. Background: it's the same active-vs-total split as any modern MoE.
Lossless 256K context: The model attends over the full 256,000-token window without throwing tokens away. "Lossless" here means the input isn't compressed or down-sampled — the computation is sparse, not the context.
KV cache: The stored Key and Value vectors for every token already processed, so the model never recomputes the past. It grows with context length — at 256K tokens of video it is the binding cost. See KV Cache → Memory Cost.
Cross-Modal Multi-Teacher On-Policy Distillation: Keye's training recipe: several teacher models supervise the student on its own outputs across text and vision, paired with Context-RL and Video-RL — a way to add long-context and video skill without the model forgetting what it already knew.
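
To make the KV cache entry concrete, here is a rough sizing sketch. Every architecture number in it (layer count, KV heads, head dimension, 2-byte elements) is a hypothetical placeholder, not a published Keye-VL-2.0 figure, and it assumes a plain grouped-query cache layout; models that compress the cache would come out smaller.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store one Key and one Value vector per token, per layer."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return tokens * per_token

# Hypothetical configuration, chosen only to show the scale of the number.
size = kv_cache_bytes(tokens=256_000, layers=48, kv_heads=4, head_dim=128)
print(size)                    # 25165824000
print(round(size / 1e9, 1))    # 25.2 (GB)
```

The cache grows linearly with the number of tokens, and top-k selection reduces how many cached entries each query reads, not how many are stored.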

The news. On June 9, 2026, the Kwai Keye team (52 co-authors) published Keye-VL-2.0, a 30B-parameter Mixture-of-Experts vision-language model that activates just 3B parameters per token. Its headline trick is porting DeepSeek Sparse Attention into a multimodal architecture so it can process a lossless 256K-token context and hour-level video, reporting state-of-the-art results for its scale on Video-MME-v2, LongVideoBench, and TimeLens.

Picture a night guard in front of a wall of camera monitors: one screen per feed, an entire night of footage rolling at once. The brute-force job is to stare at every screen at every moment, which is exactly how dense attention behaves: each new moment has to be checked against every earlier one. That never scales. Double the footage and you roughly quadruple the watching, because there are twice as many frames and each one has, on average, twice as many earlier frames to be checked against. A good guard doesn't do that. A motion detector scans every feed cheaply and flags only the handful with real action; the guard watches those and ignores the still hallways. Same coverage of what matters, a fraction of the attention.

That is the trade Keye-VL-2.0 makes. In a standard transformer, every token has to compare itself against every earlier token, so the attention work scales with the square of the sequence length. At a few thousand tokens nobody notices; at the 256,000 tokens of an hour of video it dominates everything else the model does. Laid out as a grid of queries against keys, that cost is a triangle: each new query adds a row of comparisons one token longer than the last.

DeepSeek Sparse Attention replaces the full sweep with a flag-then-watch. A lightweight lightning indexer runs over the whole cached past and scores every earlier token; the expensive attention step then reads only the top-k the indexer flagged — the few feeds with motion. The crucial detail, and what separates DSA from block-sparse schemes, is that the selection is per token, not per block: the flagged frames can be scattered anywhere across the hour, so a moment three minutes ago and a moment fifty minutes ago can both be kept while the still stretch between them is skipped. Keye's contribution is making that indexer work across modalities — pulling the critical frames out of long video and long text alike — so the 256K window stays lossless instead of being downsampled to fit.
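
The flag-then-watch step can be sketched for a single query in plain Python. This is a toy under stated assumptions, not Keye's or DeepSeek's implementation: the function names are hypothetical, and the indexer here is a simple dot product over small vectors standing in for whatever scoring function the real lightning indexer uses.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_attend(q, keys, values, idx_q, idx_keys, k):
    """One query: cheap indexer scores, top-k selection, attention over the kept tokens."""
    # 1) Indexer pass: score every cached token (stand-in scorer, an assumption).
    scores = [dot(idx_q, ik) for ik in idx_keys]
    # 2) Keep the k best positions; they may be scattered anywhere in the context.
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept.sort()
    # 3) Ordinary scaled dot-product attention, restricted to the kept positions.
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in kept])
    out = [sum(w * values[i][d] for w, i in zip(weights, kept))
           for d in range(len(values[0]))]
    return kept, out

# Six cached tokens; the indexer rates positions 1 and 4 highest.
keys = [[1.0, 0.0] for _ in range(6)]
values = [[float(i)] for i in range(6)]
idx_keys = [[0.1], [0.9], [0.2], [0.0], [0.8], [0.3]]
kept, out = sparse_attend([1.0, 0.0], keys, values, [1.0], idx_keys, k=2)
print(kept)   # [1, 4]
print(out)    # [2.5]
```

The two kept positions are not adjacent, which is the per-token behaviour described above: the tokens between them are simply never read by the attention step.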

Where the savings come from

Hold the setup fixed and walk the arithmetic. Picture a query at the 256,000th token. Dense attention compares it against all 256,000 cached keys, and summed across the whole sequence that triangle of comparisons is about 256,000² / 2 ≈ 33 billion per layer (derived from the 256K length). The lightning indexer still scores those 256,000 keys, but cheaply, and keeps only the highest few: in DeepSeek's original DSA, roughly the top 2,048 (illustrative; Keye's exact k isn't published). Now the expensive attention step touches ~2,048 keys per query instead of 256,000, about a 125× drop in per-query comparisons (illustrative), and that gap widens as the video gets longer, because k stays fixed while the number of keys a dense query must read keeps growing with the context. The MoE routing compounds it: only 3B of the 30B weights (≈10%, from the release) fire per token, so the model is cheap on both axes at once.
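
The same arithmetic can be checked in a few lines of Python. The sketch counts only the comparisons made by the main attention step; it deliberately leaves out the indexer's own scoring pass, and k = 2,048 is the illustrative value from DeepSeek's original DSA, not a published Keye setting.

```python
def dense_pairs(n: int) -> int:
    """Causal dense attention: query i is compared against keys 1..i."""
    return n * (n + 1) // 2

def sparse_pairs(n: int, k: int) -> int:
    """Top-k attention: query i reads min(i, k) keys after selection."""
    m = min(n, k)
    # The first k queries still read everything so far; the rest read exactly k.
    return m * (m + 1) // 2 + max(0, n - k) * k

n = 256_000
k = 2_048  # illustrative; Keye's exact k isn't published

print(dense_pairs(n))                               # 32768128000, about 33 billion
print(n // k)                                       # 125: ratio for the last query alone
print(round(dense_pairs(n) / sparse_pairs(n, k)))   # 63: ratio summed over all queries
```

The per-query ratio at the final token is 125×, while the ratio summed over the whole sequence is closer to 63×, because early dense queries are cheap too. At twice the length the final-query ratio doubles to 250×, which is the widening gap in action.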

How the attention variants compare

Approach	What each query looks at	Granularity	Cost vs context length
Dense (full) attention	every earlier token	—	grows with the square (n²)
Sliding-window	a fixed nearby window	contiguous, recent only	linear, but drops far context
Block-sparse (e.g. MSA / MoBA)	top-scored contiguous blocks	per block	sub-quadratic
DSA (Keye-VL-2.0)	top-k individual tokens / frames	per token (fine-grained)	near-flat per query at 256K (setup-dependent, illustrative)
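
The granularity column is easiest to see by asking which key positions one query would read under each scheme. The sketch below uses hypothetical helper names and a made-up score list; ranking a block by its best-scoring token is an assumption for illustration, not how MSA or MoBA actually score blocks.

```python
def sliding_window(n, window):
    """Read only the most recent `window` positions."""
    return list(range(max(0, n - window), n))

def block_top(scores, block_size, num_blocks):
    """Rank contiguous blocks (here by their best token) and read whole blocks."""
    blocks = [scores[i:i + block_size] for i in range(0, len(scores), block_size)]
    best = sorted(range(len(blocks)), key=lambda b: max(blocks[b]), reverse=True)[:num_blocks]
    kept = []
    for b in sorted(best):
        start = b * block_size
        kept.extend(range(start, min(start + block_size, len(scores))))
    return kept

def token_top(scores, k):
    """Read the k best individual positions, wherever they sit."""
    return sorted(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])

# Twelve cached tokens; only positions 1 and 10 are relevant to the query.
scores = [0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1]

print(sliding_window(len(scores), 4))   # [8, 9, 10, 11]
print(block_top(scores, 4, 2))          # [0, 1, 2, 3, 8, 9, 10, 11]
print(token_top(scores, 2))             # [1, 10]
```

In this toy case the sliding window misses the early relevant token, block selection finds both but reads eight positions to do so, and per-token selection reads exactly the two that matter.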

A caveat worth keeping: the exact value of k, the indexer's overhead, and the resulting speedup are all setup-dependent — sequence length, how many tokens the selector keeps, the modality mix, and the hardware all move them, and Keye reports a lossless 256K context and SOTA-for-scale benchmarks rather than a single headline multiplier. The durable lesson is the shape: a cheap scorer plus fine-grained top-k selection turns "watch every screen" into "watch the few with motion," and that's what makes hour-level video tractable on a 30B/3B-active model.

Goes deeper in: LLM Internals → Self-Attention → Computing Attention Scores

Related explainers

MiniMax M3 — MiniMax Sparse Attention (MSA) — the block-sparse cousin: gather contiguous KV blocks instead of selecting individual tokens
FlashMemory — Lookahead Sparse Attention — another way to shrink DeepSeek-style long-context attention, by predicting which KV to keep
Parallax — local linear attention — the linear-attention alternative to the top-k selection route DSA takes


