What is Multi-head Latent Attention (MLA)?

MLA is the attention variant introduced in DeepSeek V2 and used in V3 and V4. Instead of caching per-head K and V tensors directly, each layer projects K and V down into a single shared low-rank latent vector of dimension d_c and caches only that latent. Per-head K and V are reconstructed at attention time via small per-head up-projection matrices that are part of the model's static weights, not the cache. The K/V cache shrinks by roughly the ratio of 2 × n_heads × d_head to d_c plus a small RoPE slot (d_c + d_h^R): with d_c = 512 and d_h^R = 64, a 64-head, 128-dim model caches about 28× less per token per layer, and DeepSeek V3's 128-head configuration comes closer to 57×.
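
The down-project, cache, and up-project flow can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not DeepSeek's or SGLang's implementation: the sizes are deliberately tiny, the weights are random, the RoPE slot is left out, and every name (W_down, W_up_k, W_up_v, append_token, keys_and_values) is hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical); the article's example uses d_c = 512.
D_MODEL, D_C, N_HEADS, D_HEAD = 16, 4, 2, 8

rng = random.Random(0)
W_down = rand_matrix(D_C, D_MODEL, rng)  # shared down-projection
# Per-head up-projections: static model weights, never stored in the cache.
W_up_k = [rand_matrix(D_HEAD, D_C, rng) for _ in range(N_HEADS)]
W_up_v = [rand_matrix(D_HEAD, D_C, rng) for _ in range(N_HEADS)]

latent_cache = []  # one d_c-sized latent per token; nothing else is cached


def append_token(hidden):
    """Project the token's hidden state down and cache only the latent."""
    latent_cache.append(matvec(W_down, hidden))


def keys_and_values(head):
    """Rebuild one head's K and V for every cached token at attention time."""
    keys = [matvec(W_up_k[head], c) for c in latent_cache]
    values = [matvec(W_up_v[head], c) for c in latent_cache]
    return keys, values


for _ in range(3):
    append_token([rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)])

keys, values = keys_and_values(head=0)
cached = sum(len(c) for c in latent_cache)                # values actually stored
per_head_equiv = 2 * N_HEADS * D_HEAD * len(latent_cache)  # what per-head K/V would store
print(f"cached {cached} values instead of {per_head_equiv}")
```

Adding more heads to this sketch grows W_up_k and W_up_v but leaves latent_cache the same size, which is the property the answer above describes.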

Why is the K/V cache the bottleneck at long context?

The K/V cache grows linearly with sequence length, so a 1M-token context across 60+ transformer layers pushes the cache into multi-GB-per-request territory. That cache lives in HBM and is read on every new token, so at long context the cache bandwidth, not the matmul, is what caps throughput on a given GPU. Anything that shrinks the cache by an integer factor (MLA, GQA, KV quantization, prefix sharing) is higher-impact than further compressing the model weights, whose footprint is fixed and shared by every request rather than growing with each request's context. TokenSpeed MLA on Blackwell pairs MLA's algorithmic shrink with TMA bulk-store and optional FP8 K/V so that the cache-write kernel does not give those savings back.
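
A back-of-envelope estimate makes the bandwidth argument concrete. The sketch below is arithmetic only, not a benchmark. It borrows the article's illustrative shapes (64 heads × 128 dims for per-head K/V, a 512 + 64 latent for MLA, 61 layers, FP16) and assumes a round 8 TB/s of HBM bandwidth, a figure that does not come from the release notes. The function names are hypothetical.

```python
def cache_bytes(values_per_token_per_layer, n_layers, context_tokens, bytes_per_value):
    """Total K/V cache bytes one request holds at a given context length."""
    return values_per_token_per_layer * n_layers * context_tokens * bytes_per_value


def min_read_ms(n_bytes, hbm_bytes_per_s):
    """Lower bound on the time to stream the cache once, ignoring all compute."""
    return 1000.0 * n_bytes / hbm_bytes_per_s


N_LAYERS = 61            # layer count used in the article's worked example
FP16_BYTES = 2
HBM_BYTES_PER_S = 8e12   # ASSUMPTION: 8 TB/s, a round illustrative figure

MHA_VALUES = 2 * 64 * 128   # per-head K and V, 64 heads x 128 dims
MLA_VALUES = 512 + 64       # shared latent plus the RoPE slot

for context in (8_192, 131_072, 1_000_000):
    mha = cache_bytes(MHA_VALUES, N_LAYERS, context, FP16_BYTES)
    mla = cache_bytes(MLA_VALUES, N_LAYERS, context, FP16_BYTES)
    print(
        f"{context:>9} tokens: "
        f"MHA {mha / 1e9:8.1f} GB ({min_read_ms(mha, HBM_BYTES_PER_S):7.2f} ms per pass), "
        f"MLA {mla / 1e9:6.1f} GB ({min_read_ms(mla, HBM_BYTES_PER_S):5.2f} ms per pass)"
    )
```

Because the cache is read once per generated token, the per-pass time is a floor on decode latency for that request. It scales with context length, while the cost of reading the weights does not.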

How does MLA differ from Grouped Query Attention (GQA)?

GQA has multiple query heads share one K and one V head — the per-head K/V storage layout is unchanged, there are just fewer K/V heads. MLA shares one low-rank latent across all heads and pushes the per-head work into up-projection matrices applied at attention time. GQA preserves the standard W^k and W^v projections; MLA replaces them with a latent-down step plus per-head up-projections. Both reduce K/V cache cost — GQA by the head-group ratio (typically 4× or 8×), MLA by the head-count-to-latent-dim ratio (often 20–60×) — but MLA also changes the math at attention time, where GQA leaves it untouched.
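
The two divisors can be put side by side. This is a minimal sketch that counts cached values per token per layer for each layout, assuming the article's illustrative 64-head, 128-dim shape and a 512 + 64 MLA latent. The CacheLayout class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheLayout:
    name: str
    values_per_token: int  # values cached per token per layer


def mha(n_heads, d_head):
    """Every query head keeps its own K and V."""
    return CacheLayout("MHA", 2 * n_heads * d_head)


def gqa(n_heads, group_size, d_head):
    """Each group of query heads shares one K head and one V head."""
    if n_heads % group_size:
        raise ValueError("group_size must divide n_heads")
    n_kv_heads = n_heads // group_size
    return CacheLayout(f"GQA (group of {group_size})", 2 * n_kv_heads * d_head)


def mla(d_c, d_rope):
    """All heads share one latent plus a small RoPE slot."""
    return CacheLayout("MLA", d_c + d_rope)


baseline = mha(64, 128)
layouts = [baseline, gqa(64, 4, 128), gqa(64, 8, 128), mla(512, 64)]

for layout in layouts:
    ratio = baseline.values_per_token / layout.values_per_token
    print(f"{layout.name:<18}{layout.values_per_token:>7} values  {ratio:5.1f}x smaller")
```

The GQA rows shrink in proportion to the group size, while the MLA row depends only on the latent width, so it stays the same if the head count changes.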

What is TokenSpeed MLA, and what does the ~12× speedup refer to?

TokenSpeed MLA is the new Multi-head Latent Attention backend SGLang v0.5.12 ships for NVIDIA Blackwell SM100. The reported ~12×-over-baseline speedup is specific: it is the per-token cache-write kernel, set_mla_kv_buffer, rewritten to use Blackwell's tensor memory accelerator (TMA) to bulk-store whole K/V chunks in one instruction instead of one element per instruction. Earlier MLA backends on Hopper used scalar stores and kept the K/V in FP16, so MLA's algorithmic cache savings were partly eaten back at the kernel boundary; TokenSpeed closes that gap, with an optional FP8 K/V path halving the per-element cost on top.

What else does SGLang v0.5.12 add besides TokenSpeed MLA?

The release also ships day-0 inference support for DeepSeek V4, plus HiCache, which coordinates a prefix cache across DRAM and SSD under a single radix tree. Reproducibility is a headline claim: one Docker image (lmsysorg/sglang:v0.5.12) covers NVIDIA B300, B200, H200, H100, GB200, and GB300 plus AMD MI35X. The optional FP8 K/V cache is exposed as a setting on top of MLA's per-token latent, so the two cache savings compose.

SGLang v0.5.12 — TokenSpeed MLA backend on Blackwell

learnaivisually.com/ai-explained/sglang-v0-5-12-tokenspeed-mla

TL;DR

What is it: The SGLang v0.5.12 release ships TokenSpeed MLA — a new attention backend for NVIDIA Blackwell SM100 that implements Multi-head Latent Attention (MLA), the DeepSeek-style architecture where every head reads one shared low-rank latent K/V cache instead of its own per-head K/V.
Why it’s needed: Long-context serving is throughput-starved by the per-head K/V cache. At 1M-token contexts the cache eclipses the model weights in HBM and dominates inter-GPU traffic. MLA caches a low-rank latent that is roughly 1/30 the size, so cache reads, evictions, and prefix-cache transfers all shrink together. SGLang's TokenSpeed kernel adds a TMA bulk-store path that the release reports at ~12× over baseline on the per-token cache write.
vs previous: Earlier MLA backends on Hopper used scalar stores into the latent buffer (one element per instruction) and kept the KV in FP16, so the cache savings were eaten back at the kernel boundary; TokenSpeed MLA combines MLA's algorithmic shrink with Blackwell's tensor memory accelerator for bulk stores and an FP8 KV option, closing the gap between the math savings and the realized memory traffic.

Jargon

MLA: Multi-head Latent Attention — the attention variant introduced in DeepSeek V2 and used in V3/V4. Instead of caching per-head K and V, the layer projects both down into one shared low-rank latent vector and caches that. Per-head K and V are reconstructed at attention time via small up-projection matrices that are part of the model's static weights, not the cache.
K/V cache: The stored key and value tensors from every previous token, kept around so each new attention score can read them without recomputing. At long context the cache dominates HBM and inter-GPU traffic — which is why shrinking it is the highest-impact serving optimization.
Up-projection Wᵘ: A small per-head matrix (Wᵘ_K for keys, Wᵘ_V for values) that reconstructs each head's full K or V vector from the shared latent at attention time. These weights live with the model, not in the cache, so adding heads doesn't grow per-token cache cost.
TMA bulk-store: The Tensor Memory Accelerator on Blackwell issues a bulk store that moves an entire tile of values from shared memory to global memory in one operation. Replacing a per-element loop with one TMA call is what the SGLang release credits with the ~12× speedup on set_mla_kv_buffer, the per-token cache-write kernel.
FP8 KV cache: Storing the K and V values (or, here, the MLA latent) in FP8 instead of FP16, roughly halving cache bytes per element. TokenSpeed MLA exposes this as an optional setting on top of MLA's per-token reduction — the two savings compose.
GQA: Grouped Query Attention — an earlier cache-sharing arrangement where multiple query heads share one K head and one V head. GQA keeps the standard projections but reduces head count on the K/V side. MLA goes further: every head shares one low-rank latent, with per-head up-projections doing the per-head work at attention time. See the GQA step for the head-level version of the same idea.
SGLang: An open-source inference engine in the same category as vLLM. v0.5.12 is the release that adds TokenSpeed MLA and ships day-0 support for DeepSeek V4, with the headline reproducibility claim being a single Docker image (lmsysorg/sglang:v0.5.12) covering NVIDIA B300/B200/H200/H100/GB200/GB300 and AMD MI35X.

The news. On May 16, 2026, the SGLang team released v0.5.12 with TokenSpeed MLA — a new attention backend integrated on NVIDIA Blackwell SM100 with optional FP8 K/V cache. The release notes report a TMA bulk-store implementation of set_mla_kv_buffer running up to ~12× over baseline, alongside DeepSeek V4 day-0 inference and HiCache coordination across DRAM and SSD under a single radix-tree prefix cache.

Picture the orchestra again. Standard multi-head attention (MHA) is the version where every section maintains its own filing cabinet of past performances — strings keep their own copy of every bar ever played, brass keeps its own copy, winds keep theirs. Each filing cabinet is a head's per-token K and V in the KV cache. The cabinets sit in the same room (HBM), but every new performance has to ship updates to every cabinet, and together they outweigh the conductor's score.

MLA changes the cabinet layout. There is now one shared master score on the conductor's stand (the low-rank latent K/V), and every section pulls from it through a small transposition card that adapts the page to its instrument's key. The transposition cards are the per-head up-projections Wᵘ_K and Wᵘ_V; they are part of the orchestra's permanent setup (the model weights), not part of the score itself (the cache). New performances ship one update to the master score, not a separate update to every section's cabinet.

Hold the dimensions fixed and walk the cache math. MHA, at 64 heads × 128 dims per head in FP16, caches 2 × 64 × 128 = 16,384 values per token per layer. MLA, with a latent of d_c = 512 plus a small d_h^R = 64 slot to carry RoPE the standard way, caches 512 + 64 = 576 values per token per layer. Ratio per layer: 16,384 / 576 ≈ 28× smaller cache (illustrative; DeepSeek-V3's 128-head configuration hits roughly 57×). Stack 61 transformer layers and a 1M-token context, and the cache that previously pushed long-context serving into multi-GB-per-request territory now fits in a fraction of one HBM partition.
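
The same arithmetic as a short script, for readers who want to plug in their own shapes. It is a sketch under the paragraph's stated dimensions. The function names are hypothetical, and the FP8 case makes the simplifying assumption that the whole 576-value row (latent plus RoPE slot) is stored at one byte per value.

```python
def mha_values(n_heads, d_head):
    """Values cached per token per layer with per-head K and V."""
    return 2 * n_heads * d_head


def mla_values(d_c, d_rope):
    """Values cached per token per layer with a shared latent plus RoPE slot."""
    return d_c + d_rope


def shrink(n_heads, d_head, d_c=512, d_rope=64, mha_bytes=2, mla_bytes=2):
    """How many times fewer bytes per token per layer MLA caches than MHA."""
    mha = mha_values(n_heads, d_head) * mha_bytes
    mla = mla_values(d_c, d_rope) * mla_bytes
    return mha / mla


illustrative = shrink(64, 128)            # the 64-head walkthrough
v3_like = shrink(128, 128)                # a 128-head configuration
with_fp8 = shrink(64, 128, mla_bytes=1)   # FP16 MHA against an FP8 latent row

print(f"64 heads:  {illustrative:.1f}x")
print(f"128 heads: {v3_like:.1f}x")
print(f"64 heads, FP8 latent: {with_fp8:.1f}x")
```

The FP8 row shows why the article says the two savings compose: the byte width multiplies the ratio without touching the value counts.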

The cache only shrinks if the kernel honors the savings. Pre-Blackwell MLA implementations on Hopper wrote into the latent buffer with scalar stores — one element per instruction — and kept K/V in FP16, so the cabinet got smaller but the carts moving paper still wheeled one sheet at a time. TokenSpeed MLA on Blackwell rewrites that path. The named win is set_mla_kv_buffer — the per-token cache-write — running at a reported ~12× over baseline by using Blackwell's tensor memory accelerator to bulk-store whole 4-D K/V chunks in a single instruction. The optional FP8 K/V cache path halves the per-element cost on top, so cache reads and writes ride Blackwell's high-bandwidth tier instead of being throttled by elementwise stores.
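
The difference between the two write patterns can be mimicked on the host. The sketch below is only an analogy in plain Python: it does not use TMA, does not model GPU timing, and does not reproduce the real set_mla_kv_buffer signature. The helper names are hypothetical, and the row size assumes the article's 512 + 64 latent.

```python
LATENT_VALUES = 512 + 64  # d_c plus the RoPE slot, per the article's example


def write_per_element(buffer, offset, row):
    """One store per byte, like the scalar path the article describes."""
    stores = 0
    for i, byte in enumerate(row):
        buffer[offset + i] = byte
        stores += 1
    return stores


def write_bulk(buffer, offset, row):
    """One slice assignment for the whole row: a stand-in for a bulk store."""
    buffer[offset:offset + len(row)] = row
    return 1


def token_row(bytes_per_value):
    """Dummy bytes standing in for one token's cached row."""
    return bytes(i % 256 for i in range(LATENT_VALUES * bytes_per_value))


fp16_row = token_row(2)  # 2 bytes per value
fp8_row = token_row(1)   # 1 byte per value: half the bytes to move

# Two identical four-slot buffers; write the same row into slot 2 of each.
slow = bytearray(len(fp16_row) * 4)
fast = bytearray(len(fp16_row) * 4)
slot = 2
slow_stores = write_per_element(slow, slot * len(fp16_row), fp16_row)
fast_stores = write_bulk(fast, slot * len(fp16_row), fp16_row)

print(f"per-element: {slow_stores} stores, bulk: {fast_stores} store")
print(f"FP16 row: {len(fp16_row)} bytes, FP8 row: {len(fp8_row)} bytes")
```

Both paths leave the buffer in the same state. What changes is how many operations it takes to get there, which is the part of the cost the release's ~12× figure refers to.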

Two flavors of cache sharing now live on this site. Grouped Query Attention (GQA) has multiple query heads share one K and one V head — same per-head storage layout, just fewer K/V heads. MLA has every head share one low-rank latent, and the per-head work moves from storage time to attention time via the up-projections. GQA keeps the standard Wᵏ and Wᵛ projections; MLA replaces them with a latent-down step plus per-head up-projections. The cache savings compound differently — GQA divides by the head-group ratio, MLA divides by the head-count-to-latent-dim ratio — but both ride the same idea: stop paying for per-head K/V at cache time.


Related explainers

DeepSeek V4-Pro and V4-Flash — long-context cost cut to a fraction — the model family that popularized MLA as the long-context recipe
vLLM v0.20 — FlashAttention 4 packing — a different attention-kernel win, this time on the attention compute side rather than the cache write
vLLM v0.20 — TurboQuant 2-bit KV cache — orthogonal axis: compress K/V values themselves, on top of MLA-style sharing
SP-KV — utility predictor for the KV cache — choose which K/V entries to keep at all
