The news. On May 16, 2026, the SGLang team released v0.5.12 with TokenSpeed MLA, a new attention backend for NVIDIA Blackwell (SM100) with an optional FP8 K/V cache. The release notes report a TMA bulk-store implementation of set_mla_kv_buffer running up to ~12× faster than baseline, alongside DeepSeek V4 day-0 inference and HiCache coordination across DRAM and SSD under a single radix-tree prefix cache.
Picture the orchestra again. Standard multi-head attention (MHA) is the version where every section maintains its own filing cabinet of past performances — strings keep their own copy of every bar ever played, brass keeps its own copy, winds keep theirs. Each filing cabinet is a head's per-token K and V in the KV cache. The cabinets sit in the same room (HBM), but every new performance has to ship updates to every cabinet, and together they outweigh the conductor's score.
MLA changes the cabinet layout. There is now one shared master score on the conductor's stand — the low-rank latent K/V — and every section pulls from it through a small transposition card that adapts the page to their instrument's key. The transposition cards are the per-head up-projections Wᵘ_K and Wᵘ_V; they are part of the orchestra's permanent setup (the model weights), not part of the score itself (the cache). New performances ship one update to the master score, not eight updates to eight cabinets.
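The master-score layout can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not SGLang's or DeepSeek's implementation: the toy sizes, the weight names (W_down, W_up_k, W_up_v), and the helper functions are all hypothetical, and the separate RoPE slot is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration, far smaller than a real model).
d_model, n_heads, d_head, d_latent = 32, 4, 8, 6

# Model weights: one shared down-projection plus per-head up-projections.
# These live with the model, not in the cache.
W_down = rng.standard_normal((d_model, d_latent))
W_up_k = rng.standard_normal((n_heads, d_latent, d_head))
W_up_v = rng.standard_normal((n_heads, d_latent, d_head))

def append_token(cache, hidden):
    """Cache only the shared latent for the new token (one row per token)."""
    latent = hidden @ W_down                      # shape (d_latent,)
    return np.vstack([cache, latent[None, :]])

def expand_cache(cache):
    """Rebuild per-head K and V from the shared latent at attention time."""
    k = np.einsum("tc,hcd->htd", cache, W_up_k)   # (n_heads, tokens, d_head)
    v = np.einsum("tc,hcd->htd", cache, W_up_v)
    return k, v

cache = np.empty((0, d_latent))
for _ in range(5):
    cache = append_token(cache, rng.standard_normal(d_model))

k, v = expand_cache(cache)
print(cache.shape, k.shape, v.shape)  # (5, 6) (4, 5, 8) (4, 5, 8)
```

Only the (tokens, d_latent) array is stored between steps; the per-head K and V are transient and can be recomputed whenever attention runs.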
Hold the dimensions fixed and walk the cache math. MHA, at 64 heads × 128 dims per head, caches 2 × 64 × 128 = 16,384 elements per token per layer (one K and one V vector for every head). MLA, with a latent of d_c = 512 plus a small d_h^R = 64 slot to carry RoPE the standard way, caches 512 + 64 = 576 elements per token per layer. Ratio per layer: 16,384 / 576 ≈ 28× smaller cache (illustrative; DeepSeek-V3's 128-head configuration gives 32,768 / 576 ≈ 57×). Stack 61 transformer layers in FP16 and MHA costs about 2 MB of cache per token against about 70 KB for MLA. At a 1M-token context that is roughly 2 TB per request versus roughly 70 GB, which is the gap between long-context serving that is out of reach and serving that is merely expensive.
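The arithmetic above is easy to reproduce. The sketch below uses only the dimensions quoted in the text (64 or 128 heads, 128 dims per head, d_c = 512, d_h^R = 64, 61 layers, 1M tokens); the function names are hypothetical helpers, and the byte totals cover the KV cache only, not weights or activations.

```python
def mha_elems_per_token(n_heads, d_head):
    """K and V for every head: 2 * heads * head_dim elements per layer."""
    return 2 * n_heads * d_head

def mla_elems_per_token(d_latent, d_rope):
    """Shared latent plus the small RoPE slot, per layer."""
    return d_latent + d_rope

def cache_bytes(elems_per_token, n_layers, n_tokens, bytes_per_elem):
    """Total KV-cache footprint for one request."""
    return elems_per_token * n_layers * n_tokens * bytes_per_elem

mha = mha_elems_per_token(64, 128)    # 16384 elements
mla = mla_elems_per_token(512, 64)    # 576 elements
print(f"per-layer ratio, 64 heads:  {mha / mla:.1f}x")
print(f"per-layer ratio, 128 heads: {mha_elems_per_token(128, 128) / mla:.1f}x")

n_layers, n_tokens = 61, 1_000_000
for name, elems, width in [("MHA FP16", mha, 2), ("MLA FP16", mla, 2), ("MLA FP8", mla, 1)]:
    gb = cache_bytes(elems, n_layers, n_tokens, width) / 1e9
    print(f"{name}: {gb:,.1f} GB for a 1M-token request")
```

The FP8 row shows the effect of the optional one-byte K/V format on the same 576-element layout.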
The cache only shrinks if the kernel honors the savings. Pre-Blackwell MLA implementations on Hopper wrote into the latent buffer with scalar stores, one element per instruction, and kept K/V in FP16, so the cabinet got smaller but the carts moving paper still wheeled one sheet at a time. TokenSpeed MLA on Blackwell rewrites that path. The named win is set_mla_kv_buffer, the per-token cache write, which runs at a reported ~12× over baseline by using the GPU's Tensor Memory Accelerator (TMA) to bulk-store whole 4-D K/V chunks in a single instruction. The optional FP8 K/V cache path halves the bytes per element relative to FP16 on top of that, so cache reads and writes move less data and are no longer throttled by elementwise stores.
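As a CPU-side analogy only, the difference between the two write patterns looks like this in NumPy. Nothing here is TMA, CUDA, or SGLang's set_mla_kv_buffer; the buffer layout (latent followed by the RoPE slot in one 576-wide row), the slot indices, and both function names are assumptions chosen to illustrate "one element at a time" versus "one bulk copy per block".

```python
import numpy as np

D_LATENT, D_ROPE = 512, 64      # dimensions quoted in the text
N_SLOTS = 16                    # toy cache capacity (assumption)

# One layer's latent cache: one 576-element row per token slot.
kv_buffer = np.zeros((N_SLOTS, D_LATENT + D_ROPE), dtype=np.float16)

def write_rows_elementwise(buf, slots, latent, rope):
    """Scalar-store pattern: one element per assignment."""
    for i, slot in enumerate(slots):
        for j in range(D_LATENT):
            buf[slot, j] = latent[i, j]
        for j in range(D_ROPE):
            buf[slot, D_LATENT + j] = rope[i, j]

def write_rows_bulk(buf, slots, latent, rope):
    """Bulk pattern: each block lands in a single copy."""
    buf[slots, :D_LATENT] = latent
    buf[slots, D_LATENT:] = rope

rng = np.random.default_rng(0)
slots = np.array([3, 7, 8])     # cache slots assigned to three new tokens
latent = rng.standard_normal((3, D_LATENT)).astype(np.float16)
rope = rng.standard_normal((3, D_ROPE)).astype(np.float16)

a, b = kv_buffer.copy(), kv_buffer.copy()
write_rows_elementwise(a, slots, latent, rope)
write_rows_bulk(b, slots, latent, rope)
print(np.array_equal(a, b))     # True: same result, different write pattern
```

Both functions produce identical buffers; the point is that the work per token is fixed, so the cost is dominated by how many store operations it takes to move it.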
Two flavors of cache sharing now live on this site. Grouped Query Attention (GQA) has multiple query heads share one K and one V head: same per-head storage layout, just fewer K/V heads. MLA has every head share one low-rank latent, and the per-head work moves from storage time to attention time via the up-projections. GQA keeps the standard Wᵏ and Wᵛ projections; MLA replaces them with a latent-down step plus per-head up-projections. The cache savings compound differently. GQA divides the cache by the number of query heads per K/V head, while MLA divides it by the ratio of full per-head K/V width (2 × heads × head dim) to latent width (d_c + d_h^R). Both ride the same idea: stop paying for per-head K/V at cache time.
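The two divisors can be compared side by side. This sketch reuses the 64-head, 128-dim example and the d_c = 512, d_h^R = 64 latent from the cache math; the K/V head counts tried for GQA (8 and 1) are illustrative assumptions, and the helper names are hypothetical.

```python
def gqa_elems_per_token(n_kv_heads, d_head):
    """GQA caches full K and V, but only for the shared K/V heads."""
    return 2 * n_kv_heads * d_head

def mla_elems_per_token(d_latent, d_rope):
    """MLA caches one shared latent plus the RoPE slot."""
    return d_latent + d_rope

n_q_heads, d_head = 64, 128
# MHA is the special case of GQA with one K/V head per query head.
mha = gqa_elems_per_token(n_q_heads, d_head)

for n_kv_heads in (64, 8, 1):
    elems = gqa_elems_per_token(n_kv_heads, d_head)
    print(f"GQA with {n_kv_heads:>2} K/V heads: {elems:>6} elems/token/layer, "
          f"{mha / elems:.1f}x smaller than MHA")

mla = mla_elems_per_token(512, 64)
print(f"MLA (512 + 64 latent):  {mla:>6} elems/token/layer, "
      f"{mha / mla:.1f}x smaller than MHA")
```

With these numbers MLA's cache sits between 8-group GQA and the single-K/V-head extreme, while every query head still gets its own K and V through the up-projections.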