vLLM v0.20 — TurboQuant 2-bit KV cache

[Animation: TurboQuant 2-bit KV cache — naive global quantization]

The news. In April 2026, vLLM v0.20.0 shipped with TurboQuant 2-bit KV cache compression alongside FlashAttention 4, CUDA 13.0, and PyTorch 2.11. TurboQuant stores every cached key and value at 2 bits instead of 16, using block-wise asymmetric scaling so outliers in the KV distribution don't crush the resolution of the small values around them. Read the release →

Picture a national grocery chain trying to print a single price scale that has to cover every store in the country. If a luxury market in Manhattan stocks a $4,000 watch, the printed range has to stretch from a 25¢ candy bar to $4,000 — and with only four price levels on the tag, the candy bar and the milk both round to "cheap," indistinguishable. The watch fits, but every other product loses the resolution that mattered. That's exactly what naive 2-bit quantization does to the KV cache.
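
To see the collapse in numbers, here is a throwaway NumPy sketch of that single national price tag (hypothetical prices, not vLLM code), quantized to the four levels a 2-bit code allows:

    import numpy as np

    # One national price tag: 2 bits gives four levels stretched over the whole range.
    prices = np.array([0.25, 2.50, 3.50, 4000.0])   # candy bar, milk, bread, luxury watch

    lo, hi = prices.min(), prices.max()
    step = (hi - lo) / 3                       # 4 levels -> 3 steps between min and max
    codes = np.round((prices - lo) / step)     # quantize to codes 0..3
    restored = codes * step + lo               # dequantize

    print(codes)     # candy bar, milk and bread all map to code 0; only the watch gets code 3
    print(restored)  # 0.25, 0.25, 0.25, 4000.0 -> everything cheap rounds to the same price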

TurboQuant rewires the chain. Each store gets its own price tag with its own min and max — its own per-block scale. The Manhattan store still encodes the watch correctly, while the suburban store's milk and candy bar get all four levels spread across that store's much narrower range. The watch's outsized price doesn't leak across town. In KV terms, the cache is split into small blocks of 8 or 16 consecutive values; each block stores its own minimum and scale (two fp16 scalars), and the values inside it are encoded as 2-bit offsets from the block minimum. Outliers stay isolated in the block they live in.
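
Here is a minimal sketch of that block-wise asymmetric scheme in plain NumPy, assuming a block size of 16 and per-block fp16 min/scale as described above; the function names are made up for illustration and this is not the vLLM kernel:

    import numpy as np

    BLOCK = 16  # values per block, as in the text

    def quantize_2bit_blocks(x):
        """Per-block asymmetric 2-bit quantization: store min + scale (fp16) per block."""
        x = x.reshape(-1, BLOCK)
        mins = x.min(axis=1, keepdims=True).astype(np.float16)
        scales = ((x.max(axis=1, keepdims=True) - mins) / 3).astype(np.float16)  # 4 levels -> 3 steps
        safe = np.where(scales == 0, np.float16(1), scales)           # avoid divide-by-zero
        codes = np.clip(np.round((x - mins) / safe), 0, 3).astype(np.uint8)
        return codes, mins, scales

    def dequantize_2bit_blocks(codes, mins, scales):
        return (codes.astype(np.float32) * scales + mins).reshape(-1)

    np.random.seed(0)
    # One "suburban" block of small values, one block that contains a single outlier.
    kv = np.concatenate([np.random.randn(16) * 0.02,
                         np.concatenate([[8.0], np.random.randn(15) * 0.02])])
    codes, mins, scales = quantize_2bit_blocks(kv.astype(np.float32))
    approx = dequantize_2bit_blocks(codes, mins, scales)

    # Per-block reconstruction error: it stays tiny in the outlier-free block;
    # the outlier only costs precision inside its own block.
    print(np.abs(kv - approx).reshape(-1, BLOCK).max(axis=1))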

The underlying motivation is that the KV cache is the dominant memory consumer during decode. Every token written to the cache contributes a key vector and a value vector for each layer and head; at fp16 that's 2 bytes per number. For a 70B-parameter model at 8K context, the cache can grow to roughly 16 GB of GPU memory — and it's that cache, not the model weights, that limits how many concurrent users you can serve on one GPU. Cutting per-value precision from 16 bits to 2 bits is an 8× raw reduction; the per-block scale metadata claws some of that back, but the realistic landing point is still a clean 4× saving with negligible quality loss.

Per-token KV cache, factor by factor (Llama-2 7B):

K and V: two vectors stored per token (Key + Value) → × 2
Layers: each layer has its own cache (like 32 filing cabinets) → × 32
Heads: each attention head stores its own K/V pair → × 32
Head size: each K or V vector has 128 numbers (d_head) → × 128
Bytes per number: FP16 = 2 bytes per number (half precision) → × 2

Per token (Llama-2 7B): 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB
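
The same multiplication, written out as a few lines of Python using the Llama-2 7B numbers from the breakdown above (just arithmetic, not vLLM code):

    # Per-token KV cache for Llama-2 7B at fp16, straight from the breakdown above.
    kv_vectors = 2     # one key vector + one value vector per token
    layers     = 32
    heads      = 32
    head_size  = 128   # d_head
    fp16_bytes = 2

    per_token = kv_vectors * layers * heads * head_size * fp16_bytes
    print(per_token)                  # 524288 bytes, i.e. 512 KB per token
    print(per_token * 8192 / 2**30)   # 4.0 -> one 8K-context sequence costs ~4 GiB at fp16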

The arithmetic is small enough to do by hand. With block size 16, each block stores 16 values × 2 bits = 4 bytes of quantized data plus 2 fp16 scalars (4 bytes) — 8 bytes total for a block that previously consumed 16 × 2 = 32 bytes at fp16. That's exactly 4× smaller at this realistic block size, and the ratio approaches the theoretical 8× as block sizes grow and the scale-metadata overhead amortizes. At the worked-example scale: a 70B model at 8K context that needed approximately 16 GB of KV cache at fp16 lands around 4 GB total under TurboQuant — freeing the rest for either 4× more concurrent users on the same GPU or a proportionally longer context window per user.
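
The same block arithmetic, spelled out the same way (plain arithmetic, illustrative variable names):

    # One TurboQuant block (16 values) vs. the same 16 values at fp16.
    block_size = 16
    payload    = block_size * 2 // 8   # 16 values x 2 bits = 4 bytes of codes
    metadata   = 2 * 2                 # per-block min + scale, two fp16 scalars = 4 bytes
    fp16_block = block_size * 2        # 32 bytes uncompressed

    print(fp16_block / (payload + metadata))   # 4.0 -> exactly 4x smaller at block size 16

    # The ratio creeps toward the theoretical 8x as blocks grow and the metadata amortizes.
    for b in (16, 64, 256, 1024):
        print(b, round((b * 2) / (b * 2 / 8 + 4), 2))   # 4.0, 6.4, 7.53, 7.88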

Go deeper in: LLM Internals → KV Cache → Memory and LLM Internals → Quantization → The Outlier Problem
