vLLM v0.20 — TurboQuant 2-bit KV cache

[Animation: TurboQuant 2-bit KV cache — naive global quantization]

The news. In April 2026, vLLM v0.20.0 shipped with TurboQuant 2-bit KV cache compression alongside FlashAttention 4, CUDA 13.0, and PyTorch 2.11. TurboQuant stores every cached key and value at 2 bits instead of 16, using block-wise asymmetric scaling so outliers in the KV distribution don't crush the resolution of the small values around them. Read the release →

Picture a national grocery chain trying to print a single price scale that has to cover every store in the country. If a luxury market in Manhattan stocks a $4,000 watch, the printed range has to stretch from a 25¢ candy bar to $4,000 — and with only four price levels on the tag, the candy bar and the milk both round to "cheap," indistinguishable. The watch fits, but every other product loses the resolution that mattered. That's exactly what naive 2-bit quantization does to the KV cache.
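
To see the collapse in numbers, here is a throwaway NumPy sketch of that single national price tag (hypothetical prices, not vLLM code), quantized to the four levels a 2-bit code allows:

    import numpy as np

    # One national price tag: 2 bits gives four levels stretched over the whole range.
    prices = np.array([0.25, 2.50, 3.50, 4000.0])   # candy bar, milk, bread, luxury watch

    lo, hi = prices.min(), prices.max()
    step = (hi - lo) / 3                       # 4 levels -> 3 steps between min and max
    codes = np.round((prices - lo) / step)     # quantize to codes 0..3
    restored = codes * step + lo               # dequantize

    print(codes)     # candy bar, milk and bread all map to code 0; only the watch gets code 3
    print(restored)  # 0.25, 0.25, 0.25, 4000.0 -> everything cheap rounds to the same price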

TurboQuant rewires the chain. Each store gets its own price tag with its own min and max — its own per-block scale. The Manhattan store still encodes the watch correctly, while the suburban store's milk and candy bar get all four levels spread across that store's much narrower range. The watch's outsized price doesn't leak across town. In KV terms, the cache is split into small blocks of 8 or 16 consecutive values; each block stores its own minimum and scale (two fp16 scalars), and the values inside it are encoded as 2-bit offsets from the block minimum. Outliers stay isolated in the block they live in.
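
Here is a minimal sketch of that block-wise asymmetric scheme in plain NumPy, assuming a block size of 16 and per-block fp16 min/scale as described above; the function names are made up for illustration and this is not the vLLM kernel:

    import numpy as np

    BLOCK = 16  # values per block, as in the text

    def quantize_2bit_blocks(x):
        """Per-block asymmetric 2-bit quantization: store min + scale (fp16) per block."""
        x = x.reshape(-1, BLOCK)
        mins = x.min(axis=1, keepdims=True).astype(np.float16)
        scales = ((x.max(axis=1, keepdims=True) - mins) / 3).astype(np.float16)  # 4 levels -> 3 steps
        safe = np.where(scales == 0, np.float16(1), scales)           # avoid divide-by-zero
        codes = np.clip(np.round((x - mins) / safe), 0, 3).astype(np.uint8)
        return codes, mins, scales

    def dequantize_2bit_blocks(codes, mins, scales):
        return (codes.astype(np.float32) * scales + mins).reshape(-1)

    np.random.seed(0)
    # One "suburban" block of small values, one block that contains a single outlier.
    kv = np.concatenate([np.random.randn(16) * 0.02,
                         np.concatenate([[8.0], np.random.randn(15) * 0.02])])
    codes, mins, scales = quantize_2bit_blocks(kv.astype(np.float32))
    approx = dequantize_2bit_blocks(codes, mins, scales)

    # Per-block reconstruction error: it stays tiny in the outlier-free block;
    # the outlier only costs precision inside its own block.
    print(np.abs(kv - approx).reshape(-1, BLOCK).max(axis=1))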

The underlying motivation is that the KV cache is the dominant memory consumer during decode. Every token written to the cache contributes a key vector and a value vector for each layer and head; at fp16 that's 2 bytes per number. For a 70B-parameter model at 8K context, the cache can grow to roughly 16 GB of GPU memory — and it's that cache, not the model weights, that limits how many concurrent users you can serve on one GPU. Cutting per-value precision from 16 bits to 2 bits is an 8× raw reduction; the per-block scale metadata claws some of that back, but the realistic landing point is still a clean 4× saving with negligible quality loss.

Per-token KV cache, factor by factor (Llama-2 7B):

K and V: two vectors stored per token (Key + Value) → × 2
Layers: each layer has its own cache (like 32 filing cabinets) → × 32
Heads: each attention head stores its own K/V pair → × 32
Head size: each K or V vector has 128 numbers (d_head) → × 128
Bytes per number: FP16 = 2 bytes per number (half precision) → × 2

Per token (Llama-2 7B): 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB
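
The same multiplication, written out as a few lines of Python using the Llama-2 7B numbers from the breakdown above (just arithmetic, not vLLM code):

    # Per-token KV cache for Llama-2 7B at fp16, straight from the breakdown above.
    kv_vectors = 2     # one key vector + one value vector per token
    layers     = 32
    heads      = 32
    head_size  = 128   # d_head
    fp16_bytes = 2

    per_token = kv_vectors * layers * heads * head_size * fp16_bytes
    print(per_token)                  # 524288 bytes, i.e. 512 KB per token
    print(per_token * 8192 / 2**30)   # 4.0 -> one 8K-context sequence costs ~4 GiB at fp16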

The arithmetic is small enough to do by hand. With block size 16, each block stores 16 values × 2 bits = 4 bytes of quantized data plus 2 fp16 scalars (4 bytes) — 8 bytes total for a block that previously consumed 16 × 2 = 32 bytes at fp16. That's exactly 4× smaller at this realistic block size, and the ratio approaches the theoretical 8× as block sizes grow and the scale-metadata overhead amortizes. At the worked-example scale: a 70B model at 8K context that needed approximately 16 GB of KV cache at fp16 lands around 4 GB total under TurboQuant — freeing the rest for either 4× more concurrent users on the same GPU or a proportionally longer context window per user.
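
The same block arithmetic, spelled out the same way (plain arithmetic, illustrative variable names):

    # One TurboQuant block (16 values) vs. the same 16 values at fp16.
    block_size = 16
    payload    = block_size * 2 // 8   # 16 values x 2 bits = 4 bytes of codes
    metadata   = 2 * 2                 # per-block min + scale, two fp16 scalars = 4 bytes
    fp16_block = block_size * 2        # 32 bytes uncompressed

    print(fp16_block / (payload + metadata))   # 4.0 -> exactly 4x smaller at block size 16

    # The ratio creeps toward the theoretical 8x as blocks grow and the metadata amortizes.
    for b in (16, 64, 256, 1024):
        print(b, round((b * 2) / (b * 2 / 8 + 4), 2))   # 4.0, 6.4, 7.53, 7.88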

Go deeper in: LLM Internals → KV Cache → Memory and LLM Internals → Quantization → The Outlier Problem
