KVarN — Hadamard-rotated 2-bit KV cache
LLMThe news. On June 2, 2026, researchers at Huawei's Computing Systems Lab posted the KVarN paper to arXiv. It targets the error that wrecks low-bit KV caches during autoregressive decoding, which the authors trace mainly to incorrect per-token scales. KVarN applies a Hadamard rotation to the keys and values, then a dual-scaling variance normalization across both dimensions of the cache — a combination that is calibration-free and reports state-of-the-art 2-bit KV-cache quantization on the generative benchmarks MATH500, AIME24, and HumanEval. Read the paper →
Picture a tray holding a dozen little piles of sand, plus one giant pile that towers over the rest. You want to measure every pile with a cheap 4-notch dipstick — but the dipstick has to span all the way up to the giant pile, so its four notches end up spaced far apart. Every small pile sits below the lowest notch and reads as zero. The cheap ruler isn't wrong; the one outlier ruined it for everyone else. Now spin the tray: the sand redistributes into an even layer, no pile towers anymore, and the same four-notch dipstick suddenly reads every pile accurately. That spin is the whole idea behind KVarN.
In a transformer, the "piles" are the per-channel magnitudes of the KV cache — the stored keys and values the model rereads on every step. A handful of outlier channels run far hotter than the rest. When you try to store the cache in 2 bits — just four quantization levels — the lone outlier stretches the scale so wide that the ordinary channels all snap to the bottom level and effectively vanish. This is the textbook outlier problem in quantization, and it is why naive 2-bit KV caches fall apart. Earlier methods soften it with per-block scales — a different ruler for each small group of values — but a block that still contains an outlier is back to the same stretched ruler.
KVarN spins the tray instead. A Hadamard rotation is an orthogonal transform whose entries are all ±1; multiplying the keys and values by it mixes every channel into every other, so an outlier's magnitude gets spread thin across all dimensions rather than spiking in one. Because a Hadamard matrix is orthogonal the rotation is reversible: applied in matched pairs and undone downstream, it is designed to leave the attention result intact while the distribution the quantizer actually sees becomes flat, with no spikes. On top of that, a dual-scaling variance normalization corrects the per-token scales that the authors blame for the error in the first place, and the whole pipeline is calibration-free: it needs no sample dataset to tune.
The reason that last detail matters is error accumulation. Decoding is autoregressive — each new token reads the quantized cache, so a rounding error baked in at step 10 is still there, and compounding, at step 1,000. A 2-bit cache that looks fine on a short prompt can quietly drift on a long reasoning chain. By removing the outliers that cause the largest rounding errors and fixing the per-token scales, KVarN keeps the per-step error small enough that it doesn't snowball — which is exactly why it is evaluated on long generative benchmarks like MATH500 and AIME24 rather than one-shot accuracy.
Where the 8× actually comes from
The KV cache's size is pure bookkeeping: 2 (keys + values) × layers × KV-heads × head-dim × bytes-per-value, per token. Take an illustrative 80-layer model with 8 KV heads and a 128-dim head (illustrative — KVarN fixes no architecture). In FP16 that is 2 × 80 × 8 × 128 × 2 bytes ≈ 0.31 MB per token, so a 32,768-token context costs about 10.7 GB — often more than the model weights leave room for. Swap the 2 bytes for 2 bits (0.25 bytes) and the same context drops to ~1.3 GB — an 8× cut. The win is mechanical; the hard part is keeping accuracy at 2-bit, and that is the part the Hadamard rotation buys. Less KV memory directly loosens the binding constraint on serving throughput: the freed gigabytes become longer contexts or a bigger batch.
| KV-cache scheme | Bits / value | Outlier handling | Calibration |
|---|---|---|---|
| FP16 baseline | 16 | none needed (full precision) | n/a |
| Per-block-scale 2-bit | ~2 | local scale per block — a block with an outlier still stretches (setup-dependent, illustrative) | often calibrated |
| KVarN (Hadamard-rotated 2-bit) | ~2 | rotation spreads outliers across all channels (paper) | calibration-free (paper) |
A caveat worth stating plainly: KVarN reports state-of-the-art results at 2-bit, not lossless ones — 2-bit is still a lossy format, and "state-of-the-art" means best among 2-bit methods on the paper's benchmarks, not free. The Hadamard-rotation trick (sometimes called incoherence processing) is a known idea in weight quantization; KVarN's contribution is adapting it to the KV cache during decoding and pairing it with a calibration-free dual-scaling fix for per-token scales. How much you gain depends on how outlier-heavy your model's cache is in the first place — the spinning tray only helps if there were towering piles to level.
Goes deeper in: LLM Internals → Quantization → The Outlier Problem
Related explainers
- TurboQuant — 2-bit KV cache with per-block scales — the per-block-scale approach KVarN's rotation is meant to beat
- Quantization-conditioned attack — outlier injection — what happens when an outlier is weaponized against per-block scales
- SP-KV — self-pruned KV cache — a complementary lever: drop tokens from the cache instead of shrinking each value