KVarN — Hadamard-rotated 2-bit KV cache

LLM
L
Hadamard rotation spreads outliers so 2-bit KV buckets fitKV-cache channel magnitudes (one token)← 2-bit grid (4 levels)outlier channelsmall value rounds to 0 — errortotal quantization error96%HRotate first — then 2-bit buckets fit every channelquantization error: highmuch lowerKV cache FP162-bit = 8× smallercalibration-free · state-of-the-art at 2-bit (paper-reported)
learnaivisually.com/ai-explained/kvarn-hadamard-2bit-kv-cache

The news. On June 2, 2026, researchers at Huawei's Computing Systems Lab posted the KVarN paper to arXiv. It targets the error that wrecks low-bit KV caches during autoregressive decoding, which the authors trace mainly to incorrect per-token scales. KVarN applies a Hadamard rotation to the keys and values, then a dual-scaling variance normalization across both dimensions of the cache — a combination that is calibration-free and reports state-of-the-art 2-bit KV-cache quantization on the generative benchmarks MATH500, AIME24, and HumanEval. Read the paper →

Picture a tray holding a dozen little piles of sand, plus one giant pile that towers over the rest. You want to measure every pile with a cheap 4-notch dipstick — but the dipstick has to span all the way up to the giant pile, so its four notches end up spaced far apart. Every small pile sits below the lowest notch and reads as zero. The cheap ruler isn't wrong; the one outlier ruined it for everyone else. Now spin the tray: the sand redistributes into an even layer, no pile towers anymore, and the same four-notch dipstick suddenly reads every pile accurately. That spin is the whole idea behind KVarN.

In a transformer, the "piles" are the per-channel magnitudes of the KV cache — the stored keys and values the model rereads on every step. A handful of outlier channels run far hotter than the rest. When you try to store the cache in 2 bits — just four quantization levels — the lone outlier stretches the scale so wide that the ordinary channels all snap to the bottom level and effectively vanish. This is the textbook outlier problem in quantization, and it is why naive 2-bit KV caches fall apart. Earlier methods soften it with per-block scales — a different ruler for each small group of values — but a block that still contains an outlier is back to the same stretched ruler.

KVarN spins the tray instead. A Hadamard rotation is an orthogonal transform whose entries are all ±1; multiplying the keys and values by it mixes every channel into every other, so an outlier's magnitude gets spread thin across all dimensions rather than spiking in one. Because a Hadamard matrix is orthogonal the rotation is reversible: applied in matched pairs and undone downstream, it is designed to leave the attention result intact while the distribution the quantizer actually sees becomes flat, with no spikes. On top of that, a dual-scaling variance normalization corrects the per-token scales that the authors blame for the error in the first place, and the whole pipeline is calibration-free: it needs no sample dataset to tune.

The reason that last detail matters is error accumulation. Decoding is autoregressive — each new token reads the quantized cache, so a rounding error baked in at step 10 is still there, and compounding, at step 1,000. A 2-bit cache that looks fine on a short prompt can quietly drift on a long reasoning chain. By removing the outliers that cause the largest rounding errors and fixing the per-token scales, KVarN keeps the per-step error small enough that it doesn't snowball — which is exactly why it is evaluated on long generative benchmarks like MATH500 and AIME24 rather than one-shot accuracy.

Where the 8× actually comes from

The KV cache's size is pure bookkeeping: 2 (keys + values) × layers × KV-heads × head-dim × bytes-per-value, per token. Take an illustrative 80-layer model with 8 KV heads and a 128-dim head (illustrative — KVarN fixes no architecture). In FP16 that is 2 × 80 × 8 × 128 × 2 bytes ≈ 0.31 MB per token, so a 32,768-token context costs about 10.7 GB — often more than the model weights leave room for. Swap the 2 bytes for 2 bits (0.25 bytes) and the same context drops to ~1.3 GB — an 8× cut. The win is mechanical; the hard part is keeping accuracy at 2-bit, and that is the part the Hadamard rotation buys. Less KV memory directly loosens the binding constraint on serving throughput: the freed gigabytes become longer contexts or a bigger batch.

KV-cache schemeBits / valueOutlier handlingCalibration
FP16 baseline16none needed (full precision)n/a
Per-block-scale 2-bit~2local scale per block — a block with an outlier still stretches (setup-dependent, illustrative)often calibrated
KVarN (Hadamard-rotated 2-bit)~2rotation spreads outliers across all channels (paper)calibration-free (paper)

A caveat worth stating plainly: KVarN reports state-of-the-art results at 2-bit, not lossless ones — 2-bit is still a lossy format, and "state-of-the-art" means best among 2-bit methods on the paper's benchmarks, not free. The Hadamard-rotation trick (sometimes called incoherence processing) is a known idea in weight quantization; KVarN's contribution is adapting it to the KV cache during decoding and pairing it with a calibration-free dual-scaling fix for per-token scales. How much you gain depends on how outlier-heavy your model's cache is in the first place — the spinning tray only helps if there were towering piles to level.

Goes deeper in: LLM Internals → Quantization → The Outlier Problem

Related explainers

Continue in trackQuantization: The Outlier Problem

Frequently Asked Questions