What is Hadamard-rotated 2-bit KV-cache quantization?

It is the method in the KVarN paper for storing a transformer's KV cache in 2 bits. Before quantizing, KVarN multiplies the keys and values by a Hadamard matrix — an orthogonal ±1 transform — which spreads outlier channels evenly across all dimensions without changing attention's math. With the outliers gone, the four 2-bit levels fit every channel, and a dual-scaling variance normalization corrects per-token scales. The whole process is calibration-free.

Why does an outlier ruin 2-bit quantization?

2-bit gives you only four storage levels. The scale has to span from the smallest value to the largest, so a single outlier channel forces those four levels far apart. Every ordinary value then falls below the lowest level and rounds to zero, destroying the information. A Hadamard rotation removes the lone spike by smearing its magnitude across all channels, so the four levels can sit snugly over the real distribution.

How is KVarN different from per-block-scale KV quantization?

Per-block-scale methods (like TurboQuant) fight outliers locally — each small block of values gets its own scale. But any block that still contains an outlier is back to a stretched ruler. KVarN instead rotates the data so outliers don't exist in any block, then quantizes. It also targets error accumulation directly: by fixing the per-token scales the authors blame for low-bit failure, it keeps the rounding error from compounding across long decodes — and it needs no calibration data.

KVarN squeezes the KV cache to 2 bits — Hadamard rotation

KVarN — Hadamard-rotated 2-bit KV cache

LLM

learnaivisually.com/ai-explained/kvarn-hadamard-2bit-kv-cache

Jargon

KV cache: The stored keys and values from every past token, kept so the model doesn't recompute them each step. It is the dominant memory cost in long-context inference.
Quantization: Storing a number in fewer bits. 2-bit means each value is snapped to one of just four levels — 8× smaller than 16-bit, but only if the values line up with those levels.
Outlier channel: A dimension whose values are far larger than the rest. One outlier forces the quantization scale wide, and the small values then collapse toward zero — the classic outlier problem.
Hadamard rotation: An orthogonal transform built from ±1 entries that mixes every channel into every other. It preserves the math the attention needs but smears each outlier's magnitude evenly across all channels, so no single spike remains.
Variance normalization (dual-scaling): KVarN rescales the cache along both of its dimensions so each token's values share a comparable spread — the paper attributes low-bit failure to incorrect per-token scales, and this directly fixes them.
Calibration-free: Needs no held-out sample data to set its scales. Calibration-based methods can overfit the calibration set or drift on unusual inputs; KVarN's rotation and scaling are computed from the cache itself.
Error accumulation: In autoregressive decoding each new token reads the quantized cache, so a small rounding error early on feeds into later steps and grows. Stopping that compounding is what keeps 2-bit usable on long reasoning chains.

The news. On June 2, 2026, researchers at Huawei's Computing Systems Lab posted the KVarN paper to arXiv. It targets the error that wrecks low-bit KV caches during autoregressive decoding, which the authors trace mainly to incorrect per-token scales. KVarN applies a Hadamard rotation to the keys and values, then a dual-scaling variance normalization across both dimensions of the cache — a combination that is calibration-free and reports state-of-the-art 2-bit KV-cache quantization on the generative benchmarks MATH500, AIME24, and HumanEval. Read the paper →

Picture a tray holding a dozen little piles of sand, plus one giant pile that towers over the rest. You want to measure every pile with a cheap 4-notch dipstick — but the dipstick has to span all the way up to the giant pile, so its four notches end up spaced far apart. Every small pile sits below the lowest notch and reads as zero. The cheap ruler isn't wrong; the one outlier ruined it for everyone else. Now spin the tray: the sand redistributes into an even layer, no pile towers anymore, and the same four-notch dipstick suddenly reads every pile accurately. That spin is the whole idea behind KVarN.

In a transformer, the "piles" are the per-channel magnitudes of the KV cache — the stored keys and values the model rereads on every step. A handful of outlier channels run far hotter than the rest. When you try to store the cache in 2 bits — just four quantization levels — the lone outlier stretches the scale so wide that the ordinary channels all snap to the bottom level and effectively vanish. This is the textbook outlier problem in quantization, and it is why naive 2-bit KV caches fall apart. Earlier methods soften it with per-block scales — a different ruler for each small group of values — but a block that still contains an outlier is back to the same stretched ruler.

KVarN spins the tray instead. A Hadamard rotation is an orthogonal transform whose entries are all ±1; multiplying the keys and values by it mixes every channel into every other, so an outlier's magnitude gets spread thin across all dimensions rather than spiking in one. Because a Hadamard matrix is orthogonal the rotation is reversible: applied in matched pairs and undone downstream, it is designed to leave the attention result intact while the distribution the quantizer actually sees becomes flat, with no spikes. On top of that, a dual-scaling variance normalization corrects the per-token scales that the authors blame for the error in the first place, and the whole pipeline is calibration-free: it needs no sample dataset to tune.

The reason that last detail matters is error accumulation. Decoding is autoregressive — each new token reads the quantized cache, so a rounding error baked in at step 10 is still there, and compounding, at step 1,000. A 2-bit cache that looks fine on a short prompt can quietly drift on a long reasoning chain. By removing the outliers that cause the largest rounding errors and fixing the per-token scales, KVarN keeps the per-step error small enough that it doesn't snowball — which is exactly why it is evaluated on long generative benchmarks like MATH500 and AIME24 rather than one-shot accuracy.

Where the 8× actually comes from

The KV cache's size is pure bookkeeping: 2 (keys + values) × layers × KV-heads × head-dim × bytes-per-value, per token. Take an illustrative 80-layer model with 8 KV heads and a 128-dim head (illustrative — KVarN fixes no architecture). In FP16 that is 2 × 80 × 8 × 128 × 2 bytes ≈ 0.31 MB per token, so a 32,768-token context costs about 10.7 GB — often more than the model weights leave room for. Swap the 2 bytes for 2 bits (0.25 bytes) and the same context drops to ~1.3 GB — an 8× cut. The win is mechanical; the hard part is keeping accuracy at 2-bit, and that is the part the Hadamard rotation buys. Less KV memory directly loosens the binding constraint on serving throughput: the freed gigabytes become longer contexts or a bigger batch.

KV-cache scheme	Bits / value	Outlier handling	Calibration
FP16 baseline	16	none needed (full precision)	n/a
Per-block-scale 2-bit	~2	local scale per block — a block with an outlier still stretches (setup-dependent, illustrative)	often calibrated
KVarN (Hadamard-rotated 2-bit)	~2	rotation spreads outliers across all channels (paper)	calibration-free (paper)

A caveat worth stating plainly: KVarN reports state-of-the-art results at 2-bit, not lossless ones — 2-bit is still a lossy format, and "state-of-the-art" means best among 2-bit methods on the paper's benchmarks, not free. The Hadamard-rotation trick (sometimes called incoherence processing) is a known idea in weight quantization; KVarN's contribution is adapting it to the KV cache during decoding and pairing it with a calibration-free dual-scaling fix for per-token scales. How much you gain depends on how outlier-heavy your model's cache is in the first place — the spinning tray only helps if there were towering piles to level.

Goes deeper in: LLM Internals → Quantization → The Outlier Problem

Related explainers

TurboQuant — 2-bit KV cache with per-block scales — the per-block-scale approach KVarN's rotation is meant to beat
Quantization-conditioned attack — outlier injection — what happens when an outlier is weaponized against per-block scales
SP-KV — self-pruned KV cache — a complementary lever: drop tokens from the cache instead of shrinking each value

Continue in trackQuantization: The Outlier Problem

Frequently Asked Questions

Check what you knowMap your AI & GPU knowledge across every track — free, role-based