OScaR — Token Norm Imbalance
Token Norm Imbalance (TNI) is the OScaR paper's name for outsized L2 norms on a few tokens along the KV cache's sequence axis. The paper identifies it as the dominant reason INT2 KV cache compression fails when only channel-wise rotation is in place. Its diagnostic reframes the axis the outliers live on, which decides which fix actually works.
The news. On May 19, 2026, Zunhai Su, Rui Yang, Chao Zhang and eleven co-authors posted OScaR — The Occam's Razor for Extreme KV Cache Quantization. The paper reports near-lossless INT2 KV cache with 3.0× decode speedup over BF16 FlashDecoding-v2, 5.3× KV memory reduction, and 4.1× end-to-end throughput. The headline finding is that the failure mode prior work attacked (channel-wise outliers) was not the dominant one.
Picture the KV cache as a long mixing board. The channel sliders along the top — one per attention feature dimension — are already balanced; the standard trick of multiplying by a Hadamard matrix takes care of that. What channel rotation cannot see is the row of per-track volume sliders below: every token in the sequence has its own L2 norm, and a few specific tokens are turned up far louder than the rest. Squeezing the whole board into INT2 means encoding every slider position with one of four levels, and the loud tracks blow the headroom for everyone.
OScaR names this slider-row variance Token Norm Imbalance. The diagnostic matters because orthogonal rotation across the feature axis is structurally incapable of fixing it. Rotations act within a token's feature vector; they cannot redistribute energy between tokens. So even a perfect channel-wise rotation leaves the loud tracks just as loud, and INT2 quantization keeps collapsing the small ones.
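The claim that a feature-axis rotation cannot move energy between tokens can be checked numerically. The sketch below is illustrative only: it uses synthetic data, hypothetical sizes (256 tokens, 128 channels), and a plain Sylvester Hadamard matrix rather than anything taken from the paper. It plants one outlier channel and one loud token, rotates along the feature axis, and compares channel spread and per-token norms before and after.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix via the Sylvester construction."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def channel_spread(x: np.ndarray) -> float:
    """Largest per-channel RMS divided by the median per-channel RMS."""
    rms = np.sqrt((x ** 2).mean(axis=0))
    return float(rms.max() / np.median(rms))


rng = np.random.default_rng(0)
num_tokens, head_dim = 256, 128      # hypothetical sizes, not from the paper
keys = rng.standard_normal((num_tokens, head_dim))
keys[:, 5] *= 20.0                   # synthetic outlier channel
keys[0] *= 8.0                       # synthetic "loud" token

# Rotate along the feature axis only: each row (token) is rotated on its own.
rotated = keys @ hadamard(head_dim)

norms_before = np.linalg.norm(keys, axis=1)
norms_after = np.linalg.norm(rotated, axis=1)

print("channel spread before:", channel_spread(keys))
print("channel spread after: ", channel_spread(rotated))
print("token norms unchanged:", np.allclose(norms_before, norms_after))
print("loud/median norm ratio before:", norms_before[0] / np.median(norms_before))
print("loud/median norm ratio after: ", norms_after[0] / np.median(norms_after))
```

Because the matrix is orthogonal, every row keeps its L2 norm exactly, so the ratio between the loud token and the typical token is the same on both sides of the rotation even though the channel spread shrinks.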
The fix needs two moves on the sequence axis itself. Canalized Rotation concentrates the outlier energy into a designated subspace — the "canal" — so the variance is bundled rather than smeared, making it easy to target. Omni-Token Scaling then applies a per-token scale factor before INT2 encoding, the way a per-track attenuator works on a mixing board: the loud track gets its own ceiling, the quiet tracks keep their full resolution. The paper describes both as training-free at quantization time.
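To make the per-track attenuator idea concrete, here is a minimal sketch of generic per-token scaling ahead of a four-level quantizer. It is not the paper's Omni-Token Scaling formula, which this article does not spell out, and it omits Canalized Rotation entirely. The absmax scale, the uniform level placement, and the function names are all assumptions chosen for illustration.

```python
import numpy as np


def quantize_int2(x: np.ndarray, scale) -> np.ndarray:
    """Map x / scale in [-1, 1] onto four uniform levels, coded 0..3."""
    codes = np.round((x / scale + 1.0) * 1.5)
    return np.clip(codes, 0, 3).astype(np.uint8)


def dequantize_int2(codes: np.ndarray, scale) -> np.ndarray:
    """Invert quantize_int2: codes 0..3 become -1, -1/3, 1/3, 1 times scale."""
    return (codes.astype(np.float64) / 1.5 - 1.0) * scale


def median_relative_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Median over tokens of ||x - x_hat|| / ||x||."""
    err = np.linalg.norm(x - x_hat, axis=1) / np.linalg.norm(x, axis=1)
    return float(np.median(err))


rng = np.random.default_rng(0)
values = rng.standard_normal((64, 128))   # synthetic (tokens, channels) slab
values[0] *= 50.0                         # one synthetic "loud" token

# One scale for the whole slab: the loud token sets everyone's range.
shared_scale = np.abs(values).max()
# One scale per token (assumed absmax): each row gets its own range.
token_scale = np.abs(values).max(axis=1, keepdims=True)

shared_hat = dequantize_int2(quantize_int2(values, shared_scale), shared_scale)
token_hat = dequantize_int2(quantize_int2(values, token_scale), token_scale)

quiet = slice(1, None)                    # every token except the loud one
err_shared = median_relative_error(values[quiet], shared_hat[quiet])
err_token = median_relative_error(values[quiet], token_hat[quiet])

print("quiet-token error, shared scale:   ", err_shared)
print("quiet-token error, per-token scale:", err_token)
```

With a shared scale, the quiet tokens fall between the two middle levels and are reconstructed at magnitudes set by the loud token. With a per-token scale, each row uses all four levels across its own range, at the cost of storing one extra scale value per token.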
Where this earns its keep is when the KV cache is the bottleneck. The paper benchmarks against BF16 FlashDecoding-v2, already a state-of-the-art decode kernel, and still reports a decode speedup of up to 3.0×. The reason is that a 2-bit value is one-eighth the size of a 16-bit one, so INT2 KV reads move far less data, and decode is overwhelmingly memory-bound. The diagnostic reframe is what makes the speedup usable: targeting the failure mode that dominates the loss at INT2 is what protects quality.
A worked example: 70B model, 64K context
Take a Llama-class 70B-parameter model with 80 layers and 8 KV heads of 128 channels each (GQA). At BF16, one token's KV state is 80 × 2 (K and V) × 8 × 128 × 2 bytes = 327,680 bytes (320 KiB). At 64,000 tokens of context, the KV cache for one request is about 21 GB (illustrative; it varies with head/dim sharding and per-engine packing). Applying the paper's reported 5.3× memory reduction for INT2 with Canalized Rotation and Omni-Token Scaling would bring that from about 21 GB to about 4 GB. The paper reports a corresponding 4.1× end-to-end throughput, with the 3.0× decode kernel speedup feeding it.
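The arithmetic in this example can be reproduced in a few lines. The helper name and the model shape are illustrative, and the 5.3× factor is simply the paper's reported figure applied to this hypothetical configuration, not a measurement.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of K and V state stored for one token across all layers."""
    return layers * 2 * kv_heads * head_dim * bytes_per_value


per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_value=2)   # BF16 = 2 bytes
context_tokens = 64_000
cache_bf16 = per_token * context_tokens

reported_reduction = 5.3                # figure quoted from the paper
cache_int2 = cache_bf16 / reported_reduction

# Average bits per stored value implied by a 5.3x reduction from 16 bits.
implied_bits = 16 / reported_reduction

print(f"per token: {per_token} bytes ({per_token / 1024:.0f} KiB)")
print(f"BF16 cache: {cache_bf16 / 1e9:.1f} GB")
print(f"INT2 cache at 5.3x: {cache_int2 / 1e9:.1f} GB")
print(f"implied bits per value: {implied_bits:.2f}")
```

The implied average of about 3 bits per value is higher than the raw 2-bit payload. One plausible reading is that the difference covers scales and other metadata, though the article does not break this down.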