What is Token Norm Imbalance (TNI)?

Token Norm Imbalance is the OScaR paper's name for variance along the sequence (token) axis of the KV cache: a small number of tokens carry L2 norms much larger than the surrounding tokens. The imbalance lives between tokens, not within a single token's feature vector, so the standard pre-quantization trick of orthogonal channel rotation cannot reduce it. OScaR identifies TNI as the dominant failure mode for INT2 KV cache quantization.
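One way to make the definition concrete is to compute per-token L2 norms on a toy cache and compare the largest against the median. This is a minimal sketch: the `norm_imbalance` ratio is a hypothetical summary statistic chosen for illustration, not a metric taken from the paper, and the cache values are invented.

```python
import math
import statistics

def token_norms(cache):
    """L2 norm of each token's feature vector (one row per token)."""
    return [math.sqrt(sum(v * v for v in row)) for row in cache]

def norm_imbalance(cache):
    """Hypothetical statistic: largest token norm divided by the median token norm."""
    norms = token_norms(cache)
    return max(norms) / statistics.median(norms)

# Toy cache: rows are tokens, columns are channels (values invented for illustration).
cache = [
    [0.3, -0.4, 0.0, 0.0],   # small-norm token
    [0.0, 0.6, -0.8, 0.0],   # typical token
    [0.5, 0.5, 0.5, 0.5],    # typical token
    [6.0, -8.0, 0.0, 0.0],   # high-norm outlier token
]

norms = token_norms(cache)
ratio = norm_imbalance(cache)
print("per-token norms:", [round(n, 3) for n in norms])
print("max / median:", round(ratio, 3))
```

A ratio near 1 would mean the tokens are balanced; a large ratio means a few tokens dominate the sequence axis, which is the situation the paper calls TNI.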

Why does INT2 KV cache compression fail on Token Norm Imbalance?

INT2 has only four levels per value, so the quantization range has to cover every value that shares it. If even a few tokens have outsized L2 norms, the range stretches to fit them and the rest of the tokens collapse onto one or two levels, losing the resolution where most of the cache lives. Channel-wise rotation, the prior pre-quantization fix, leaves cross-token variance untouched because rotations act within each token's feature vector and cannot redistribute energy between tokens.
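The collapse can be reproduced on a toy example. The sketch below assumes plain uniform INT2 quantization with one range shared across all tokens; the helper name `quantize_int2` and the numbers are illustrative, not taken from the paper.

```python
def quantize_int2(values, lo, hi):
    """Uniform 2-bit quantization: map each value to a code 0..3 across [lo, hi]."""
    step = (hi - lo) / 3  # 4 levels means 3 intervals
    if step == 0:
        return [0 for _ in values]
    return [min(3, max(0, round((v - lo) / step))) for v in values]

# Toy cache: three ordinary tokens and one high-norm token, 4 channels each.
tokens = [
    [0.10, -0.20, 0.15, -0.05],
    [-0.12, 0.08, -0.18, 0.20],
    [0.05, 0.17, -0.09, -0.14],
    [9.00, -8.50, 7.80, -9.20],  # the outlier token stretches the shared range
]

flat = [v for t in tokens for v in t]
lo, hi = min(flat), max(flat)
codes = [quantize_int2(t, lo, hi) for t in tokens]

# Which of the four levels do the three ordinary tokens actually use?
small_levels = {c for t in codes[:3] for c in t}
print("codes:", codes)
print("levels used by ordinary tokens:", sorted(small_levels))
```

With the range pinned by the outlier token, the ordinary tokens only ever land on the two middle codes, so their values become nearly indistinguishable after dequantization.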

How does OScaR relate to Hadamard rotation and prior INT4 KV schemes?

Prior INT4 KV cache schemes (QuaRot, SpinQuant, KIVI's channel-side variants) used orthogonal channel-axis rotations, typically Hadamard, to flatten per-channel outliers. OScaR builds on the per-channel paradigm but redirects the rotation to address the sequence-axis variance: Canalized Rotation concentrates the high-variance subspace, and Omni-Token Scaling applies a per-token scale factor so each token quantizes within its own range. The paper describes the combination as training-free at quantization time and reports that it unlocks INT2, where prior work topped out at INT4.

OScaR paper — Token Norm Imbalance

Jargon

KV cache: The key-value cache stores attention K and V tensors for every past token so decode doesn't recompute them. It's the dominant memory consumer in long-context serving. See LLM Internals → KV Cache → Memory.
INT2 quantization: Encoding each value with only 2 bits, i.e. one of four levels. Extreme compression, but the four levels have to cover the full value range — a single outlier stretches the range and crushes everything else. See LLM Internals → Quantization → Outlier Problem.
channel-wise rotation: Multiplying the KV tensor by an orthogonal matrix (typically Hadamard) along the channel/feature dimension so per-channel outliers spread into many smaller ones. It's the standard pre-quantization step prior INT4 KV cache schemes used — but it only addresses channel-axis variance.
Token Norm Imbalance (TNI): The OScaR paper's name for variance along the sequence axis: a small number of tokens have L2 norms much larger than the rest. Channel rotation leaves this untouched because it acts within each token, not across tokens.
Canalized Rotation: OScaR's first mechanism. A rotation described as routing the high-variance subspace into a dedicated channel band (the "canal") so the outlier energy is concentrated, not spread, and downstream scaling can target it directly.
Omni-Token Scaling (OTS): OScaR's second mechanism. A per-token scale factor applied before INT2 encoding so each token quantizes within its own dynamic range, leaving small tokens unaffected by neighbours' spikes.

Token Norm Imbalance (TNI) is the OScaR paper's name for outsized L2 norms on a few tokens along the KV cache's sequence axis — the dominant reason INT2 KV cache compression fails when only channel-wise rotation is in place. The paper's diagnostic reframes the axis the outliers live on, which decides which fix actually works.

The news. On May 19, 2026, Zunhai Su, Rui Yang, Chao Zhang and eleven co-authors posted OScaR — The Occam's Razor for Extreme KV Cache Quantization. The paper reports near-lossless INT2 KV cache with 3.0× decode speedup over BF16 FlashDecoding-v2, 5.3× KV memory reduction, and 4.1× end-to-end throughput — and the headline finding is that the failure mode prior work attacked (channel-wise outliers) was not the dominant one.

Picture the KV cache as a long mixing board. The channel sliders along the top — one per attention feature dimension — are already balanced; the standard trick of multiplying by a Hadamard matrix takes care of that. What channel rotation cannot see is the row of per-track volume sliders below: every token in the sequence has its own L2 norm, and a few specific tokens are turned up far louder than the rest. Squeezing the whole board into INT2 means encoding every slider position with one of four levels, and the loud tracks blow the headroom for everyone.

OScaR names this slider-row variance Token Norm Imbalance. The diagnostic matters because orthogonal rotation across the feature axis is structurally incapable of fixing it. Rotations act within a token's feature vector; they cannot redistribute energy between tokens. So even a perfect channel-wise rotation leaves the loud tracks just as loud, and INT2 quantization keeps collapsing the small ones.
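The structural claim can be checked numerically: an orthogonal rotation spreads a channel outlier across channels but leaves each token's L2 norm exactly where it was. The sketch below builds a normalized Hadamard matrix with the Sylvester construction; the helper names and the two toy tokens are assumptions made for illustration.

```python
import math

def hadamard(n):
    """Normalized n x n Hadamard matrix (Sylvester construction; n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def rotate(token, matrix):
    """Rotate one token's feature vector: y = H x."""
    return [sum(m * x for m, x in zip(row, token)) for row in matrix]

def l2(vec):
    return math.sqrt(sum(x * x for x in vec))

H = hadamard(4)
quiet = [0.1, -0.2, 0.1, 0.0]
loud = [8.0, 0.0, 0.0, 0.0]  # one channel outlier on a high-norm token

quiet_rot = rotate(quiet, H)
loud_rot = rotate(loud, H)

print("loud before:", loud, "after:", loud_rot)
print("loud norm before/after:", l2(loud), l2(loud_rot))
print("quiet norm before/after:", l2(quiet), l2(quiet_rot))
```

The largest channel value on the loud token shrinks, which is what channel rotation is for, yet the gap between the loud and quiet token norms is unchanged.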

The fix needs two moves on the sequence axis itself. Canalized Rotation concentrates the outlier energy into a designated subspace — the "canal" — so the variance is bundled rather than smeared, making it easy to target. Omni-Token Scaling then applies a per-token scale factor before INT2 encoding, the way a per-track attenuator works on a mixing board: the loud track gets its own ceiling, the quiet tracks keep their full resolution. The paper describes both as training-free at quantization time.
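The per-track attenuator idea can be illustrated with generic per-token scaling. This is a sketch under stated assumptions, not the paper's Omni-Token Scaling: it uses each token's maximum absolute value as a hypothetical scale, omits Canalized Rotation entirely, and runs on invented values.

```python
def int2_roundtrip(values, lo, hi):
    """Quantize to 4 uniform levels spanning [lo, hi], then dequantize."""
    step = (hi - lo) / 3  # 4 levels means 3 intervals
    if step == 0:
        return list(values)
    codes = [min(3, max(0, round((v - lo) / step))) for v in values]
    return [lo + c * step for c in codes]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

tokens = [
    [0.10, -0.20, 0.15, -0.05],
    [-0.12, 0.08, -0.18, 0.20],
    [9.00, -8.50, 7.80, -9.20],  # high-norm token
]

# Baseline: one range shared by every token.
flat = [v for t in tokens for v in t]
shared = [int2_roundtrip(t, min(flat), max(flat)) for t in tokens]

# Per-token scaling: normalize each token by its own scale, quantize in [-1, 1], rescale.
per_token = []
for t in tokens:
    scale = max(abs(v) for v in t) or 1.0  # assumed scale choice; stored next to the codes
    unit = [v / scale for v in t]
    per_token.append([v * scale for v in int2_roundtrip(unit, -1.0, 1.0)])

err_shared = mean_abs_error(tokens[0], shared[0])
err_per_token = mean_abs_error(tokens[0], per_token[0])
print("quiet-token error, shared range:", round(err_shared, 4))
print("quiet-token error, per-token scale:", round(err_per_token, 4))
```

The trade-off in this sketch is one extra scale value stored per token, in exchange for the quiet tokens keeping all four levels inside their own range.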

Where this earns its keep is when the KV cache is the bottleneck. The paper benchmarks against BF16 FlashDecoding-v2, already a state-of-the-art decode kernel, and still reports an "up to" 3.0× decode speedup, because INT2 KV values take one-eighth the bits of BF16 (2 versus 16) and decode is overwhelmingly memory-bound. The diagnostic reframe is what makes that speedup usable: targeting the failure mode that dominates the loss at INT2 is what protects quality.

A worked example: 70B model, 64K context

Take a Llama-class 70B-parameter model with 80 attention layers and 8 KV heads of 128 channels each (GQA). At BF16, one token's KV state is 80 × 2 (K and V) × 8 × 128 × 2 bytes = 327,680 bytes (320 KiB). At 64,000 tokens of context, the KV cache for one request is about 21 GB (illustrative; it varies with head/dim sharding and per-engine packing). OScaR's INT2 + canal + OTS lands at the paper's 5.3× memory reduction: ~21 GB → ~4 GB. The paper reports a corresponding 4.1× end-to-end throughput gain, with the 3.0× decode kernel speedup feeding it.
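The arithmetic above can be reproduced in a few lines. The model shape comes from the worked example and the 5.3× factor is the paper's reported figure; the "raw 2-bit payload" line is an extra back-of-envelope number that counts only the 2-bit values, and the gap between it and the reported figure is not broken down here.

```python
# Model shape from the worked example.
layers = 80
kv_heads = 8
head_dim = 128
bytes_per_value_bf16 = 2
context_tokens = 64_000

# K and V each store kv_heads * head_dim values per layer per token.
bytes_per_token = layers * 2 * kv_heads * head_dim * bytes_per_value_bf16
cache_bytes_bf16 = bytes_per_token * context_tokens

# Reported reduction from the paper, applied to the BF16 footprint.
cache_bytes_reported = cache_bytes_bf16 / 5.3

# Raw 2-bit payload only (2 bits versus 16), ignoring any stored scales or metadata.
cache_bytes_raw_int2 = cache_bytes_bf16 * 2 / 16

print(f"per token (BF16): {bytes_per_token / 1024:.0f} KiB")
print(f"cache at BF16: {cache_bytes_bf16 / 1e9:.1f} GB")
print(f"cache at reported 5.3x reduction: {cache_bytes_reported / 1e9:.1f} GB")
print(f"raw 2-bit payload: {cache_bytes_raw_int2 / 1e9:.1f} GB")
```

Changing `context_tokens` or the head counts shows how quickly the BF16 footprint grows with context length, which is why the per-request saving matters for long-context serving.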

Goes deeper in: LLM Internals → KV Cache → Memory and LLM Internals → Quantization → Outlier Problem.
