The news. On June 5, 2026, a paper titled SigmaScale appeared on arXiv. It tackles a long-standing weakness of SVD-based weight compression: the result depends heavily on how you scale a matrix's rows and columns before truncating, and prior methods fix that scaling with a hand-derived formula. SigmaScale instead learns the scaling, as two diagonal vectors optimized under an activation-aware loss, which lowers the effective rank of the weights so truncated SVD throws away less of what matters. The authors report it stays competitive on perplexity and zero-shot benchmarks for Llama 3.1 8B Instruct and Qwen3-8B.

Picture a sound engineer at a graphic equalizer — a row of sliders, one per frequency band. A finished song is a mix of all those bands, but most of what you actually hear sits in a handful of them; the rest are nearly silent. To shrink the file you could mute the quiet bands and barely notice. That muting is truncated SVD: a weight matrix is broken into ranked "bands" — its singular values — and the quietest are dropped. The catch is that a quiet band isn't always silent, so mute too many and the song goes flat. SigmaScale's move is to re-tune the EQ first — to learn how to scale the matrix so the energy piles into the first few bands, leaving an even quieter tail to cut.

Underneath the metaphor, the "bands" are the directions a weight matrix acts along, and their "loudness" is the matrix's singular values — the numbers a singular value decomposition sorts from largest to smallest. Truncated SVD keeps only the top k of them and stores two skinny matrices in place of one big one. How well that works depends entirely on how fast the singular values fall off: if a matrix spreads its energy evenly across hundreds of directions, cutting any of them hurts. Earlier SVD-compression methods reach for a fixed, analytic rescaling of the rows and columns — set by a formula rather than learned — to steepen that falloff before truncating.
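To make the truncation step concrete, here is a minimal NumPy sketch of plain truncated SVD. The toy matrix, its sizes, and the helper name `truncated_svd` are assumptions made for illustration; nothing here comes from the paper.

```python
import numpy as np

def truncated_svd(W, k):
    """Return factors A (m x k) and B (k x n) whose product is the best rank-k fit to W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # s is sorted largest first
    A = U[:, :k] * s[:k]   # fold the kept singular values into the left factor
    B = Vt[:k, :]
    return A, B, s

rng = np.random.default_rng(0)
# Toy matrix that is approximately rank 16 (hypothetical data, not real model weights).
m, n, k = 64, 48, 16
W = rng.standard_normal((m, 16)) @ rng.standard_normal((16, n))
W += 0.01 * rng.standard_normal((m, n))

A, B, s = truncated_svd(W, k)
err = np.linalg.norm(W - A @ B)       # Frobenius error of the rank-k fit
tail = np.sqrt(np.sum(s[k:] ** 2))    # energy left in the discarded singular values
print(A.shape, B.shape, err, tail)
```

The two printed error figures agree up to floating-point noise: the Frobenius error of the best rank-k fit is exactly the energy in the singular values that were cut, which is why a steep falloff matters so much.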

SigmaScale learns the rescaling instead of deriving it. It optimizes just two vectors — one diagonal scale for the rows, one for the columns — under an activation-aware loss, so the truncation error is pushed onto the directions the model's activations barely use. The effect is to lower the matrix's effective intrinsic rank: the same energy now lives in fewer singular values, so the discarded tail carries less of what matters. The authors report it stays competitive on perplexity and zero-shot tasks for Llama 3.1 8B Instruct and Qwen3-8B — two learned vectors per matrix doing the work a hand-derived formula used to approximate.
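The mechanics of "scale, truncate, undo the scale" can be sketched as follows. This is not the paper's algorithm: the scales here are picked by hand rather than learned, the activation data is synthetic, and the names `scaled_truncate` and `activation_error` and the specific error measure are assumptions chosen only to show why the choice of scaling changes what truncation throws away.

```python
import numpy as np

def truncate(M, k):
    """Best rank-k approximation of M via SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def scaled_truncate(W, row_scale, col_scale, k):
    """Scale rows and columns, truncate to rank k, then undo the scaling.

    row_scale (length m) and col_scale (length n) stand in for the two
    diagonal vectors; here they are supplied by hand, not learned.
    """
    scaled = row_scale[:, None] * W * col_scale[None, :]
    return truncate(scaled, k) / row_scale[:, None] / col_scale[None, :]

def activation_error(W, W_hat, X):
    """Output error on sample activations X (n x batch): ||(W - W_hat) X||_F."""
    return np.linalg.norm((W - W_hat) @ X)

rng = np.random.default_rng(0)
m, n, k = 32, 24, 6
W = rng.standard_normal((m, n))
# Hypothetical activations: a few input channels are much louder than the rest.
channel_gain = np.where(np.arange(n) < 4, 10.0, 0.1)
X = channel_gain[:, None] * rng.standard_normal((n, 256))

plain = truncate(W, k)
# Hand-picked column scale: per-channel RMS of the activations (an assumption).
col_scale = np.sqrt(np.mean(X ** 2, axis=1))
rescaled = scaled_truncate(W, np.ones(m), col_scale, k)

print("plain   :", activation_error(W, plain, X))
print("rescaled:", activation_error(W, rescaled, X))
```

On this toy setup the rescaled error should come out well below the plain one, because weighting the columns by how loud their input channels are steers the truncation error toward channels the activations barely use. Dividing by the scales afterwards keeps the result at rank k, so the scales can be folded into the two stored factors. A learned method would search for the scales instead of fixing them by formula.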

This pulls a different lever than most "shrink the model" work. Quantization shrinks each number by storing it in fewer bits — dropping from 16-bit to 4-bit packs four times as many weights into the same memory. Low-rank compression leaves the bit-width alone and removes whole directions of redundancy instead. The two axes are orthogonal: you can rank-reduce a matrix and then quantize what's left.
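As a rough picture of what "fewer bits per number" means, here is a toy round-to-nearest uniform quantizer. It is a deliberately simplified sketch: the function name `quantize_uniform` and the fixed range of 0 to 4 are assumptions, and production schemes such as GPTQ or AWQ are considerably more involved.

```python
import math

def quantize_uniform(x, bits, lo=0.0, hi=4.0):
    """Round x to the nearest of 2**bits evenly spaced values in [lo, hi]."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    clipped = min(max(x, lo), hi)        # values outside the range saturate
    index = round((clipped - lo) / step)
    return lo + index * step, step

for bits in (2, 4, 8, 16):
    q, step = quantize_uniform(math.pi, bits)
    print(f"{bits:>2} bits: {q:.5f} (error {abs(q - math.pi):.5f}, step {step:.5f})")
```

Each extra bit doubles the number of representable values, so the grid spacing halves and the worst-case rounding error, half a step, shrinks with it.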

[Interactive figure: representable values on a number line from 0 to 4 at a chosen precision. At 32-bit float, which is virtually continuous, π = 3.14159 is stored as ≈ 3.1416 (Δ = 0.00000).]

Where the savings come from

A weight matrix is pure bookkeeping: an m × n matrix stores m × n numbers. Take a 4096 × 4096 layer (an illustrative size; SigmaScale targets no fixed size): that's 16.8 million parameters. Truncated SVD with rank k stores a 4096 × k matrix plus a k × 4096 matrix, or k × (4096 + 4096) numbers. Keep k = 1024 and you store about 8.4 million, a 2× shrink; keep k = 512 and it's 4.2 million, a 4× shrink. The factored form only wins below k = 2048, where it stores exactly as many numbers as the original. Below that break-even point storage falls linearly with k, so every direction you can safely drop is pure savings, and SigmaScale's whole job is to make more of them safe to drop by lowering the effective rank you need to hit the same quality.
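The arithmetic is easy to check in a few lines. The helper names below are hypothetical, and the byte count is a simplification that ignores quantization scales and other metadata.

```python
def dense_params(m, n):
    """Numbers stored by a full m x n weight matrix."""
    return m * n

def low_rank_params(m, n, k):
    """Numbers stored by the two rank-k factors (m x k and k x n)."""
    return k * (m + n)

def storage_bytes(params, bits):
    """Raw payload size; ignores quantization scales and other metadata."""
    return params * bits // 8

m = n = 4096
dense = dense_params(m, n)  # 16_777_216
for k in (2048, 1024, 512):
    kept = low_rank_params(m, n, k)
    print(f"k={k}: {kept:,} numbers, {dense / kept:.1f}x smaller")
# k=2048: 16,777,216 numbers, 1.0x smaller
# k=1024: 8,388,608 numbers, 2.0x smaller
# k=512: 4,194,304 numbers, 4.0x smaller

# Rank reduction and bit-width stack: rank 512 at 4 bits vs dense at 16 bits.
print(storage_bytes(dense, 16) // storage_bytes(low_rank_params(m, n, 512), 4))  # 16
```

The first line of output shows the break-even rank for a square 4096 layer, and the last line shows how the two compression axes multiply when combined.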

| Compression axis | What it shrinks | Typical method | The knob |
| --- | --- | --- | --- |
| Quantization (bit-width) | bits per number | GPTQ, AWQ, QAT | precision: 16 → 8 → 4 bits |
| Low-rank (SVD) | redundant directions | truncated SVD, SigmaScale (paper) | rank k kept per matrix |

A caveat worth stating plainly: low-rank compression only pays when the weights are low-rank to begin with. A matrix whose singular values barely decay has no quiet tail to cut, and no rescaling invents one — SigmaScale lowers the effective rank, it doesn't manufacture redundancy that was never there. And "competitive on the paper's benchmarks" means competitive, not lossless: truncation is still a lossy operation. What the two learned vectors buy is a better exchange rate — more compression per unit of quality lost — by aiming the error where the model can absorb it. For weights that are already dense and full-rank, bit-width quantization is still the lever that pays.
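One rough way to see whether a matrix has a quiet tail at all is to ask how many singular values are needed to hold most of its energy. The sketch below uses a 99% energy threshold, which is an arbitrary illustrative choice and not a definition of effective rank taken from the paper; the function name and the toy matrices are likewise assumptions.

```python
import numpy as np

def rank_for_energy(W, fraction=0.99):
    """Smallest k whose top-k singular values hold `fraction` of the squared energy."""
    s = np.linalg.svd(W, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, fraction)) + 1
    return min(k, len(s))

rng = np.random.default_rng(0)
# A dense random matrix: energy is spread over many directions.
dense = rng.standard_normal((128, 128))
# A rank-8 matrix plus faint noise: almost all energy sits in 8 directions.
low_rank = rng.standard_normal((128, 8)) @ rng.standard_normal((8, 128))
low_rank += 1e-3 * rng.standard_normal((128, 128))

print(rank_for_energy(dense), rank_for_energy(low_rank))
```

The dense random matrix needs most of its directions to reach the threshold, while the near-rank-8 matrix needs at most eight. Truncation is cheap for the second kind and costly for the first.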


Frequently Asked Questions