The news. On June 5, 2026, a paper titled SigmaScale appeared on arXiv. It tackles a long-standing weakness of SVD-based weight compression: the result depends heavily on how you scale a matrix's rows and columns before truncating, and prior methods fix that scaling with a hand-derived formula. SigmaScale instead learns the scaling (two diagonal vectors optimized under an activation-aware loss), which lowers the effective rank of the weights so truncated SVD throws away less of what matters. The authors report it stays competitive on perplexity and zero-shot benchmarks for Llama 3.1 8B Instruct and Qwen3-8B.
Picture a sound engineer at a graphic equalizer — a row of sliders, one per frequency band. A finished song is a mix of all those bands, but most of what you actually hear sits in a handful of them; the rest are nearly silent. To shrink the file you could mute the quiet bands and barely notice. That muting is truncated SVD: a weight matrix is broken into ranked "bands" — its singular values — and the quietest are dropped. The catch is that a quiet band isn't always silent, so mute too many and the song goes flat. SigmaScale's move is to re-tune the EQ first — to learn how to scale the matrix so the energy piles into the first few bands, leaving an even quieter tail to cut.
Underneath the metaphor, the "bands" are the directions a weight matrix acts along, and their "loudness" is the matrix's singular values — the numbers a singular value decomposition sorts from largest to smallest. Truncated SVD keeps only the top k of them and stores two skinny matrices in place of one big one. How well that works depends entirely on how fast the singular values fall off: if a matrix spreads its energy evenly across hundreds of directions, cutting any of them hurts. Earlier SVD-compression methods reach for a fixed, analytic rescaling of the rows and columns — set by a formula rather than learned — to steepen that falloff before truncating.
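To make the "two skinny matrices" concrete, here is a minimal NumPy sketch of plain truncated SVD on a toy matrix. The helper name `truncated_svd`, the 64 × 64 shape, and the rank-8-plus-noise construction are all illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

def truncated_svd(W, k):
    """Return A (m x k), B (k x n) and the full spectrum; A @ B is the rank-k truncation of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # s comes back sorted largest to smallest
    A = U[:, :k] * s[:k]   # fold the kept singular values into the left factor
    B = Vt[:k, :]
    return A, B, s

rng = np.random.default_rng(0)
# Toy matrix with a fast-decaying spectrum: a rank-8 signal plus faint noise.
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
W += 1e-3 * rng.standard_normal((64, 64))

A, B, s = truncated_svd(W, k=8)
err = np.linalg.norm(W - A @ B)        # Frobenius norm of what truncation removed
tail = np.sqrt(np.sum(s[8:] ** 2))     # energy left in the dropped singular values
print(A.shape, B.shape, err, tail)
```

The two printed numbers agree: in Frobenius norm, the truncation error is exactly the energy of the singular values that were dropped, which is why a steeper falloff means a cheaper cut.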
SigmaScale learns the rescaling instead of deriving it. It optimizes just two vectors — one diagonal scale for the rows, one for the columns — under an activation-aware loss, so the truncation error is pushed onto the directions the model's activations barely use. The effect is to lower the matrix's effective intrinsic rank: the same energy now lives in fewer singular values, so the discarded tail carries less of what matters. The authors report it stays competitive on perplexity and zero-shot tasks for Llama 3.1 8B Instruct and Qwen3-8B — two learned vectors per matrix doing the work a hand-derived formula used to approximate.
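The mechanics of "scale, truncate, undo the scale" can be sketched without any training. The code below is not the paper's algorithm: it leaves the row scale at identity and sets the column scale by hand to the per-channel activation magnitude, as a stand-in for a learned vector. The function names, the synthetic calibration matrix `X`, and the output-error measure are assumptions made for illustration.

```python
import numpy as np

def truncate(M, k):
    """Best rank-k approximation of M in Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def scaled_truncate(W, r, c, k):
    """Truncate diag(r) @ W @ diag(c) to rank k, then undo both scalings."""
    scaled = r[:, None] * W * c[None, :]
    return truncate(scaled, k) / r[:, None] / c[None, :]

def activation_error(W, W_hat, X):
    """Frobenius norm of the output error on activations X (inputs x samples)."""
    return np.linalg.norm((W - W_hat) @ X)

rng = np.random.default_rng(0)
m, n, samples, k = 32, 32, 256, 8
W = rng.standard_normal((m, n))

# Hypothetical calibration activations: uncorrelated input channels whose
# magnitudes span two orders of magnitude, so a few channels dominate.
channel_scale = np.logspace(0, -2, n)
Q, _ = np.linalg.qr(rng.standard_normal((samples, n)))  # orthonormal columns
X = channel_scale[:, None] * Q.T                        # shape (n, samples)

r = np.ones(m)       # row scale left at identity in this sketch
c = channel_scale    # column scale: hand-set stand-in for a learned vector

plain = activation_error(W, truncate(W, k), X)
scaled = activation_error(W, scaled_truncate(W, r, c, k), X)
print(plain, scaled)
```

At the same rank, the scaled version leaves a smaller output error because the truncation is decided in a space where heavily used input channels count for more. In this toy the channels are uncorrelated by construction, which is what lets a hand-set diagonal scale do the job; real activations are not that tidy.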
This pulls a different lever than most "shrink the model" work. Quantization shrinks each number by storing it in fewer bits — dropping from 16-bit to 4-bit packs four times as many weights into the same memory. Low-rank compression leaves the bit-width alone and removes whole directions of redundancy instead. The two axes are orthogonal: you can rank-reduce a matrix and then quantize what's left.
Where the savings come from
The storage arithmetic is pure bookkeeping: an m × n matrix stores m × n numbers. Take a 4096 × 4096 layer (an illustrative size; SigmaScale targets no fixed size): that's 16.8 million parameters. Truncated SVD with rank k stores a 4096 × k matrix plus a k × 4096 matrix, k × (4096 + 4096) numbers in all. Keep k = 1024 and you store about 8.4 million, a 2× shrink; keep k = 512 and it's 4.2 million, a 4× shrink. The factored form only wins below k = 2048, the point where the two factors together are as large as the original. Under that line, storage falls linearly with k, so every direction you can safely drop is pure savings, and SigmaScale's whole job is to make more of them safe to drop by lowering the effective rank you need to hit the same quality.
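The same arithmetic as a few lines of Python, for anyone who wants to plug in other shapes. The function names are made up for this sketch, and the 4096 × 4096 layer is the illustrative size from the paragraph above.

```python
def dense_params(m, n):
    """Numbers stored by a full m x n matrix."""
    return m * n

def low_rank_params(m, n, k):
    """Numbers stored by an m x k factor plus a k x n factor."""
    return k * (m + n)

def break_even_rank(m, n):
    """Largest k for which the factored form is strictly smaller than the dense matrix."""
    return (m * n - 1) // (m + n)

m = n = 4096
for k in (2048, 1024, 512):
    ratio = dense_params(m, n) / low_rank_params(m, n, k)
    print(k, low_rank_params(m, n, k), ratio)

print(break_even_rank(m, n))
```

For this layer, k = 2048 stores exactly as many numbers as the dense matrix, so factoring only starts to save space from k = 2047 downward.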
| Compression axis | What it shrinks | Typical method | The knob |
|---|---|---|---|
| Quantization (bit-width) | bits per number | GPTQ, AWQ, QAT | precision: 16 → 8 → 4 bits |
| Low-rank (SVD) | redundant directions | truncated SVD, SigmaScale (paper) | rank k kept per matrix |
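
Because the two rows of the table turn different knobs, their savings multiply. A rough sketch for the same illustrative 4096 × 4096 layer follows; it counts raw weight bits only and ignores the scales and other metadata real quantization formats store, and it says nothing about how much quality survives either step.

```python
def mebibytes(num_params, bits):
    """Raw storage in MiB for num_params values at the given bit-width."""
    return num_params * bits / 8 / 2**20

m = n = 4096
k = 1024
dense = m * n             # full matrix
factored = k * (m + n)    # two rank-k factors

for label, params, bits in [
    ("dense, 16-bit", dense, 16),
    ("dense, 4-bit", dense, 4),
    ("rank 1024, 16-bit", factored, 16),
    ("rank 1024, 4-bit", factored, 4),
]:
    print(f"{label:>18}: {mebibytes(params, bits):5.1f} MiB")
```

The four cases come to 32, 8, 16 and 4 MiB: the 2× from rank reduction and the 4× from the bit-width stack to 8×.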
A caveat worth stating plainly: low-rank compression only pays when the weights are low-rank to begin with. A matrix whose singular values barely decay has no quiet tail to cut, and no rescaling invents one — SigmaScale lowers the effective rank, it doesn't manufacture redundancy that was never there. And "competitive on the paper's benchmarks" means competitive, not lossless: truncation is still a lossy operation. What the two learned vectors buy is a better exchange rate — more compression per unit of quality lost — by aiming the error where the model can absorb it. For weights that are already dense and full-rank, bit-width quantization is still the lever that pays.
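One quick way to check whether a matrix has a quiet tail at all is to ask how many singular values it takes to hold most of the energy. The 95% threshold and the helper `rank_for_energy` below are assumptions chosen for this sketch, a simple proxy rather than the paper's definition of effective rank.

```python
import numpy as np

def rank_for_energy(W, fraction=0.95):
    """Smallest k whose top-k singular values hold `fraction` of the squared spectrum."""
    s = np.linalg.svd(W, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, fraction)) + 1   # first index reaching the threshold
    return min(k, len(s))

rng = np.random.default_rng(0)

# Compressible toy: a rank-8 signal plus faint noise.
low_rank = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
low_rank += 1e-3 * rng.standard_normal((64, 64))

# Incompressible toy: i.i.d. Gaussian entries, energy spread across all directions.
dense = rng.standard_normal((64, 64))

k_low = rank_for_energy(low_rank)
k_dense = rank_for_energy(dense)
print(k_low, k_dense)
```

The first matrix needs at most 8 directions to reach the threshold; the second needs far more, which is the "no quiet tail to cut" case where truncation at any useful rank throws away real signal.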
Related explainers
- Gemma 4 QAT — Quantization-Aware Training — the other way to shrink weights: cut the bit-width instead of the rank, and train the model to survive it
- Code2LoRA — hypernetwork-generated adapters — low-rank factorization used to add knowledge cheaply, the mirror image of using it to remove redundancy
- LongLive 2.0 — NVFP4 W4A4 — pushing the bit-width axis to its limit, where SigmaScale pushes the rank axis instead