SOP paper — Hardware-aware per-layer PTQ at FP6
The news. On May 14, 2026, the SOP paper showed that a hardware-aware per-layer PTQ method, operating in the 4.5–6 bits-per-weight range, can beat a fixed FP8 codebook on reconstruction error across six open model families. The trick is to stop treating quantization as one global decision and instead search a small codebook per layer, weighted by activations and allowed to promote the most sensitive layers a few extra bits. Read the paper →
Picture the touch-up painter. The simple way is the one most people use: take a sample card with every shade in the catalog — 256 shades on a fan deck — and walk it around the house, matching each patch against the nearest swatch. That's vanilla FP8. It's safe, predictable, and works fine; the misses are small because the deck is dense. But the deck is bulky, and most rooms only use a handful of those shades anyway.
Now picture two cheaper alternatives. The first is the naive 64-shade card — the same shades in every room, but a quarter as many. That's vanilla FP6: smaller, but the colors get visibly off in any room with an unusual hue because the deck wasn't built for that room. A naive uniform FP6 quantizer tends to drift behind FP8 on accuracy for exactly this reason: the codebook can't cover the long tail of weight values that any single layer might care about — which is the gap SOP's per-layer codebook is built to close.
SOP is the second alternative. Bring a 64-shade card to each room, but pick the 64 shades fresh per room, based on a quick walk-through of which shades the walls actually use. The bedroom's deck is heavy on warm beiges. The kitchen's deck is heavy on cool grays. Selected high-traffic rooms, where the eye lands most, get a promoted, bigger card to keep highlights crisp. Most rooms hold to 64; a few rooms get more; the average across the house lands in the same territory as the 4.5–6 bits-per-weight budget the SOP paper studies. Because every room's deck was picked from that room's actual color distribution, the average match comes out closer than the universal 256-shade card managed, even though most rooms carry a quarter the swatches.
In model terms, that walk-through is the activation-weighted selection step: the calibration set tells SOP which weights see the most action in each layer, and those weights bias the per-layer codebook search. The promotion step is the paper's nod to the layer-sensitivity reality that the Quantization module covers: some layers are sharply more intolerant of low precision than others, and a uniform-bit budget either underserves those layers or overpays everywhere else. SOP shifts precision toward where it pays off.
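To make the selection step concrete, here is a minimal sketch of what an activation-weighted per-layer codebook fit could look like. It is not the paper's algorithm: the importance-weighted 1-D k-means, the `fit_layer_codebook` name, and the way importance scores enter are illustrative assumptions; SOP's actual codebook search may differ.

```python
import numpy as np

def fit_layer_codebook(weights, importance, n_entries=64, n_iters=25, seed=0):
    """Fit one layer's codebook with importance-weighted 1-D k-means (a sketch).

    weights:    flat array of the layer's weights
    importance: per-weight score derived from calibration activations
                (hypothetical stand-in for SOP's activation weighting)
    n_entries:  codebook size; a power of two so indices pack into whole bits
    """
    rng = np.random.default_rng(seed)
    probs = importance / importance.sum()
    # Initialize codewords by sampling weights where the importance mass sits.
    codebook = rng.choice(weights, size=n_entries, replace=False, p=probs)
    for _ in range(n_iters):
        # Assign every weight to its nearest codeword.
        # (Chunk this for real layer sizes; the full pairwise matrix is large.)
        idx = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each codeword to the importance-weighted mean of its cluster.
        for k in range(n_entries):
            mask = idx == k
            if mask.any() and importance[mask].sum() > 0:
                codebook[k] = np.average(weights[mask], weights=importance[mask])
    # Final assignment against the updated codebook.
    idx = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx
```

Dequantization is then a table lookup, `codebook[idx]`, with each index stored in log2(n_entries) bits, which is where the 64-entry codebook lands at 6 bits per weight.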
Why FP6 beats FP8 on reconstruction even at fewer bits
A useful intuition: reconstruction error is driven by how densely codebook entries cover each weight's local neighborhood, not by the total number of entries globally. A fixed FP8 codebook has to span the full dynamic range every layer might produce, so within any single layer's actual range only a fraction of those 256 slots end up well-placed. SOP's per-layer codebook concentrates all 64 entries around the values that layer actually produces. The local density is higher — and local density is what reconstruction error tends to depend on.
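A toy numeric check of that intuition, with uniform grids standing in for codebooks. The layer scales below are made up, real FP8 spacing is non-uniform, and SOP's searched codebook is not a uniform grid; the sketch only illustrates the local-density argument.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmse(values, codebook):
    """Normalized MSE when each value snaps to its nearest codebook entry."""
    idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
    err = values - codebook[idx]
    return float(np.mean(err ** 2) / np.var(values))

# Toy "layers" whose weight scales differ, as they do across transformer blocks.
layers = {s: rng.normal(0.0, s, 20_000) for s in (0.01, 0.05, 0.25)}

# A shared 256-entry grid must span the widest range any layer produces.
full_range = max(np.abs(w).max() for w in layers.values())
shared_256 = np.linspace(-full_range, full_range, 256)

for scale, w in layers.items():
    # A per-layer 64-entry grid covers only this layer's own range.
    local_64 = np.linspace(w.min(), w.max(), 64)
    print(f"layer scale {scale}: shared-256 NMSE {nmse(w, shared_256):.1e}, "
          f"per-layer-64 NMSE {nmse(w, local_64):.1e}")
```

The gap is largest for the narrow layers, because most of the shared grid's 256 slots sit outside the range those layers ever use; the widest layer, whose own range set the shared grid, is the one place a dense shared grid still holds its own, which is part of why selective promotion exists.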
Layer promotion handles the corner case where uniform low-bit quantization loses quality through the layers most sensitive to it. Spending an extra 1–2 bits on those selected layers costs little in average weight size but recovers much of the outlier-driven quality loss that a uniform low-bit budget would otherwise incur. The result, across six open model families: FP6 with SOP comes in below fixed FP8 on reconstruction error, with a 1.5-bit-per-weight storage win.
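The budget arithmetic is easy to picture with a small sketch. The promotion rule (top few percent by sensitivity), the promoted bit-width, and the sensitivity scores below are assumptions for illustration, not the paper's actual criterion.

```python
import numpy as np

def allocate_bits(sensitivity, n_weights, base_bits=6, promoted_bits=8,
                  promote_frac=0.05):
    """Promote the most sensitive layers to a wider codebook and report the
    average bits per weight the mixed allocation actually costs."""
    names = list(sensitivity)
    cutoff = np.quantile([sensitivity[n] for n in names], 1.0 - promote_frac)
    bits = {n: promoted_bits if sensitivity[n] >= cutoff else base_bits
            for n in names}
    avg_bits = sum(bits[n] * n_weights[n] for n in names) / sum(n_weights.values())
    return bits, avg_bits

# Hypothetical 32-layer model in which one layer is far more sensitive
# to low precision than the rest.
rng = np.random.default_rng(1)
sens = {f"layer{i}": s for i, s in enumerate(rng.uniform(0.1, 0.3, 32))}
sens["layer31"] = 1.5
sizes = {name: 50_000_000 for name in sens}

bits, avg = allocate_bits(sens, sizes)
print(f"promoted: {[n for n, b in bits.items() if b == 8]}, "
      f"avg bits/weight = {avg:.2f}")
```

Promoting a handful of layers from 6 to 8 bits moves the average only slightly above 6 bits per weight, which is the sense in which the extra precision "costs little in average weight size."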
The catch — and the paper names it — is that the method has to be hardware-aware. Codebook sizes are kept at powers of two; per-layer scales are stored in formats the kernel supports; the chosen layout maps onto an existing low-bit matmul implementation. None of this matters if the resulting kernel runs slower than the FP16 baseline. SOP's published numbers assume the kernel pipeline matches what current FP6/FP8 backends already support.
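A sketch of the kind of constraint check hardware-awareness implies. The specific constraints here (power-of-two codebook whose bit-width the kernel ships, a per-layer scale that survives an FP16 round-trip) are illustrative assumptions, not SOP's actual pipeline, and none of this checks the harder requirement that the resulting kernel is actually fast.

```python
import numpy as np

def kernel_compatible(codebook, scale, supported_bits=(4, 6, 8)):
    """Reject per-layer choices an existing low-bit matmul kernel could not
    serve: codebook sizes that are not powers of two, bit-widths the kernel
    does not ship, and scales that do not survive an FP16 round-trip (the
    format the kernel is assumed to load scales in)."""
    n = len(codebook)
    is_pow2 = n > 0 and (n & (n - 1)) == 0
    bits_ok = is_pow2 and int(np.log2(n)) in supported_bits
    scale_fp16 = np.float16(scale)
    scale_ok = np.isfinite(scale_fp16) and scale_fp16 != 0
    return bits_ok and bool(scale_ok)

# A 64-entry (6-bit) codebook with a sane scale passes; a 96-entry codebook
# or an FP16-overflowing scale is rejected before kernel dispatch.
print(kernel_compatible(np.zeros(64), 0.0123))  # True
print(kernel_compatible(np.zeros(96), 0.0123))  # False
print(kernel_compatible(np.zeros(64), 1e9))     # False
```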
Goes deeper in: LLM Internals → Quantization → Modern Methods