SOP paper — Hardware-aware per-layer PTQ at FP6
The news. On May 14, 2026, the SOP paper showed that a hardware-aware per-layer PTQ method, operating in the 4.5–6 bits-per-weight range, can beat a fixed FP8 codebook on reconstruction error across six open model families. The trick is to stop treating quantization as one global decision and instead search a small codebook per layer, weighted by activations and allowed to promote the most sensitive layers a few extra bits. Read the paper →
Picture the touch-up painter. The simple way is the one most people use: take a sample card with every shade in the catalog — 256 shades on a fan deck — and walk it around the house, matching each patch against the nearest swatch. That's vanilla FP8. It's safe, predictable, and works fine; the misses are small because the deck is dense. But the deck is bulky, and most rooms only use a handful of those shades anyway.
Now picture two cheaper alternatives. The first is the naive 64-shade card — the same shades in every room, but a quarter as many. That's vanilla FP6: smaller, but the colors get visibly off in any room with an unusual hue because the deck wasn't built for that room. A naive uniform FP6 quantizer tends to drift behind FP8 on accuracy for exactly this reason: the codebook can't cover the long tail of weight values that any single layer might care about — which is the gap SOP's per-layer codebook is built to close.
SOP is the second alternative. Bring a 64-shade card to each room, but pick the 64 shades fresh per room, based on a quick walk-through of which shades the walls actually use. The bedroom's deck is heavy on warm beiges. The kitchen's deck is heavy on cool grays. Selected high-traffic rooms, where the eye lands most, get a promoted, bigger card to keep highlights crisp. Most rooms hold to 64; a few rooms get more; the average across the house lands in the same territory as the 4.5–6 bits-per-weight budget the SOP paper studies. Because every room's deck was picked from that room's actual color distribution, the average match comes out closer than the universal 256-shade card managed, even though most rooms carry a quarter the swatches.
In model terms, that walk-through is the activation-weighted selection step: the calibration set tells SOP which weights see the most action in each layer, and those weights bias the per-layer codebook search. The promotion step is the paper's nod to the layer-sensitivity reality that the Quantization module covers: some layers are sharply more intolerant of low precision than others, and a uniform-bit budget either underserves those layers or overpays everywhere else. SOP shifts precision toward where it pays off.
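To make the selection step concrete, here is a minimal sketch of what an activation-weighted per-layer codebook fit could look like. It is not the paper's algorithm: the importance-weighted 1-D k-means, the `fit_layer_codebook` name, and the way importance scores enter are illustrative assumptions; SOP's actual codebook search may differ.

```python
import numpy as np

def fit_layer_codebook(weights, importance, n_entries=64, n_iters=25, seed=0):
    """Fit one layer's codebook with importance-weighted 1-D k-means (a sketch).

    weights:    flat array of the layer's weights
    importance: per-weight score derived from calibration activations
                (hypothetical stand-in for SOP's activation weighting)
    n_entries:  codebook size; a power of two so indices pack into whole bits
    """
    rng = np.random.default_rng(seed)
    probs = importance / importance.sum()
    # Initialize codewords by sampling weights where the importance mass sits.
    codebook = rng.choice(weights, size=n_entries, replace=False, p=probs)
    for _ in range(n_iters):
        # Assign every weight to its nearest codeword.
        # (Chunk this for real layer sizes; the full pairwise matrix is large.)
        idx = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each codeword to the importance-weighted mean of its cluster.
        for k in range(n_entries):
            mask = idx == k
            if mask.any() and importance[mask].sum() > 0:
                codebook[k] = np.average(weights[mask], weights=importance[mask])
    # Final assignment against the updated codebook.
    idx = np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx
```

Dequantization is then a table lookup, `codebook[idx]`, with each index stored in log2(n_entries) bits, which is where the 64-entry codebook lands at 6 bits per weight.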
Why FP6 beats FP8 on reconstruction even at fewer bits
A useful intuition: reconstruction error is driven by how densely codebook entries cover each weight's local neighborhood, not by the total number of entries globally. A fixed FP8 codebook has to span the full dynamic range every layer might produce, so within any single layer's actual range only a fraction of those 256 slots end up well-placed. SOP's per-layer codebook concentrates all 64 entries around the values that layer actually produces. The local density is higher — and local density is what reconstruction error tends to depend on.
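A toy numeric check of that intuition, with uniform grids standing in for codebooks. The layer scales below are made up, real FP8 spacing is non-uniform, and SOP's searched codebook is not a uniform grid; the sketch only illustrates the local-density argument.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmse(values, codebook):
    """Normalized MSE when each value snaps to its nearest codebook entry."""
    idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
    err = values - codebook[idx]
    return float(np.mean(err ** 2) / np.var(values))

# Toy "layers" whose weight scales differ, as they do across transformer blocks.
layers = {s: rng.normal(0.0, s, 20_000) for s in (0.01, 0.05, 0.25)}

# A shared 256-entry grid must span the widest range any layer produces.
full_range = max(np.abs(w).max() for w in layers.values())
shared_256 = np.linspace(-full_range, full_range, 256)

for scale, w in layers.items():
    # A per-layer 64-entry grid covers only this layer's own range.
    local_64 = np.linspace(w.min(), w.max(), 64)
    print(f"layer scale {scale}: shared-256 NMSE {nmse(w, shared_256):.1e}, "
          f"per-layer-64 NMSE {nmse(w, local_64):.1e}")
```

The gap is largest for the narrow layers, because most of the shared grid's 256 slots sit outside the range those layers ever use; the widest layer, whose own range set the shared grid, is the one place a dense shared grid still holds its own, which is part of why selective promotion exists.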
Layer promotion handles the corner case where uniform low-bit quantization loses quality through the layers most sensitive to it. Spending an extra 1–2 bits on those selected layers costs little in average weight size but recovers much of the outlier-driven quality loss that a uniform low-bit budget would otherwise incur. The result, across six open model families: FP6 with SOP comes in below fixed FP8 on reconstruction error, with a 1.5-bit-per-weight storage win.
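The budget arithmetic is easy to picture with a small sketch. The promotion rule (top few percent by sensitivity), the promoted bit-width, and the sensitivity scores below are assumptions for illustration, not the paper's actual criterion.

```python
import numpy as np

def allocate_bits(sensitivity, n_weights, base_bits=6, promoted_bits=8,
                  promote_frac=0.05):
    """Promote the most sensitive layers to a wider codebook and report the
    average bits per weight the mixed allocation actually costs."""
    names = list(sensitivity)
    cutoff = np.quantile([sensitivity[n] for n in names], 1.0 - promote_frac)
    bits = {n: promoted_bits if sensitivity[n] >= cutoff else base_bits
            for n in names}
    avg_bits = sum(bits[n] * n_weights[n] for n in names) / sum(n_weights.values())
    return bits, avg_bits

# Hypothetical 32-layer model in which one layer is far more sensitive
# to low precision than the rest.
rng = np.random.default_rng(1)
sens = {f"layer{i}": s for i, s in enumerate(rng.uniform(0.1, 0.3, 32))}
sens["layer31"] = 1.5
sizes = {name: 50_000_000 for name in sens}

bits, avg = allocate_bits(sens, sizes)
print(f"promoted: {[n for n, b in bits.items() if b == 8]}, "
      f"avg bits/weight = {avg:.2f}")
```

Promoting a handful of layers from 6 to 8 bits moves the average only slightly above 6 bits per weight, which is the sense in which the extra precision "costs little in average weight size."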
The catch — and the paper names it — is that the method has to be hardware-aware. Codebook sizes are kept at powers of two; per-layer scales are stored in formats the kernel supports; the chosen layout maps onto an existing low-bit matmul implementation. None of this matters if the resulting kernel runs slower than the FP16 baseline. SOP's published numbers assume the kernel pipeline matches what current FP6/FP8 backends already support.
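A sketch of the kind of constraint check hardware-awareness implies. The specific constraints here (power-of-two codebook whose bit-width the kernel ships, a per-layer scale that survives an FP16 round-trip) are illustrative assumptions, not SOP's actual pipeline, and none of this checks the harder requirement that the resulting kernel is actually fast.

```python
import numpy as np

def kernel_compatible(codebook, scale, supported_bits=(4, 6, 8)):
    """Reject per-layer choices an existing low-bit matmul kernel could not
    serve: codebook sizes that are not powers of two, bit-widths the kernel
    does not ship, and scales that do not survive an FP16 round-trip (the
    format the kernel is assumed to load scales in)."""
    n = len(codebook)
    is_pow2 = n > 0 and (n & (n - 1)) == 0
    bits_ok = is_pow2 and int(np.log2(n)) in supported_bits
    scale_fp16 = np.float16(scale)
    scale_ok = np.isfinite(scale_fp16) and scale_fp16 != 0
    return bits_ok and bool(scale_ok)

# A 64-entry (6-bit) codebook with a sane scale passes; a 96-entry codebook
# or an FP16-overflowing scale is rejected before kernel dispatch.
print(kernel_compatible(np.zeros(64), 0.0123))  # True
print(kernel_compatible(np.zeros(96), 0.0123))  # False
print(kernel_compatible(np.zeros(64), 1e9))     # False
```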
Goes deeper in: LLM Internals → Quantization → Modern Methods