The news. On June 18, 2026, inclusionAI posted Rethinking Shrinkage Bias in LLM FP4 Pretraining (arXiv 2606.20381). Its claim is uncomfortable for the whole industry's roadmap: the E2M1 4-bit float that NVIDIA Blackwell/Rubin-class and AMD MI350-series chips are built to multiply has a flaw baked into the format itself. Because E2M1's representable values are geometrically asymmetric, ordinary rounding introduces a systematic negative error, called "shrinkage bias", that accumulates multiplicatively across layers. Their fix, UFP4, reports lower BF16-relative loss degradation than E2M1 baselines from a 1.5B dense model up to a 124B-parameter MoE.
Picture the ruler. To store a number in E2M1 you must snap it to the nearest tick, and the whole format has only sixteen codes (eight magnitudes per sign, counting zero). But this ruler is strange: the ticks are packed tightly near zero and spread farther apart the higher you go. Near 0.5 the gap between ticks is 0.5; past 4 the gap is 2, four times wider. Now round the values that fall in one of those wide upper gaps. Round-to-nearest sends everything below the gap's midpoint down and everything above it up. Because values thin out as magnitude grows, more of them sit in the lower half of the gap than in the upper half, so more are pulled down than pushed up, and on average the values drift a hair toward zero. Do that for one weight and it is nothing. Do it for every weight in the matrix and the whole tensor quietly deflates.
That is shrinkage bias in one picture, and the reason it is dangerous is that it is consistent. A random rounding error would scatter above and below the true value and mostly cancel out. This error is biased in one direction (on average it points down), so it does not cancel; it compounds, layer after layer, the same way a small percentage shaved off a balance every day eats a fortune. E2M1's bit layout is what forces those uneven ticks: one sign bit, two exponent bits, and a single mantissa bit, so the exponent (which sets the spacing) dominates and the lone mantissa bit can do almost nothing to fill the widening gaps.
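To make the bit layout concrete, here is a minimal sketch that decodes all eight non-negative E2M1 codes and prints the gaps between neighbouring ticks. It assumes the OCP microscaling conventions (exponent bias of 1, and a subnormal when the exponent field is zero); the function name `decode_e2m1` is ours, not taken from the paper.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the only values are 0 and 0.5.
        return sign * man * 0.5
    # Normal: implicit leading 1, exponent bias of 1 (assumed, per OCP MX).
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# Codes 0..7 have the sign bit clear, so these are the magnitudes per sign.
magnitudes = [decode_e2m1(code) for code in range(8)]
gaps = [hi - lo for lo, hi in zip(magnitudes, magnitudes[1:])]

print("magnitudes:", magnitudes)
print("gaps:      ", gaps)
```

The single mantissa bit only ever places one extra tick halfway between two powers of two, which is why the gap doubles every time the exponent field increments.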
UFP4's first move attacks where the bias bites hardest: the outliers stranded out in the wide bins. It applies a Random Hadamard Transform to all three training matrix multiplies — the forward pass and both backward passes. The Hadamard mix is an orthogonal rotation that smears a handful of large outlier values across many channels, so no single value sits alone out where the ticks are widest and the rounding step is largest. Spread the load and the worst-case shrink on any one number shrinks too.
Figure: outliers force the INT8 range to span -60 to 60, so most of the 256 grid slots fall on empty space.
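The outlier-spreading effect is easy to see on a single vector. The sketch below is a toy illustration, not the paper's kernel: it assumes a length-16 vector, a fixed random sign flip followed by an orthonormal fast Walsh-Hadamard transform, and helper names (`fwht`, `random_hadamard`) that are ours. It shows two things: the largest magnitude drops after the rotation, and rotating both operands of a dot product leaves the result unchanged, which is why the transform can be wrapped around a matrix multiply.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform an orthogonal rotation
    return [v * scale for v in y]


def random_hadamard(x, signs):
    """Flip signs with a fixed random +/-1 vector, then apply the Hadamard mix."""
    return fwht([s * v for s, v in zip(signs, x)])


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


rng = random.Random(0)
n = 16
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

x = [0.1] * n
x[3] = 6.0  # a single outlier sitting in the widest bin
w = [rng.gauss(0.0, 1.0) for _ in range(n)]

x_rot = random_hadamard(x, signs)
w_rot = random_hadamard(w, signs)

print("max |x| before:", max(abs(v) for v in x))
print("max |x| after: ", max(abs(v) for v in x_rot))
print("dot before:", dot(x, w))
print("dot after: ", dot(x_rot, w_rot))
```

Because the rotation preserves the vector's norm, the outlier's energy is not removed, only shared across all sixteen positions, so each rotated value lands in a narrower bin of the grid.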
The second move fixes the rounding rule itself — but only where it matters. Round-to-nearest is what creates the directional bias, so UFP4 swaps in stochastic rounding: round up or down by a weighted coin flip whose odds match how close the value sits to each tick, so that the expected rounded value equals the true value and the systematic shrink disappears. The subtlety is that stochastic rounding adds noise, and you do not want extra noise on your weights — so UFP4 restricts it to the gradient computation (dY) only, where unbiasedness buys more than the noise costs, and leaves the forward weights on plain rounding. One transform to tame the outliers, one unbiased round on the gradients, and the format's built-in deflation is removed without changing the hardware.
| Rounding strategy | What it does on the E2M1 grid | Bias | UFP4 uses it |
|---|---|---|---|
| Round-to-nearest (baseline) | snap to the closest tick | systematic shrink toward zero | No — this is the problem |
| Stochastic rounding everywhere | weighted coin-flip round | unbiased, but noisy on weights | Partly — too much noise if applied to all tensors |
| Random Hadamard Transform | rotate to spread outliers | shrinks the worst-case rounding step | Yes — on the forward, dgrad, and wgrad GEMMs |
| UFP4 (RHT + SR on dY) | mix outliers, then round gradients unbiased | removed | the full recipe |
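The difference between the first two rows of the table can be checked numerically. The following is a minimal sketch of round-to-nearest versus stochastic rounding on the E2M1 magnitudes, assuming a single fixed scale and saturation at 6; the function names are ours, and ties in `round_nearest` go to the lower tick as a simplification rather than following any particular hardware rule.

```python
import bisect
import random

GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes per sign


def round_nearest(x):
    """Snap to the closest tick (ties go to the lower tick in this sketch)."""
    mag = min(abs(x), GRID[-1])
    q = min(GRID, key=lambda tick: abs(tick - mag))
    return q if x >= 0 else -q


def round_stochastic(x, rng):
    """Round up with probability equal to the fractional position in the bin."""
    mag = min(abs(x), GRID[-1])
    hi_idx = bisect.bisect_left(GRID, mag)
    if GRID[hi_idx] == mag:
        q = mag  # already on a tick, nothing to round
    else:
        lo, hi = GRID[hi_idx - 1], GRID[hi_idx]
        p_up = (mag - lo) / (hi - lo)
        q = hi if rng.random() < p_up else lo
    return q if x >= 0 else -q


rng = random.Random(0)
value = 4.9
samples = [round_stochastic(value, rng) for _ in range(50_000)]
sr_mean = sum(samples) / len(samples)

print("true value:        ", value)
print("round-to-nearest:  ", round_nearest(value))
print("mean of stochastic:", round(sr_mean, 3))
```

Each stochastic sample is still either 4.0 or 6.0, so any single rounded value is noisier than the round-to-nearest result; only the average is unbiased. That trade-off is the reason the recipe confines stochastic rounding to the gradient tensor.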
Why the lopsided ruler shrinks values
Hold the format fixed and read off its actual ticks. Per sign, E2M1 can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 (the documented OCP microscaling E2M1 grid). Look at the gaps: from 0 to 2 the ticks are 0.5 apart, from 2 to 4 they jump to 1.0 apart, and the top bin from 4 to 6 is 2.0 wide, so the bin near 5 is 4× wider than the bin near 1. A value of 4.9 has its nearest tick at 4.0 (a step of 0.9) and its next tick at 6.0 (a step of 1.1), so it rounds down and loses 0.9 of magnitude; a value of 1.1 rounds to 1.0 and loses only 0.1. Real weights pile up at small magnitudes, where the ruler is fine, and thin out into the wide upper bins, so within each wide bin more values sit below the midpoint (and round down) than above it (and round up). The down-rounds outweigh the up-rounds, and the tensor's average magnitude falls. (The per-value steps here are illustrative of the mechanism; the paper's headline result is that UFP4 holds lower BF16-relative loss degradation than E2M1 baselines across 1.5B dense, 7.9B MoE, and 124B-parameter MoE scales.)
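A small experiment makes the deflation visible. This is a toy setup under our own assumptions, not the paper's measurement: magnitudes are drawn from a zero-mean Gaussian with standard deviation 2 (so roughly three standard deviations reach the top tick at 6), a single fixed scale is used instead of per-block scaling, and values above 6 saturate. It compares the average magnitude before and after round-to-nearest.

```python
import random

GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes per sign


def rtn_magnitude(mag):
    """Round a non-negative magnitude to the nearest tick, saturating at 6."""
    mag = min(mag, GRID[-1])
    return min(GRID, key=lambda tick: abs(tick - mag))


rng = random.Random(0)
# Assumed distribution: |N(0, 2^2)|, dense near zero and thinning toward 6.
mags = [abs(rng.gauss(0.0, 2.0)) for _ in range(100_000)]
quantized = [rtn_magnitude(m) for m in mags]

mean_true = sum(mags) / len(mags)
mean_q = sum(quantized) / len(quantized)

print(f"mean |x| before rounding: {mean_true:.4f}")
print(f"mean |x| after rounding:  {mean_q:.4f}")
print(f"relative change:          {(mean_q - mean_true) / mean_true:+.2%}")
```

The per-tensor change in this toy is small, which matches the "drifts a hair" picture; the concern raised in the article is that the same one-directional error is applied again at every layer.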
Related explainers
- LongLive-2.0 — NVFP4 W4A4 across training and inference — another FP4 training story; UFP4 explains why the naive version of that path degrades and how to stop it.
- KVarN — Hadamard rotation for a 2-bit KV cache — the same Hadamard outlier-spreading trick, aimed at the KV cache instead of training GEMMs.
- Gemma 4 QAT — Quantization-Aware Training — the other way to make low-bit training honest: teach the model about the grid while it trains.