The news. On June 18, 2026, inclusionAI posted Rethinking Shrinkage Bias in LLM FP4 Pretraining (arXiv 2606.20381). Its claim is uncomfortable for the whole industry's roadmap: the E2M1 4-bit float that NVIDIA Blackwell/Rubin-class and AMD MI350-series chips are built to multiply has a flaw baked into the format itself. Because E2M1's representable values are geometrically asymmetric, ordinary rounding introduces a systematic negative error ("shrinkage bias") that accumulates multiplicatively across layers. Their fix, UFP4, reports lower BF16-relative loss degradation than E2M1 baselines from a 1.5B dense model up to a 124B-parameter MoE.

Picture the ruler. To store a number in E2M1 you must snap it to the nearest tick, and the whole format has only sixteen bit patterns (fifteen distinct values, because +0 and -0 coincide). But this ruler is strange: the ticks are packed tightly near zero and spread farther apart the higher you go. Near 0.5 the gap between ticks is 0.5; between 4 and 6 it is 2.0, four times wider. Now round the values that sit in one of those wide upper gaps. Real tensors hold fewer values the farther out you go, so more of them sit in the lower half of the gap, closer to the tick below, than in the upper half. Snap each one to its nearest tick and more of them land on the lower tick than on the upper one, so on average the values drift a hair toward zero. Do that for one weight and it is nothing. Do it for every weight in the matrix and the whole tensor quietly deflates.

That is shrinkage bias in one picture, and the reason it is dangerous is that it is consistent. A random rounding error would scatter above and below the true value and mostly cancel out. This error is biased in one direction — on average it points down — so it does not cancel; it compounds, layer over layer, the same way a small percentage shaved off a balance every day eats a fortune. E2M1's bit layout is what forces those uneven ticks: one sign bit, two exponent bits, and a single mantissa bit, so the exponent (which sets the spacing) dominates and the lone mantissa bit can do almost nothing to fill the widening gaps.
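To make the bit layout concrete, here is a minimal sketch that decodes all eight non-negative E2M1 codes and prints the gaps between neighbouring ticks. The function name `decode_e2m1` is hypothetical; the decoding rule (exponent bias of 1, with exponent 0 treated as subnormal) follows the OCP microscaling definition of E2M1 and is not taken from the paper.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code laid out as sign | 2 exponent bits | 1 mantissa bit."""
    if not 0 <= code <= 0b1111:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the only magnitudes are 0 and 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1 and an exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# Codes 0..7 have the sign bit clear, so they enumerate the positive half of the ruler.
positive_ticks = [decode_e2m1(code) for code in range(8)]
gaps = [hi - lo for lo, hi in zip(positive_ticks, positive_ticks[1:])]

print(positive_ticks)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(gaps)            # [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 2.0]
```

Each increment of the exponent doubles the gap, and the single mantissa bit can only place one extra tick halfway through each doubling, which is why the gaps go 0.5, 1.0, 2.0.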

Figure: bit layout by format. Exponent bits determine range; mantissa bits determine precision.

| Format | Exponent bits | Mantissa bits | Notes |
|---|---|---|---|
| FP32 | 8 | 23 | Full range + full precision |
| TF32 | 8 | 10 | TensorFloat-32: FP32 range, FP16 precision, 8× throughput |
| BF16 | 8 | 7 | Same range as FP32, less precision |
| FP16 | 5 | 10 | Smaller exponent, can overflow |
| FP8 E4M3 | 4 | 3 | Forward pass: more precision, range ±448 |
| FP8 E5M2 | 5 | 2 | Backward pass: wider range ±57,344, less precision |
| INT8 | n/a | n/a | 8-bit integer value: 256 levels, no float encoding |

UFP4's first move attacks where the bias bites hardest: the outliers stranded out in the wide bins. It applies a Random Hadamard Transform to all three training matrix multiplies — the forward pass and both backward passes. The Hadamard mix is an orthogonal rotation that smears a handful of large outlier values across many channels, so no single value sits alone out where the ticks are widest and the rounding step is largest. Spread the load and the worst-case shrink on any one number shrinks too.
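The outlier-spreading effect can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's kernel: the function name `random_hadamard_transform`, the vector length of 16, and the single outlier of 8.0 are all made up for the example, and the transform is applied to one vector rather than to GEMM operands.

```python
import math
import random


def random_hadamard_transform(x, signs):
    """Return H @ (signs * x) / sqrt(n), computed with a fast Walsh-Hadamard transform."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = [s * v for s, v in zip(signs, x)]  # random sign flip per channel
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # normalisation makes the transform orthogonal
    return [v * scale for v in y]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


rng = random.Random(0)
n = 16
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

x = [0.1] * n
x[3] = 8.0  # one outlier channel (illustrative value)

y = random_hadamard_transform(x, signs)
print(max(abs(t) for t in x), max(abs(t) for t in y))  # the peak drops from 8.0 to below 3
print(norm(x), norm(y))                                # the length is preserved
```

The outlier contributes 8/4 = 2 to every output channel, so no single value is left alone in the widest bin, while the vector's overall length is unchanged because the rotation is orthogonal.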

Figure: weight value distribution on an axis from -60 to 60, with the main distribution (99.9%) near zero and a few outliers (0.1%). The outliers force the INT8 range to span -60 to 60, so most of the 256 grid slots fall on empty space.

The second move fixes the rounding rule itself — but only where it matters. Round-to-nearest is what creates the directional bias, so UFP4 swaps in stochastic rounding: round up or down by a weighted coin flip whose odds match how close the value sits to each tick, so that the expected rounded value equals the true value and the systematic shrink disappears. The subtlety is that stochastic rounding adds noise, and you do not want extra noise on your weights — so UFP4 restricts it to the gradient computation (dY) only, where unbiasedness buys more than the noise costs, and leaves the forward weights on plain rounding. One transform to tame the outliers, one unbiased round on the gradients, and the format's built-in deflation is removed without changing the hardware.
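The weighted coin flip can be written down directly. The sketch below is an assumption-laden illustration of the rounding rule only: the names `E2M1_MAGNITUDES`, `round_nearest`, and `stochastic_round` are hypothetical, values above 6 are simply saturated, exact ties in `round_nearest` go to the lower tick for simplicity, and nothing here models where UFP4 applies the rule inside training.

```python
import bisect
import random

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(x):
    """Snap x to the closest E2M1 value (ties go to the lower tick in this sketch)."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_MAGNITUDES[-1])
    return sign * min(E2M1_MAGNITUDES, key=lambda t: abs(t - mag))


def stochastic_round(x, rng):
    """Round up with probability equal to the fractional position inside the gap."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_MAGNITUDES[-1])  # saturate at the largest tick
    hi_idx = bisect.bisect_left(E2M1_MAGNITUDES, mag)
    if E2M1_MAGNITUDES[hi_idx] == mag:
        return sign * mag  # already on a tick
    lo, hi = E2M1_MAGNITUDES[hi_idx - 1], E2M1_MAGNITUDES[hi_idx]
    p_up = (mag - lo) / (hi - lo)
    return sign * (hi if rng.random() < p_up else lo)


rng = random.Random(0)
x = 4.9
samples = [stochastic_round(x, rng) for _ in range(100_000)]
sr_mean = sum(samples) / len(samples)

print(round_nearest(x))  # 4.0
print(sr_mean)           # close to 4.9
```

For 4.9 the coin lands on 6.0 with probability 0.45 and on 4.0 with probability 0.55, so the expectation is 4 + 2 × 0.45 = 4.9. Each individual sample is off by 0.9 or 1.1, which is the noise the paragraph above refers to.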

| Rounding strategy | What it does on the E2M1 grid | Bias | UFP4 uses it |
|---|---|---|---|
| Round-to-nearest (baseline) | snap to the closest tick | systematic shrink toward zero | No: this is the problem |
| Stochastic rounding everywhere | weighted coin-flip round | unbiased, but noisy on weights | Partly: too much noise if applied to all tensors |
| Random Hadamard Transform | rotate to spread outliers | shrinks the worst-case rounding step | Yes: on the forward, dgrad, and wgrad GEMMs |
| UFP4 (RHT + SR on dY) | mix outliers, then round gradients unbiased | removed | the full recipe |

Why the lopsided ruler shrinks values

Hold the format fixed and read off its actual ticks. Per sign, E2M1 can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 (the documented OCP microscaling E2M1 grid). Look at the gaps: from 0 to 2 the ticks are 0.5 apart, from 2 to 4 they jump to 1.0 apart, and the top bin from 4 to 6 is 2.0 wide, so the bin near 5 is 4× wider than the bin near 1. A value of 4.9 has its nearest tick at 4.0 (a step of 0.9) and its next tick at 6.0 (a step of 1.1), so it rounds down and loses 0.9 of magnitude; a value of 1.1 rounds to 1.0 and loses only 0.1. Real weights pile up at small magnitudes and thin out toward the wide upper bins, so within each bin more values fall below the midpoint and round down than fall above it and round up, and the imbalance costs the most in the widest bins, where each step is largest. The down-rounds outweigh the up-rounds, so the tensor's average magnitude falls. (The per-value steps here are illustrative of the mechanism; the paper's headline result is that UFP4 holds lower BF16-relative loss degradation than E2M1 baselines across 1.5B dense, 7.9B MoE, and 124B-parameter MoE scales.)
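The mechanism can be checked numerically on a toy distribution. The sketch below is illustrative only: the exponential distribution with mean 1.5, the sample count, and the names `E2M1_MAGNITUDES` and `round_nearest` are assumptions made for the example, not values from the paper. Samples above 6 are discarded so that clipping plays no part in the result.

```python
import random

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(mag):
    """Nearest E2M1 magnitude; exact ties go to the lower tick in this sketch."""
    return min(E2M1_MAGNITUDES, key=lambda t: abs(t - mag))


rng = random.Random(0)

# Assumed toy data: magnitudes whose density decays as they grow,
# restricted to the representable range [0, 6].
magnitudes = []
while len(magnitudes) < 100_000:
    m = rng.expovariate(1 / 1.5)
    if m <= 6.0:
        magnitudes.append(m)

errors = [round_nearest(m) - m for m in magnitudes]
mean_error = sum(errors) / len(errors)
rounded_down = sum(e < 0 for e in errors)
rounded_up = sum(e > 0 for e in errors)

print(round_nearest(4.9), round_nearest(1.1))  # 4.0 1.0
print(mean_error)                # negative for this decaying distribution
print(rounded_down, rounded_up)  # more values round down than up
```

With a flat distribution inside each bin the up-rounds and down-rounds would balance; it is the decaying density that tips round-to-nearest toward the lower tick.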


Frequently Asked Questions