What is shrinkage bias in E2M1 FP4 pretraining?

Shrinkage bias is a systematic negative rounding error in the E2M1 4-bit float format. E2M1's representable values are spaced unevenly: close together near zero and far apart at larger magnitudes. Rounding a value to the nearest representable tick therefore lands on the lower tick more often than the higher one, and magnitudes drift toward zero. Because the error points in one direction instead of cancelling out, it accumulates multiplicatively across a network's layers, and the paper (arXiv 2606.20381) shows it degrades the tested FP4-pretrained models relative to a BF16 baseline.

How does UFP4 remove the shrinkage bias?

UFP4 makes two changes without touching the hardware. First, it applies a Random Hadamard Transform to all three training matrix multiplies (forward, dgrad, wgrad), which spreads a few large outlier values across many channels so none is stranded out where E2M1's bins are widest. Second, it replaces round-to-nearest with stochastic rounding — a weighted coin-flip round whose expected value equals the true value, so the systematic shrink disappears — but restricts that stochastic rounding to the gradient computation (dY) only, where unbiasedness is worth more than the extra noise.

Why does FP4 shrinkage bias matter for next-generation hardware?

E2M1 is the FP4 format that NVIDIA Blackwell/Rubin-class and AMD MI350-series accelerators are designed to multiply natively, and FP4 is the headline precision for training the next wave of large models. If the format itself rounds with a downward bias, a model pretrained in FP4 can silently lose quality versus a 16-bit baseline. Naming the bias and giving a cheap, hardware-compatible fix — validated from a 1.5B dense model up to a 124B-parameter MoE — is a meaningful step toward trustworthy FP4 pretraining, instead of defaulting to more expensive 16-bit training.

UFP4 fixes FP4 pretraining's shrinkage bias — E2M1 shrinkage bias

Jargon

FP4 / E2M1: FP4 is a 4-bit floating-point number. E2M1 is its layout (1 sign bit, 2 exponent bits, 1 mantissa bit), which has only 16 bit patterns and 15 distinct values, because +0 and -0 are the same number. It is the format the newest training hardware multiplies natively. See precision formats.
Shrinkage bias: The paper's name for a systematic negative rounding error: because E2M1's representable values are spaced unevenly, rounding pulls magnitudes toward zero on average rather than scattering above and below.
Representable bins: The fixed set of values a format can store. Between two of them is a bin; any real number gets snapped to the nearer edge. In E2M1 the bins get geometrically wider as the magnitude grows.
Stochastic rounding (SR): Instead of always rounding to the nearest value, round up or down by a weighted coin flip so the expected result equals the true value. It trades a tiny bit of noise for an unbiased average. See the quantization process.
Random Hadamard Transform (RHT): An orthogonal "mixing" multiply that spreads a few large outlier values across many channels, so no single value is stranded out where the bins are widest. The same family of rotation trick shows up in KV-cache compression.
The three training GEMMs: One training step runs three matrix multiplies: the forward pass, dgrad (gradient w.r.t. the inputs), and wgrad (gradient w.r.t. the weights). UFP4 applies the Hadamard mix to all three.
BF16-relative loss degradation: How much worse the FP4-trained model's loss is compared with the same model trained in 16-bit BF16. Lower is better; the goal of low-bit training is to make this gap vanish.

The news. On June 18, 2026, inclusionAI posted Rethinking Shrinkage Bias in LLM FP4 Pretraining (arXiv 2606.20381). Its claim is uncomfortable for the whole industry's roadmap: the E2M1 4-bit float that NVIDIA Blackwell/Rubin-class and AMD MI350-series chips are built to multiply has a flaw baked into the format itself. Because E2M1's representable values are spaced unevenly, with gaps that widen as magnitude grows, ordinary rounding introduces a systematic negative error, which the paper calls "shrinkage bias", that accumulates multiplicatively across layers. Their fix, UFP4, reports lower BF16-relative loss degradation than E2M1 baselines from a 1.5B dense model up to a 124B-parameter MoE.

Picture the ruler. To store a number in E2M1 you must snap it to the nearest tick, and the whole format has only sixteen codes: eight magnitudes, each with a sign. But this ruler is strange: the ticks are packed tightly near zero and spread farther apart the higher you go. Near 0.5 the gap between ticks is small; out past 4 the gap is four times wider. Now round a value that sits in one of those wide upper gaps. Real tensors have more values at small magnitudes than at large ones, so within any gap more of them sit in the lower half than in the upper half. Snap each to its nearest tick and more land on the lower tick than the higher one, so on average the values drift a hair toward zero. Do that for one weight and it is nothing. Do it for every weight in the matrix and the whole tensor quietly deflates.

That is shrinkage bias in one picture, and the reason it is dangerous is that it is consistent. A random rounding error would scatter above and below the true value and mostly cancel out. This error is biased in one direction — on average it points down — so it does not cancel; it compounds, layer over layer, the same way a small percentage shaved off a balance every day eats a fortune. E2M1's bit layout is what forces those uneven ticks: one sign bit, two exponent bits, and a single mantissa bit, so the exponent (which sets the spacing) dominates and the lone mantissa bit can do almost nothing to fill the widening gaps.
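To make the compounding concrete, here is a minimal sketch under a deliberately simplified assumption: every layer multiplies magnitudes by the same factor (1 - shrink), with no normalization layers to undo it. The function name and the 1% figure are hypothetical, chosen only to show the shape of the effect; they are not numbers from the paper.

```python
def compounded_scale(per_layer_shrink: float, num_layers: int) -> float:
    """Scale left after each layer multiplies magnitudes by (1 - per_layer_shrink)."""
    return (1.0 - per_layer_shrink) ** num_layers


# Hypothetical per-layer shrink of 1%, evaluated at a few depths.
for layers in (1, 10, 100):
    print(layers, round(compounded_scale(0.01, layers), 4))
```

Under these assumptions a 1% per-layer shrink leaves roughly 37% of the original magnitude after 100 layers, which is why a small one-directional error is more damaging than a larger error that averages to zero.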

UFP4's first move attacks where the bias bites hardest: the outliers stranded out in the wide bins. It applies a Random Hadamard Transform to all three training matrix multiplies — the forward pass and both backward passes. The Hadamard mix is an orthogonal rotation that smears a handful of large outlier values across many channels, so no single value sits alone out where the ticks are widest and the rounding step is largest. Spread the load and the worst-case shrink on any one number shrinks too.
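The outlier-spreading idea can be sketched in a few lines of plain Python. This uses a common construction (random sign flips followed by an orthonormal Walsh-Hadamard transform); the text above does not spell out the paper's exact variant or block size, so treat the function names, the vector length of 16, and the seed as assumptions for illustration.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


def random_signs(n, seed=0):
    """Random +/-1 diagonal, fixed by a seed so both GEMM operands share it."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rht(values, signs):
    """Random Hadamard Transform: flip signs, then apply the Hadamard mix."""
    return fwht([s * v for s, v in zip(signs, values)])


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


signs = random_signs(16)
x = [8.0] + [0.1] * 15                             # one large outlier
w = [0.25 * ((i % 4) - 1.5) for i in range(16)]    # arbitrary second operand

x_rot, w_rot = rht(x, signs), rht(w, signs)
print("max |x| before:", max(abs(v) for v in x))
print("max |x| after: ", max(abs(v) for v in x_rot))
print("dot before:", dot(x, w), " dot after:", dot(x_rot, w_rot))
```

Two properties matter here. The largest magnitude drops because the outlier's energy is shared across all 16 entries, and the dot product is unchanged (up to floating-point error) because the same orthogonal rotation is applied to both operands, which is what lets the transform wrap a matrix multiply without changing its exact-arithmetic result.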

Figure: outliers force the INT8 range to span –60 to 60, so most of the 256 grid slots fall on empty space.

The second move fixes the rounding rule itself — but only where it matters. Round-to-nearest is what creates the directional bias, so UFP4 swaps in stochastic rounding: round up or down by a weighted coin flip whose odds match how close the value sits to each tick, so that the expected rounded value equals the true value and the systematic shrink disappears. The subtlety is that stochastic rounding adds noise, and you do not want extra noise on your weights — so UFP4 restricts it to the gradient computation (dY) only, where unbiasedness buys more than the noise costs, and leaves the forward weights on plain rounding. One transform to tame the outliers, one unbiased round on the gradients, and the format's built-in deflation is removed without changing the hardware.
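The weighted coin flip can be written down directly. The sketch below is an illustration on the E2M1 magnitude grid, not the paper's kernel: the helper names are hypothetical, values beyond 6 are simply clipped (real pipelines scale blocks so values fit the grid), and it does not model UFP4's choice to apply this rounding to dY only.

```python
import bisect
import random

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def neighbours(magnitude):
    """Grid ticks just below and just above a magnitude (clipped to 6)."""
    m = min(magnitude, E2M1_MAGNITUDES[-1])
    hi_index = bisect.bisect_left(E2M1_MAGNITUDES, m)
    if E2M1_MAGNITUDES[hi_index] == m:
        return m, m  # already on a tick
    return E2M1_MAGNITUDES[hi_index - 1], E2M1_MAGNITUDES[hi_index]


def prob_up(magnitude):
    """Chance of rounding up: the fraction of the gap already covered."""
    lo, hi = neighbours(magnitude)
    if hi == lo:
        return 0.0
    return (min(magnitude, E2M1_MAGNITUDES[-1]) - lo) / (hi - lo)


def stochastic_round(x, rng):
    """Round x to a neighbouring tick with odds that make the mean equal x."""
    sign = -1.0 if x < 0 else 1.0
    lo, hi = neighbours(abs(x))
    return sign * (hi if rng.random() < prob_up(abs(x)) else lo)


def expected_stochastic_round(x):
    """Exact expectation of stochastic_round(x), computed without sampling."""
    sign = -1.0 if x < 0 else 1.0
    lo, hi = neighbours(abs(x))
    p = prob_up(abs(x))
    return sign * ((1.0 - p) * lo + p * hi)


rng = random.Random(0)
samples = [stochastic_round(4.9, rng) for _ in range(20000)]
print("expected:", expected_stochastic_round(4.9))
print("sample mean:", sum(samples) / len(samples))
```

For 4.9 the odds of rounding up to 6.0 are 0.9 / 2.0 = 0.45, so the expectation is 0.55 × 4.0 + 0.45 × 6.0 = 4.9. Each individual result is still either 4.0 or 6.0, which is the noise the paragraph above describes.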

Technique	What it does on the E2M1 grid	Bias	Does UFP4 use it?
Round-to-nearest (baseline)	snap to the closest tick	systematic shrink toward zero	No: this is the problem
Stochastic rounding everywhere	weighted coin-flip round	unbiased, but noisy on weights	No: too much noise if applied to all tensors; UFP4 keeps it on dY only
Random Hadamard Transform	rotate to spread outliers	shrinks the worst-case rounding step	Yes: on the forward, dgrad, and wgrad GEMMs
UFP4 (RHT + SR on dY)	mix outliers, then round gradients unbiased	removed	Yes: this is the full recipe

Why the lopsided ruler shrinks values

Hold the format fixed and read off its actual ticks. Per sign, E2M1 can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 (the documented OCP microscaling E2M1 grid). Look at the gaps: from 0 to 2 the ticks are 0.5 apart, from 2 to 4 they jump to 1.0 apart, and the top bin from 4 to 6 is 2.0 wide, so the bin near 5 is 4× wider than the bin near 1. A value of 4.9 is 0.9 from the tick at 4.0 and 1.1 from the tick at 6.0, so it rounds down and loses 0.9 of magnitude; a value of 1.1 rounds to 1.0 and loses only 0.1. Round-to-nearest sends the lower half of every bin down and the upper half up. Because real weights pile up at small magnitudes and thin out toward larger ones, the lower half of each bin holds more values than the upper half, and the effect is largest in the wide upper bins where each step is biggest. The down-rounds therefore outweigh the up-rounds, and the tensor's average magnitude falls. (The per-value steps here are illustrative of the mechanism; the paper's headline result is that UFP4 holds lower BF16-relative loss degradation than E2M1 baselines across 1.5B dense, 7.9B MoE, and 124B-parameter MoE scales.)
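The arithmetic above can be checked with a short sketch. The grid is the one listed in the text; everything else is an assumption for illustration: ties go to the lower tick (hardware typically rounds ties to even), values beyond 6 saturate, and the decaying density `exp(-x)` is a hypothetical stand-in for "more values at small magnitudes", not a distribution taken from the paper.

```python
import math

E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Width of each bin between neighbouring ticks.
gaps = [hi - lo for lo, hi in zip(E2M1_MAGNITUDES, E2M1_MAGNITUDES[1:])]


def round_to_nearest(x):
    """Snap x to the closest E2M1 value, saturating at +/-6."""
    sign = -1.0 if x < 0 else 1.0
    magnitude = min(abs(x), E2M1_MAGNITUDES[-1])
    # min() keeps the first of equally close ticks, so ties go to the lower one.
    nearest = min(E2M1_MAGNITUDES, key=lambda tick: abs(tick - magnitude))
    return sign * nearest


def mean_rounding_error(density, upper=6.0, steps=12000):
    """Density-weighted average of (rounded - true) on [0, upper], by midpoint sums."""
    width = upper / steps
    total_weight = 0.0
    total_error = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        weight = density(x)
        total_weight += weight
        total_error += weight * (round_to_nearest(x) - x)
    return total_error / total_weight


def flat(x):
    return 1.0  # every magnitude equally likely


def decaying(x):
    return math.exp(-x)  # hypothetical: mass concentrated near zero


print("gaps:", gaps)
print("4.9 ->", round_to_nearest(4.9), " 1.1 ->", round_to_nearest(1.1))
print("mean error, flat density:    ", mean_rounding_error(flat))
print("mean error, decaying density:", mean_rounding_error(decaying))
```

With a flat density the down-rounds and up-rounds balance and the mean error is essentially zero. With the decaying density the mean error comes out negative, which is the mechanism described above: the grid alone is not biased, but the grid combined with a distribution that thins out toward large magnitudes is.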


Related explainers

LongLive-2.0 — NVFP4 W4A4 across training and inference — another FP4 training story; UFP4 explains why the naive version of that path degrades and how to stop it.
KVarN — Hadamard rotation for a 2-bit KV cache — the same Hadamard outlier-spreading trick, aimed at the KV cache instead of training GEMMs.
Gemma 4 QAT — Quantization-Aware Training — the other way to make low-bit training honest: teach the model about the grid while it trains.

