Mix-Quant — NVFP4 prefill + BF16 decode
The news. On May 19, 2026, a five-author group (Haiquan Lu, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang) posted Mix-Quant to arXiv. The paper applies NVFP4 quantization to the compute-bound prefilling phase while preserving BF16 during decoding, and reports up to 3× prefill speedup with task performance largely preserved across long-context and agentic benchmarks. The recipe sits between two prior camps: W4A16 weight-only (which leaves activations alone) and W4A4 everywhere (LongLive-2.0).
Why prefill and decode aren't the same problem
Picture the bakery. The press stamps out a whole tray of cookies in one squeeze — the harder the press, the faster the tray gets done. The piping bag frosts cookies one at a time, and however strong your grip is, the next cookie still has to travel down the conveyor before you can touch it. The press is compute-bound. The piping bag is conveyor-bound.
Inference works the same way. Prefill processes the entire prompt (say, 8,000 tokens) in one forward pass, as a series of large matmuls. Each layer's weights are loaded from HBM once and reused across 8,000 tokens' worth of work; arithmetic intensity is high, and the tensor cores are pegged. Decode is the opposite: for every output token, the model re-reads its full set of weights from HBM and does one token's worth of math. The tensor cores spend most of their cycles waiting for weights to arrive.
The roofline model draws this with two dots on one chart. Prefill sits up under the compute ceiling, where peak FLOP rate sets the speed limit. Decode sits down on the bandwidth slope, where HBM bytes-per-second sets the speed limit. Raising the compute ceiling pulls prefill up. Decode doesn't move — its ceiling isn't the ceiling it's standing under.
[Figure: Prefill vs Decode on the Roofline. Same GPU, fundamentally different bottlenecks.]
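The two dots can be placed with a back-of-the-envelope calculation. The sketch below estimates arithmetic intensity for a single linear layer at prefill (8,000 tokens) and at decode (1 token), then compares each with the ridge point of a roofline. The layer width (8192) and the peak-FLOP and bandwidth figures are illustrative assumptions, not numbers from the paper or from any specific GPU.

```python
def arithmetic_intensity(tokens, d_in, d_out, bytes_per_elem=2.0):
    """FLOPs per byte moved for y = x @ W, with x of shape (tokens, d_in)."""
    flops = 2.0 * tokens * d_in * d_out
    # Weights are read once; inputs are read and outputs written once each.
    bytes_moved = bytes_per_elem * (d_in * d_out + tokens * d_in + tokens * d_out)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, hbm_bytes_per_s):
    """Roofline: the lower of the compute ceiling and the bandwidth slope."""
    return min(peak_flops, intensity * hbm_bytes_per_s)


# Hypothetical hardware numbers, chosen only to make the shape of the chart visible.
PEAK_FLOPS = 1.0e15        # compute ceiling, FLOP/s (assumed)
HBM_BYTES_PER_S = 4.0e12   # memory bandwidth, bytes/s (assumed)
RIDGE = PEAK_FLOPS / HBM_BYTES_PER_S  # intensity where the two limits meet

D_MODEL = 8192  # assumed layer width
prefill_ai = arithmetic_intensity(tokens=8000, d_in=D_MODEL, d_out=D_MODEL)
decode_ai = arithmetic_intensity(tokens=1, d_in=D_MODEL, d_out=D_MODEL)

for name, ai in (("prefill", prefill_ai), ("decode", decode_ai)):
    bound = "compute-bound" if ai >= RIDGE else "bandwidth-bound"
    rate = attainable_flops(ai, PEAK_FLOPS, HBM_BYTES_PER_S)
    print(f"{name}: {ai:.1f} FLOP/byte, {bound}, attainable {rate:.2e} FLOP/s")
```

Under these assumptions prefill lands at roughly 2,700 FLOP/byte, far to the right of the ridge point, while decode lands near 1 FLOP/byte, far to the left. Raising `PEAK_FLOPS` changes the attainable rate for prefill only.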
What NVFP4 actually changes — and what it doesn't
NVFP4 is a block-scaled 4-bit float exposed on Blackwell tensor cores. Versus BF16, two things happen at once: the matmul does roughly 4× more FLOPs per second on the same tensor cores (that's the "stronger press handle"), and each weight or activation occupies 4 bits instead of 16 — so the HBM bytes per matmul also drop.
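To make "block-scaled 4-bit float" concrete, here is a minimal pure-Python simulation of the idea: values are grouped into blocks, each block gets one scale factor, and every value is rounded to the nearest point on a 4-bit E2M1 grid. This is a sketch under stated assumptions, not the hardware implementation. It assumes a block size of 16 and keeps each scale as an ordinary Python float, whereas the real format stores block scales in a compact 8-bit encoding; the function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # elements sharing one scale factor (assumed)


def fake_quantize_block(block):
    """Round one block to the E2M1 grid; return (dequantized_values, scale)."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto the largest grid point.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    dequantized = []
    for v in block:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(target - g))
        dequantized.append(math.copysign(nearest, v) * scale)
    return dequantized, scale


def fake_quantize(values):
    """Quantize and immediately dequantize a flat list, block by block."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        deq, _ = fake_quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(deq)
    return out


def bits_per_element(value_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Storage cost per element once the shared block scale is amortized."""
    return value_bits + scale_bits / block_size


weights = [0.1 * i for i in range(16)]
approx = fake_quantize(weights)
worst = max(abs(w - a) for w, a in zip(weights, approx))
print(f"worst-case rounding error in this block: {worst:.4f}")
print(f"bits per element: {bits_per_element()} vs 16 for BF16")
```

With an 8-bit scale shared across 16 elements, the amortized cost is 4.5 bits per element, so the byte reduction against BF16 is a little under 4×. The grid is also coarse near its top end (nothing between 4 and 6), which is why a single large outlier in a block costs precision for its neighbours.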
For the press (prefill), both of those wins compound. Prefill matmuls were limited by compute, so 4× faster math shows up directly in wall-clock time. The paper reports up to 3× prefill speedup at NVFP4 vs uniform BF16. The headline number lands below the 4× theoretical ceiling because dequantization overhead, attention kernels kept at higher precision, and other end-to-end overheads eat some of the budget. In exchange, task performance is reported as largely preserved, which is the entire point.
For the piping bag — decode — the same recipe is unappealing. Decode was already waiting on HBM bytes, so the lower byte footprint of NVFP4 weights does help bandwidth a bit (less data to read per token). But the 4× faster math doesn't help at all: the tensor cores were idle. Meanwhile decode is where output-token quality is decided: each token conditions on all the prior tokens, so any precision-induced drift compounds with horizon length. The reward for quantizing decode is small. The risk is large. Spend the precision budget where the reward is large and the risk is small.
Same model, two number formats
This is where the recipe stops being purely an inference trick. Mix-Quant runs at NVFP4 for prefill and BF16 for decode — the same set of weights, with a calibration step that aligns the two regimes across the request lifetime (per the paper's setup notes).
That phase-boundary plumbing is the part that's specific to Mix-Quant. The high-level idea — pick the precision per phase, not per model — generalizes to other splits: weights vs activations, attention vs FFN, prompt-cache prefill vs uncached prefill. The paper's contribution is making the prefill/decode split work cleanly end-to-end with a single set of weights.
Where the recipe earns its keep is agentic LLM workloads (long contexts, many tool calls, lots of cached prompt prefixes), where prefill is a big slice of wall-clock. The larger the prefill share, the larger the end-to-end benefit. For a typical 8,000-token-in / 200-token-out request, prefill is roughly 10–25% of wall-clock on a single GPU (an illustrative, setup-dependent band); cut prefill time to a third and the request finishes ~7–17% faster. For agentic chains that re-do prefill on every step, the per-step saving stacks across the chain.
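The arithmetic behind that 7–17% band is Amdahl's law applied to the prefill share. The sketch below reproduces it; the function names are hypothetical, and the 10–25% shares are the same illustrative band used above rather than measurements.

```python
def fraction_of_time_saved(prefill_share, prefill_speedup):
    """Share of total wall-clock removed when only prefill gets faster."""
    return prefill_share * (1.0 - 1.0 / prefill_speedup)


def end_to_end_speedup(prefill_share, prefill_speedup):
    """Amdahl's law: overall speedup when one phase is accelerated."""
    new_time = (1.0 - prefill_share) + prefill_share / prefill_speedup
    return 1.0 / new_time


for share in (0.10, 0.25):
    saved = fraction_of_time_saved(share, prefill_speedup=3.0)
    overall = end_to_end_speedup(share, prefill_speedup=3.0)
    print(f"prefill share {share:.0%}: {saved:.1%} less wall-clock, {overall:.2f}x overall")
```

The same functions show the limit of the approach: even an infinitely fast prefill can remove at most `prefill_share` of the total time, so the recipe pays off in proportion to how prefill-heavy the workload is.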
| Recipe | Prefill precision | Decode precision | KV cache | Source |
|---|---|---|---|---|
| BF16 baseline | BF16 | BF16 | BF16 | industry default, illustrative |
| W4A16 (e.g. AWQ) | BF16 act, INT4 weights | BF16 act, INT4 weights | BF16 | per-recipe, varies by setup |
| W4A4 (LongLive-2.0) | NVFP4 | NVFP4 | NVFP4 | LongLive-2.0 explainer |
| Mix-Quant (this paper) | NVFP4 | BF16 | BF16 | arXiv 2605.20315 |
(Rows are headline configurations as reported by each paper or release; real-world deployments can swap KV-cache formats independently.)
Where it earns its keep — and where it does not
The win is loudest when prefill share is high: long prompts, agentic tool-loops that re-prefill cached prefixes, RAG with a lot of retrieved context. The win is smallest when decode dominates wall-clock — short prompts, long generations, low-batch decode. A 200-token-in / 4000-token-out request is mostly decode; Mix-Quant gives it almost nothing.
It also matters less when the prefill phase has already been mostly cached. Stacks with strong prefix caching — radix-tree or block-hash — skip the prefill compute for cache-hit prefixes, so there's less prefill left to accelerate. Mix-Quant and prefix caching compose: caching cuts how often prefill runs at all; Mix-Quant cuts how long each remaining prefill takes.
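Both effects, decode-dominated requests and prefix caching, can be seen in a toy per-request timing model. This is a sketch with assumed throughput numbers (10,000 prefill tokens/s and 50 decode tokens/s are placeholders, not figures from the paper), and it treats a cache hit as skipping prefill compute entirely for the cached tokens.

```python
def request_seconds(prompt_tokens, out_tokens, cached_tokens=0,
                    prefill_tok_per_s=10_000.0, decode_tok_per_s=50.0,
                    prefill_speedup=1.0):
    """Toy wall-clock model: uncached prefill time plus decode time."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    prefill = uncached / (prefill_tok_per_s * prefill_speedup)
    decode = out_tokens / decode_tok_per_s
    return prefill + decode


def fraction_saved(prompt_tokens, out_tokens, cached_tokens=0, prefill_speedup=3.0):
    """Relative wall-clock saving from speeding up only the prefill phase."""
    base = request_seconds(prompt_tokens, out_tokens, cached_tokens)
    fast = request_seconds(prompt_tokens, out_tokens, cached_tokens,
                           prefill_speedup=prefill_speedup)
    return (base - fast) / base


scenarios = {
    "long prompt, short output": dict(prompt_tokens=8000, out_tokens=200),
    "short prompt, long output": dict(prompt_tokens=200, out_tokens=4000),
    "long prompt, 75% cached": dict(prompt_tokens=8000, out_tokens=200,
                                    cached_tokens=6000),
}
for name, kwargs in scenarios.items():
    print(f"{name}: {fraction_saved(**kwargs):.2%} of wall-clock saved")
```

In this model the relative saving shrinks as the cache-hit fraction rises, because less prefill is left to accelerate, but the two mechanisms still multiply: caching reduces the number of uncached tokens, and the faster prefill reduces the time spent on each one.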
The recipe is also not a substitute for full W4A4: when memory is the binding constraint (very large models on small clusters), keeping the weights at 4 bits in both phases still pays off, because the byte footprint matters even in decode. Mix-Quant's bet is that on hardware that's compute-rich (Blackwell) and serving short-decode agentic workloads, the asymmetry is real and you should exploit it.
Related explainers
- LongLive-2.0 — NVFP4 W4A4 across training and inference — the symmetric W4A4 alternative
- vLLM v0.20 — FlashAttention 4 packing — the other prefill-phase optimization that ships in the same window
- vLLM v0.20 — TurboQuant 2-bit KV cache — symmetric KV-cache quantization, paired with this style of phase-asymmetric recipe at the system level
- PreFT — Prefill-only adapters — the LoRA analogue: an adapter active only during prefill