Mix-Quant is a May 2026 arXiv paper (Haiquan Lu, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang) that proposes phase-asymmetric quantization for LLM inference: run the prefill matmul at NVFP4 (NVIDIA's 4-bit floating-point format) and keep decode in BF16, end-to-end with the same set of weights. The paper reports up to 3× prefill speedup with preserved task performance across long-context and agentic benchmarks. The recipe sits between W4A16 weight-only quantization (which leaves activations alone) and W4A4 everywhere (LongLive-2.0).

Why quantize prefill but not decode?

Prefill and decode sit on opposite sides of the roofline. Prefill processes the entire prompt in one matmul — arithmetic intensity is high, the tensor cores are pegged, and the workload is compute-bound. Faster math (lower precision NVFP4 on Blackwell tensor cores) lifts the compute ceiling and lifts prefill throughput with it. Decode generates one token at a time, loading the full weight matrix from HBM for each token — arithmetic intensity is low, the workload is memory-bandwidth-bound, and the tensor cores were idle anyway. Faster math does little for decode while introducing precision-induced drift that compounds across the generated tokens. Spending the precision drop where the reward is large (prefill) and the risk is small (no compounding drift) is the whole idea.

How is Mix-Quant different from W4A4 across the whole stack?

W4A4 recipes like LongLive-2.0 use the same 4-bit format end-to-end — prefill, decode, and KV cache — and accept the decode-side quality risk to keep one number format across the stack. Mix-Quant only takes the prefill win and keeps BF16 in decode, so it gives up the (small) bandwidth saving on decode in exchange for keeping the decode-side numerics at the un-quantized format. The two recipes target different parts of the same trade space: W4A4 is best when memory footprint is the binding constraint (very large models on small clusters); Mix-Quant is best on compute-rich hardware (Blackwell) serving short-decode agentic workloads where prefill is a large share of wall-clock.

Mix-Quant paper — NVFP4 prefill + BF16 decode

TL;DR

What is it: The Mix-Quant paper proposes phase-asymmetric quantization: run the prefill GEMM at NVFP4 (NVIDIA's 4-bit floating-point format) and keep decode at BF16. The same model, the same request — but a different number format per phase.
Why it’s needed: Prefill and decode have very different compute profiles. Prefill is one big compute-bound matmul; decode is many small memory-bandwidth-bound per-token matmuls. Dropping precision speeds up the compute-bound phase a lot; on the bandwidth-bound phase it does almost nothing — and risks quality. So spend the precision drop where it pays.
vs previous: Prior W4A4 recipes like LongLive-2.0 use the same 4-bit format for both phases, accepting the decode-side quality risk to keep one number format across the stack. Mix-Quant takes the opposite trade: keep BF16 where 4-bit would barely help anyway, and only collect the prefill win.

Jargon

NVFP4: NVIDIA's 4-bit floating-point tensor format. Each element carries a sign and a small magnitude in 4 bits, and a group of consecutive elements shares one extra scale factor. Blackwell tensor cores run it at their highest advertised peak FLOP rate.
BF16: Brain Floating Point, 16-bit. Same exponent range as FP32 with reduced mantissa precision. The current default activation format on most serving stacks because it tolerates outliers better than FP16.
prefill: The first inference phase: the model processes the entire prompt in one matmul over the full input sequence. Arithmetic intensity is high — many FLOPs per byte loaded.
decode: The second inference phase: the model generates output tokens one at a time, each forward pass loading the full weights for a tiny matmul. Arithmetic intensity is low — few FLOPs per byte loaded.
arithmetic intensity (AI): FLOPs done per byte of memory traffic. A workload's AI plus the hardware's compute and memory-bandwidth roofs determine whether it is compute-bound or bandwidth-bound — the roofline model formalizes this.
compute-bound: The workload waits on the tensor cores. Faster math (lower precision, wider tensor cores) speeds it up. Prefill lives here.
bandwidth-bound: The workload waits on HBM reads/writes. Faster math doesn't help because the tensor cores were idle anyway. Decode lives here.
phase-asymmetric quantization: The paper's umbrella term for the recipe: use one number format in prefill and a different one in decode, picked per-phase by which roofline regime that phase sits in.

The news. On May 19, 2026, a five-author group (Haiquan Lu, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang) posted Mix-Quant to arXiv. The paper applies NVFP4 quantization to the compute-bound prefilling phase while preserving BF16 during decoding, and reports up to 3× prefill speedup with task performance largely preserved across long-context and agentic benchmarks. The recipe sits between two prior camps: W4A16 weight-only (which leaves activations alone) and W4A4 everywhere (LongLive-2.0).

Why prefill and decode aren't the same problem

Picture the bakery. The press stamps out a whole tray of cookies in one squeeze — the harder the press, the faster the tray gets done. The piping bag frosts cookies one at a time, and however strong your grip is, the next cookie still has to travel down the conveyor before you can touch it. The press is compute-bound. The piping bag is conveyor-bound.

Inference works the same way. Prefill processes the entire prompt — say, 8,000 tokens — in one matmul. The weights are loaded from HBM once and reused across 8,000 tokens' worth of work; arithmetic intensity is high, and the tensor cores are pegged. Decode is the opposite: for every output token, the model loads its full weight matrix from HBM and does one token's worth of math. The tensor cores spend most of their cycles waiting for weights to arrive.

The roofline model draws this with two dots on one chart. Prefill sits up under the compute ceiling, where peak FLOP rate sets the speed limit. Decode sits down on the bandwidth slope, where HBM bytes-per-second sets the speed limit. Raising the compute ceiling pulls prefill up. Decode doesn't move, because the compute ceiling isn't the limit it's standing under.
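The two dots can be placed with a few lines of arithmetic. The sketch below is illustrative only: the function names, the 4096×4096 layer shape, and the hardware figures (2e15 FLOP/s peak, 8e12 bytes/s of HBM bandwidth) are assumptions chosen for round numbers, not values from the paper or from any GPU datasheet.

```python
def matmul_intensity(n_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for one (n_tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * n_tokens * d_in * d_out  # one multiply and one add per term
    # Weights are read once; input and output activations scale with tokens.
    bytes_moved = bytes_per_elem * (
        d_in * d_out + n_tokens * d_in + n_tokens * d_out
    )
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, hbm_bandwidth):
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * hbm_bandwidth)


def regime(intensity, peak_flops, hbm_bandwidth):
    """Classify a workload by which side of the ridge point it sits on."""
    ridge = peak_flops / hbm_bandwidth
    return "compute-bound" if intensity >= ridge else "bandwidth-bound"


if __name__ == "__main__":
    PEAK, BW = 2e15, 8e12  # hypothetical hardware, for illustration only
    for name, tokens in (("prefill", 8000), ("decode", 1)):
        ai = matmul_intensity(tokens, 4096, 4096)
        print(name, round(ai, 1), regime(ai, PEAK, BW))
```

Under these assumed numbers, quadrupling `peak_flops` quadruples the attainable rate for the 8,000-token case and leaves the single-token case exactly where it was.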

Figure: Prefill vs Decode on the Roofline. Same GPU, fundamentally different bottlenecks.

What NVFP4 actually changes — and what it doesn't

NVFP4 is a block-scaled 4-bit float exposed on Blackwell tensor cores. Versus BF16, two things happen at once: the matmul does roughly 4× more FLOPs per second on the same tensor cores (that's the "stronger press handle"), and each weight or activation occupies 4 bits instead of 16 — so the HBM bytes per matmul also drop.
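To make "block-scaled 4-bit float" concrete, here is a minimal quantize-then-dequantize sketch in plain Python. It takes several things as assumptions: a block size of 16, the eight magnitudes of a 4-bit E2M1 value, and a scale kept as an ordinary Python float rather than an 8-bit encoding. The function names are hypothetical, and this is not the paper's kernel or NVIDIA's implementation.

```python
# Magnitudes representable by a sign + 2-bit exponent + 1-bit mantissa value
# (assumed layout for the 4-bit elements).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed number of consecutive elements sharing one scale


def quantize_block(block):
    """Return (scale, codes): one shared scale plus one 4-bit value each."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest code (6.0).
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest representable one.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from the shared scale and the codes."""
    return [scale * c for c in codes]


def fake_quantize(values, block=BLOCK):
    """Quantize and immediately dequantize, block by block."""
    out = []
    for i in range(0, len(values), block):
        scale, codes = quantize_block(values[i:i + block])
        out.extend(dequantize_block(scale, codes))
    return out
```

Because the scale is chosen per block, one outlier only coarsens the grid for its own 16 neighbours rather than for the whole tensor.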

For the press (prefill), both of those wins land. The matmul was limited by compute, so 4× faster math shows up directly in wall-clock time. The paper reports up to 3× prefill speedup at NVFP4 versus uniform BF16. The headline number lands just below the 4× theoretical figure because dequantization overhead, attention kernels kept at higher precision, and other end-to-end overheads eat some of the budget. Part of that shortfall is the price of preserved task performance, which is the entire point.

For the piping bag — decode — the same recipe is unappealing. Decode was already waiting on HBM bytes, so the lower byte footprint of NVFP4 weights does help bandwidth a bit (less data to read per token). But the 4× faster math doesn't help at all: the tensor cores were idle. Meanwhile decode is where output-token quality is decided: each token conditions on all the prior tokens, so any precision-induced drift compounds with horizon length. The reward for quantizing decode is small. The risk is large. Spend the precision budget where the reward is large and the risk is small.

Same model, two number formats

This is where the recipe stops being purely an inference-time trick: Mix-Quant runs NVFP4 for prefill and BF16 for decode from the same set of weights, with a calibration step that aligns the two regimes across the request lifetime (per the paper's setup notes).

That phase-boundary plumbing is the part that's specific to Mix-Quant. The high-level idea — pick the precision per phase, not per model — generalizes to other splits: weights vs activations, attention vs FFN, prompt-cache prefill vs uncached prefill. The paper's contribution is making the prefill/decode split work cleanly end-to-end with a single set of weights.

Where the recipe earns its keep is agentic LLM workloads (long contexts, many tool calls, lots of repeated prompt prefixes), where prefill is a big slice of wall-clock. The larger the prefill share, the larger the end-to-end benefit. For a typical 8,000-token-in / 200-token-out request, prefill is roughly 10–25% of wall-clock on a single GPU (an illustrative, setup-dependent band); cut prefill time to a third and the request's wall-clock drops by roughly 7–17%. For agentic chains that re-do prefill on every step, the per-step saving stacks across the chain.
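The 7–17% figure is plain Amdahl-style arithmetic: only the prefill slice shrinks. A minimal sketch, with a hypothetical function name, assuming decode time is unchanged:

```python
def end_to_end_saving(prefill_share, prefill_speedup):
    """Fraction of request wall-clock removed when only prefill gets faster.

    prefill_share: prefill's fraction of the original wall-clock (0 to 1).
    prefill_speedup: factor by which prefill time shrinks (3.0 means 3x).
    """
    if not 0.0 <= prefill_share <= 1.0:
        raise ValueError("prefill_share must be between 0 and 1")
    if prefill_speedup <= 0:
        raise ValueError("prefill_speedup must be positive")
    # Decode is untouched, so the saving is the part of prefill that vanishes.
    return prefill_share * (1.0 - 1.0 / prefill_speedup)


if __name__ == "__main__":
    for share in (0.10, 0.25):
        print(f"share={share:.2f} saving={end_to_end_saving(share, 3.0):.3f}")
```

The same function shows why decode-heavy requests gain little: as `prefill_share` approaches zero, so does the saving, whatever the speedup.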

Recipe	Prefill precision	Decode precision	KV cache	Source
BF16 baseline	BF16	BF16	BF16	industry default, illustrative
W4A16 (e.g. AWQ)	BF16 act, INT4 weights	BF16 act, INT4 weights	BF16	per-recipe, varies by setup
W4A4 (LongLive-2.0)	NVFP4	NVFP4	NVFP4	LongLive-2.0 explainer
Mix-Quant (this paper)	NVFP4	BF16	BF16	arXiv 2605.20315

(Rows are headline configurations as reported by each paper or release; real-world deployments can swap KV-cache formats independently.)

Where it earns its keep — and where it does not

The win is loudest when prefill share is high: long prompts, agentic tool loops that re-run prefill over long prefixes, RAG with a lot of retrieved context. The win is smallest when decode dominates wall-clock: short prompts, long generations, low-batch decode. A 200-token-in / 4,000-token-out request is mostly decode; Mix-Quant gives it almost nothing.

It also matters less when the prefill phase has already been mostly cached. Stacks with strong prefix caching — radix-tree or block-hash — skip the prefill compute for cache-hit prefixes, so there's less prefill left to accelerate. Mix-Quant and prefix caching compose: caching cuts how often prefill runs at all; Mix-Quant cuts how long each remaining prefill takes.
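One way to see how the two compose is a toy cost model in which the savings multiply. The sketch below assumes prefill cost is linear in uncached tokens, which ignores attention's growth with context length. The function name and the per-token cost are hypothetical.

```python
def prefill_seconds(prompt_tokens, cache_hit_fraction, seconds_per_token,
                    prefill_speedup=1.0):
    """Toy prefill time: caching removes tokens, quantization speeds the rest."""
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    if prefill_speedup <= 0:
        raise ValueError("prefill_speedup must be positive")
    # Prefix caching: only tokens that miss the cache need prefill compute.
    uncached_tokens = prompt_tokens * (1.0 - cache_hit_fraction)
    # Faster prefill math: each remaining token costs less time.
    return uncached_tokens * seconds_per_token / prefill_speedup


if __name__ == "__main__":
    base = prefill_seconds(8000, 0.0, 1e-4)
    both = prefill_seconds(8000, 0.5, 1e-4, prefill_speedup=3.0)
    print(f"baseline={base:.3f}s cached_and_quantized={both:.3f}s")
```

In this model a higher hit rate shrinks the absolute time the speedup can remove, matching the point above that there is less prefill left to accelerate.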

The recipe is also not a substitute for full W4A4: when memory is the binding constraint (very large models on small clusters), W4A4 for the weights everywhere still pays off, because the byte footprint matters even in decode. Mix-Quant's bet is that on hardware that's compute-rich (Blackwell) and serving short-decode agentic workloads, the asymmetry is real and you should exploit it.


Related explainers

LongLive-2.0 — NVFP4 W4A4 across training and inference — the symmetric W4A4 alternative
vLLM v0.20 — FlashAttention 4 packing — the other prefill-phase optimization that ships in the same window
vLLM v0.20 — TurboQuant 2-bit KV cache — symmetric KV-cache quantization, paired with this style of phase-asymmetric recipe at the system level
PreFT — Prefill-only adapters — the LoRA analogue: an adapter active only during prefill
