NVFP4 is NVIDIA's 4-bit floating-point tensor format. Each element carries a sign and a small magnitude in 4 bits, and a group of consecutive elements shares one extra scale factor so the block as a whole retains dynamic range. The format is exposed on Blackwell tensor cores, where it unlocks the highest peak FLOP rate currently advertised. LongLive-2.0 uses NVFP4 for both weight/activation matmuls (W4A4) and the KV cache, eliminating the FP16 staging zones earlier recipes had to keep around.

Why apply NVFP4 to training, not just inference?

Earlier recipes quantize only at deploy time because training was thought to need wider dynamic range for gradients. LongLive-2.0 keeps the format consistent across training and inference so it never has to be swapped at the training-to-deploy boundary. That single envelope avoids the FP16 staging zones where bandwidth wins evaporate, and lets the training GEMM ride the Blackwell tensor-core FP4 peak just like inference does. The paper reports a 2.15× training speedup against FP16 baselines on a 5B-parameter long-video diffusion model.

How does this differ from FP8 KV cache or W4A16 weight quantization?

FP8 KV cache (TensorRT-LLM) and W4A16 weight-only inference quantization (e.g. AWQ) compress one numerical surface while leaving others in FP16. Each leftover FP16 surface is a bandwidth bottleneck. LongLive-2.0 closes all three at once — training GEMM, inference GEMM, and KV cache — with a single 4-bit format, so the savings compound rather than being undone at each boundary. The paper's headline numbers (45.7 FPS, 2.15× training, 1.84× inference) come from that end-to-end coverage.

LongLive-2.0 — NVFP4 W4A4 across training and inference

TL;DR

What is it: The LongLive-2.0 system from NVIDIA Research is described as the first end-to-end stack to run both training and inference of a long-form video generation model in NVFP4 — a 4-bit floating-point format with per-block scaling — covering matrix multiplies and the KV cache.
Why it’s needed: Long-video diffusion is throughput-starved: a 5B model in FP16 sees its KV cache and inter-GPU activation traffic dominate per-frame cost, so quantizing every numerical surface to NVFP4 reportedly delivers 2.15× training speedup, 1.84× inference speedup, and 45.7 FPS at the 2-step generation setting.
vs previous: Earlier low-precision recipes apply FP8 to inference only (TensorRT-LLM FP8 KV cache) or BF16/FP8 to training only (NVIDIA Transformer Engine), so the format swaps mid-pipeline; LongLive-2.0 keeps a single NVFP4 envelope across the full training+inference stack, removing the FP16 fallback regions where bandwidth previously bottlenecked.

Jargon

NVFP4: NVIDIA's 4-bit floating-point tensor format with per-block scaling. A small block of values shares one scale factor, so each individual element carries only a sign plus a few bits of magnitude. The shared scale recovers dynamic range that a flat 4-bit float would lose, which is what makes the format usable for matrix-multiply inputs rather than only for compressed weight storage.
W4A4: The convention W{n}A{m} describes the precision of weights (W) and activations (A) into a GEMM. W4A4 means BOTH inputs to the matrix multiply are 4-bit. Older recipes like W4A16 keep activations in FP16 — they save weight memory but still pay FP16 for the activation tile streaming through the tensor core, so they don't get the full peak FLOP rate.
KV cache: The stored key and value tensors from every previous token, kept around so each new token's attention can read them without recomputing. At long context the cache dominates memory and inter-GPU traffic — quantizing it to NVFP4 attacks that dominant cost directly.
Per-block scaling: Instead of one global scale per tensor, a block-scaled format groups consecutive values into small blocks and stores one shared scale per block. Outlier-heavy blocks get a large scale, calm blocks get a small one, preserving precision where it matters. NVFP4 follows this block-FP idea, akin to the open MX / block-scaled quantization family.
GEMM: General matrix multiplication — the dominant operation in every transformer forward and backward pass. On Blackwell tensor cores the GEMM input format determines the achievable peak FLOP rate; NVFP4 unlocks the highest peak rate currently exposed.
Denoising steps: A diffusion model generates output by iteratively refining a noisy tensor — each denoising step is a full forward pass. LongLive-2.0 is distilled from 4 denoising steps per frame down to 2; the 45.7 FPS figure is measured at the 2-step setting, and the NVFP4 stack makes each of those iterations cheap.
Long-video diffusion: Generative video models like LongLive run multiple denoising passes per frame across many frames, accumulating a long KV history. Per-frame throughput is the metric that decides whether the model can run interactively — 45.7 FPS at 5B params is the headline payoff for LongLive-2.0.
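
The per-block scaling idea from the jargon list can be sketched in a few lines of plain Python. This is an illustration only, not NVIDIA's kernel or the paper's code: the 16-element block size, the E2M1 magnitude grid, and the helper names (`quantize_block`, `dequantize_block`, and so on) are assumptions made for the example, and the scale is kept as an ordinary Python float rather than in whatever narrow format the hardware stores it.

```python
# Illustrative block-scaled 4-bit float quantization (assumed layout, not NVIDIA's kernel).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes of a 4-bit E2M1 float
BLOCK = 16  # assumed number of elements sharing one scale


def quantize_block(values):
    """Return (scale, codes); each code is a signed value from the E2M1 grid."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the top of the grid.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        mag = min(E2M1_GRID, key=lambda g: abs(g - target))  # round to nearest grid point
        codes.append(-mag if v < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


def quantize(values):
    """Split a flat list into blocks and quantize each with its own scale."""
    return [quantize_block(values[i:i + BLOCK]) for i in range(0, len(values), BLOCK)]


def dequantize(blocks):
    out = []
    for scale, codes in blocks:
        out.extend(dequantize_block(scale, codes))
    return out


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# One calm block followed by one block that contains a large outlier.
calm = [0.01 * i for i in range(16)]
spiky = [0.01 * i for i in range(15)] + [48.0]
data = calm + spiky

# Per-block scaling: each 16-element block gets its own scale.
blocked = dequantize(quantize(data))

# Flat scaling: the same grid, but one scale shared by all 32 values.
flat_scale, flat_codes = quantize_block(data)
flat = dequantize_block(flat_scale, flat_codes)

calm_err_blocked = max_abs_error(calm, blocked[:16])
calm_err_flat = max_abs_error(calm, flat[:16])
print(f"calm-block error, per-block scale: {calm_err_blocked:.4f}")
print(f"calm-block error, single scale:    {calm_err_flat:.4f}")
```

With a single scale, the outlier forces every calm value to round to zero; with per-block scales, the calm block keeps its own resolution while the outlier block still represents its largest value exactly.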

The news. On May 18, 2026, NVIDIA researchers posted LongLive-2.0 to arXiv, described as the first NVFP4 parallel training and inference stack for long-form video generation. On a 5B-parameter model the system reaches 45.7 FPS at the 2-step generation setting, 2.15× training speedup, and 1.84× inference speedup, with NVFP4 applied to weight/activation matrix multiplies (W4A4) and the KV cache — shrinking inter-GPU communication along the way.

Picture the library. The fiction wing restocks constantly — new arrivals, returns, reshelving — and that traffic is training. The reference wing gets scanned by every visitor, page by page, with no waiting allowed — that is inference. And the stack-room archive in the basement just grows: an ever-longer row of volumes recording everything that has ever come through, ready for whoever asks — that is the KV cache. In FP16, every book on every shelf is a hardcover: thick spines, heavy carts, four to a shelf. Worse, every wing has to use the same hardcover format, so the cart-paths between wings move thick books too. The branch library next door also gets hardcovers shipped in — that is the inter-GPU traffic dominating Blackwell's high-bandwidth memory and NVLink between chips.

LongLive-2.0 swaps every shelf — fiction, reference, and archive — for paperbacks in one coordinated move. The paperback is NVFP4: a 4-bit floating-point number with a small group of neighbours sharing one extra scale factor so the group as a whole keeps its dynamic range. Sixteen paperbacks now fit where four hardcovers used to, on every shelf in every wing. The wings still hand things off the same way they always did; it is just that the books moving between them are paperback now. The carts move faster, the branch shipments are quarter-weight, and the archive in the basement is one-quarter as long for the same number of recorded volumes.

The crucial design choice is that the format does not swap mid-pipeline. Earlier low-precision recipes are partial: W4A16 recipes such as AWQ keep weights in INT4 but do the GEMM in FP16, and NVIDIA's Transformer Engine pushes BF16 or FP8 through training while inference servers fall back to FP16 around the KV cache. Each handoff is a place where the precision envelope expands again, and bandwidth bottlenecks re-emerge at exactly those expansion points. LongLive-2.0's stack keeps one NVFP4 envelope around the entire library (training GEMM, inference GEMM, and the KV cache), so the bandwidth wins compound instead of being undone at each boundary.

Recipe	Training	Inference GEMM	KV cache	Format swap?
Vanilla FP16 baseline	FP16 / BF16	FP16	FP16	No
FP8 inference (TensorRT-LLM, vendor docs)	BF16	FP8 W8A8	FP8 KV	Yes — BF16 → FP8 at deploy
W4A16 inference quant (e.g. AWQ)	FP16 / BF16	INT4 weight × FP16 activation	FP16	Yes — weights only
LongLive-2.0 (this work)	NVFP4 W4A4	NVFP4 W4A4	NVFP4	No
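
One way to read the table is as data: list the format on each numerical surface, then check whether all surfaces agree. The sketch below transcribes the four rows and derives the "Format swap?" column from the others. The surface names, the `has_format_swap` and `wide_surfaces` helpers, and the choice of FP16 wherever the table says "FP16 / BF16" are assumptions made for the example, and NVFP4 is counted at its nominal 4-bit element width with scale overhead ignored.

```python
# Nominal bits per element for each format named in the comparison table.
# NVFP4 is listed at its 4-bit element width; per-block scale overhead is ignored.
BITS = {"FP16": 16, "BF16": 16, "FP8": 8, "INT4": 4, "NVFP4": 4}

# Each recipe maps a numerical surface to the format the table lists for it.
# Where the table says "FP16 / BF16", this sketch arbitrarily picks FP16.
RECIPES = {
    "Vanilla FP16 baseline": {
        "training": "FP16",
        "inference_weights": "FP16",
        "inference_activations": "FP16",
        "kv_cache": "FP16",
    },
    "FP8 inference": {
        "training": "BF16",
        "inference_weights": "FP8",
        "inference_activations": "FP8",
        "kv_cache": "FP8",
    },
    "W4A16 inference quant": {
        "training": "FP16",
        "inference_weights": "INT4",
        "inference_activations": "FP16",
        "kv_cache": "FP16",
    },
    "LongLive-2.0": {
        "training": "NVFP4",
        "inference_weights": "NVFP4",
        "inference_activations": "NVFP4",
        "kv_cache": "NVFP4",
    },
}


def has_format_swap(recipe):
    """True when the surfaces do not all share one format."""
    return len(set(recipe.values())) > 1


def wide_surfaces(recipe):
    """Names of the surfaces still held at 16 bits, in alphabetical order."""
    return sorted(name for name, fmt in recipe.items() if BITS[fmt] == 16)


for name, recipe in RECIPES.items():
    swap = "yes" if has_format_swap(recipe) else "no"
    wide = ", ".join(wide_surfaces(recipe)) or "none"
    print(f"{name}: format swap = {swap}; 16-bit surfaces = {wide}")
```

Only the last row has both no format swap and no 16-bit surface left, which is the property the article calls a single envelope.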

A worked example shows where the savings come from. Imagine a single attention layer at a 64,000-token context with 128 KV heads at head dim 128, in FP16 (an illustrative shape; the paper does not disclose the LongLive-2.0 KV geometry). The KV cache footprint is 2 × 64,000 × 128 × 128 × 2 bytes ≈ 4.2 GB per layer. Re-quantizing the cache to NVFP4 drops the per-element storage from 2 bytes to 0.5 bytes plus a small per-block scale overhead, call it ~0.55 bytes effective. The cache becomes 2 × 64,000 × 128 × 128 × 0.55 ≈ 1.15 GB, about 3.6× smaller. The same factor shrinks every inter-GPU transfer that carries KV pages between sharded attention workers, which is why the paper reports that inter-GPU communication overhead drops alongside the headline 1.84× inference speedup. (The geometry is illustrative and the 1.84× figure is the paper's measurement; the 3.6× shrink approximates the paper's headline "~4× KV footprint reduction" claim.) The training side reaches 2.15× speedup because every backward-pass GEMM also runs through the same NVFP4 path on Blackwell tensor cores, where FP4 unlocks the peak FLOP rate.
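
The arithmetic above is easy to reproduce. The sketch below uses the article's illustrative shape and its ~0.55 bytes-per-element figure; `kv_cache_bytes` and `effective_bytes` are hypothetical helper names, and the 16-element block with a 1-byte scale passed to `effective_bytes` is an assumption about the layout, not something the article states.

```python
def kv_cache_bytes(tokens, kv_heads, head_dim, bytes_per_element):
    """Per-layer KV cache size; the factor of 2 covers the key and value tensors."""
    return 2 * tokens * kv_heads * head_dim * bytes_per_element


def effective_bytes(element_bits=4, block_size=16, scale_bytes=1):
    """Bytes per element once a shared per-block scale is amortized over the block.

    The default block size and scale width are assumptions for illustration.
    """
    return element_bits / 8 + scale_bytes / block_size


# Illustrative geometry from the worked example (not the paper's disclosed shape).
TOKENS, KV_HEADS, HEAD_DIM = 64_000, 128, 128

fp16 = kv_cache_bytes(TOKENS, KV_HEADS, HEAD_DIM, 2.0)
nvfp4 = kv_cache_bytes(TOKENS, KV_HEADS, HEAD_DIM, 0.55)  # the article's rounded figure

print(f"FP16 cache:  {fp16 / 1e9:.2f} GB")
print(f"NVFP4 cache: {nvfp4 / 1e9:.2f} GB")
print(f"shrink:      {fp16 / nvfp4:.2f}x")

# Under the assumed 16-element block with a 1-byte scale, the overhead is slightly higher.
assumed = effective_bytes()
print(f"assumed bytes/element: {assumed:.4f}, shrink: {2.0 / assumed:.2f}x")
```

Sizes are printed in decimal gigabytes (10^9 bytes). Either overhead estimate lands a little below the ideal 4×, because the shared scales take up space the raw 4-bit elements do not.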

The pedagogical lesson is that an end-to-end precision envelope is structurally different from a chain of point quantizations. Each format swap inside a model pipeline introduces an FP16 staging zone, and inside that zone the bandwidth advantages of low precision evaporate. LongLive-2.0's contribution is less about the per-element format (NVFP4 itself is a public NVIDIA primitive) and more about closing the last FP16 gap: the KV cache, where most prior recipes stop short. Whether the same envelope generalizes from long-form video diffusion to other long-context architectures is an open question; the paper's empirical demonstration is the 5B video model, not decoder-only LLM inference.

Related explainers

SOP paper — Hardware-aware per-layer PTQ at FP6 — a different precision choice (FP6) made per layer, contrasting with the uniform NVFP4 envelope.
vLLM v0.20 — TurboQuant 2-bit KV cache — block-scaled KV quantization at 2 bits, the inference-only counterpart to LongLive-2.0's KV path.
QCA paper — Outlier injection across AWQ/GPTQ/GGUF — how block-scaled formats interact with single-value outliers, the same family of quantization risks NVFP4 inherits.
