LongLive-2.0 — NVFP4 W4A4 across training and inference

[Figure: One NVFP4 envelope across training GEMM, inference GEMM, and KV cache. Panels: Training GEMM (forward + backward pass; FP16 / BF16, 100% baseline), Inference GEMM (decode-time matmul; FP16, 100% baseline), KV cache (long-context storage; FP16 cache, 100% baseline). Tile width = bits per element; 16 bits for FP16 / BF16.]
learnaivisually.com/ai-explained/longlive-2-0-nvfp4-w4a4-training-inference

The news. On May 18, 2026, NVIDIA researchers posted LongLive-2.0 to arXiv, described as the first NVFP4 parallel training and inference stack for long-form video generation. On a 5B-parameter model the system reaches 45.7 FPS at the 2-step generation setting, 2.15× training speedup, and 1.84× inference speedup, with NVFP4 applied to weight/activation matrix multiplies (W4A4) and the KV cache — shrinking inter-GPU communication along the way.

Picture the library. The fiction wing restocks constantly — new arrivals, returns, reshelving — and that traffic is training. The reference wing gets scanned by every visitor, page by page, with no waiting allowed — that is inference. And the stack-room archive in the basement just grows: an ever-longer row of volumes recording everything that has ever come through, ready for whoever asks — that is the KV cache. In FP16, every book on every shelf is a hardcover: thick spines, heavy carts, four to a shelf. Worse, every wing has to use the same hardcover format, so the cart-paths between wings move thick books too. The branch library next door also gets hardcovers shipped in — that is the inter-GPU traffic dominating Blackwell's high-bandwidth memory and NVLink between chips.

LongLive-2.0 swaps every shelf — fiction, reference, and archive — for paperbacks in one coordinated move. The paperback is NVFP4: a 4-bit floating-point number with a small group of neighbours sharing one extra scale factor so the group as a whole keeps its dynamic range. Sixteen paperbacks now fit where four hardcovers used to, on every shelf in every wing. The wings still hand things off the same way they always did; it is just that the books moving between them are paperback now. The carts move faster, the branch shipments are quarter-weight, and the archive in the basement is one-quarter as long for the same number of recorded volumes.
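To make the "neighbours sharing one scale factor" idea concrete, here is a minimal pure-Python sketch of NVFP4-style block quantization. It is an illustration under stated assumptions, not NVIDIA's kernel or the paper's code: the 16-element block size and the E2M1 magnitude grid follow NVIDIA's public description of the format, the per-block scale is kept as an ordinary Python float rather than an FP8 value, and the function names are hypothetical.

```python
# Simplified sketch of NVFP4-style block quantization (illustrative only).
# Assumptions: 16-element blocks, the E2M1 magnitude grid, and a plain Python
# float as the per-block scale (the real format stores that scale in FP8).

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes a 4-bit E2M1 value can take
BLOCK = 16  # number of neighbouring elements that share one scale factor


def quantize_block(values):
    """Return (scale, codes); each code is a signed value from E2M1_GRID."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def quantize(values):
    """Split a flat list into blocks and quantize each block independently."""
    return [quantize_block(values[i:i + BLOCK]) for i in range(0, len(values), BLOCK)]


def dequantize(blocks):
    """Rebuild approximate values by multiplying each code by its block scale."""
    out = []
    for scale, codes in blocks:
        out.extend(c * scale for c in codes)
    return out


def bits_per_element(scale_bits=8):
    """Effective storage cost: 4 bits per element plus the amortized block scale."""
    return 4 + scale_bits / BLOCK
```

Because the scale is chosen per block, a block of small values and a block of large values each use the full 4-bit grid, which is how the group keeps its dynamic range. Under these assumptions `bits_per_element()` comes to 4.5 bits, slightly more than the nominal 4.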

The crucial design choice is that the format does not swap mid-pipeline. Earlier low-precision recipes are partial: TensorRT-LLM keeps weights in INT4 but does the GEMM in FP16, while NVIDIA's Transformer Engine pushes BF16 or FP8 through training but inference servers fall back to FP16 around the KV cache. Each handoff is a place where the precision envelope expands again, and bandwidth bottlenecks re-emerge at exactly those expansion points. LongLive-2.0's stack keeps one NVFP4 envelope around the entire library (training GEMM, inference GEMM, and the KV cache), so the bandwidth wins compound instead of being undone at each boundary.

| Recipe | Training | Inference GEMM | KV cache | Format swap? |
|---|---|---|---|---|
| Vanilla FP16 baseline | FP16 / BF16 | FP16 | FP16 | No |
| FP8 inference (TensorRT-LLM, vendor docs) | BF16 | FP8 W8A8 | FP8 KV | Yes: BF16 → FP8 at deploy |
| W4A16 inference quant (e.g. AWQ) | FP16 / BF16 | INT4 weight × FP16 activation | FP16 | Yes: weights only |
| LongLive-2.0 (this work, ~bibliographic) | NVFP4 W4A4 | NVFP4 W4A4 | NVFP4 | No |

A worked example gives a sense of where the savings come from. Imagine a single attention layer at a 64,000-token context with 128 KV heads at head dim 128, in FP16 (an illustrative shape; the paper does not disclose the LongLive-2.0 KV geometry). The KV cache footprint is 2 × 64,000 × 128 × 128 × 2 bytes ≈ 4.2 GB per layer in FP16. Re-quantizing the cache to NVFP4 drops the per-element storage from 2 bytes to 0.5 bytes plus a small per-block scale overhead, so call it ~0.55 bytes effective. The cache becomes 2 × 64,000 × 128 × 128 × 0.55 ≈ 1.15 GB, or ~3.6× smaller. The same factor shrinks every inter-GPU transfer that carries KV pages between sharded attention workers, which is why the paper reports that the inter-GPU communication overhead drops alongside the headline 1.84× inference speedup. (The geometry here is illustrative; the 1.84× figure is the paper's measurement, and the 3.6× shrink approximates the paper's headline "~4× KV footprint reduction" claim.) The training side reaches a 2.15× speedup because every backward-pass GEMM also runs through the same NVFP4 path on Blackwell tensor cores, where FP4 unlocks the peak FLOP rate.
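The footprint arithmetic above can be reproduced with a few lines of Python. This is a sketch of the illustrative geometry only (64,000 tokens, 128 KV heads, head dim 128), not the model's real configuration; the helper name `kv_cache_bytes` is hypothetical, the ~0.55 bytes per element is the rough effective figure used in the text, and sizes are reported in decimal gigabytes.

```python
def kv_cache_bytes(tokens, kv_heads, head_dim, bytes_per_element):
    """Per-layer KV cache size in bytes for one sequence."""
    # The factor of 2 covers the separate key and value tensors.
    return 2 * tokens * kv_heads * head_dim * bytes_per_element


# Illustrative shape from the text, not the paper's actual KV geometry.
TOKENS, KV_HEADS, HEAD_DIM = 64_000, 128, 128

fp16 = kv_cache_bytes(TOKENS, KV_HEADS, HEAD_DIM, 2.0)     # 16-bit elements
nvfp4 = kv_cache_bytes(TOKENS, KV_HEADS, HEAD_DIM, 0.55)   # 4 bits plus scale overhead (assumed)

print(f"FP16 cache:  {fp16 / 1e9:.2f} GB per layer")
print(f"NVFP4 cache: {nvfp4 / 1e9:.2f} GB per layer")
print(f"Shrink:      {fp16 / nvfp4:.2f}x")
```

Changing `TOKENS` or the per-element cost shows how the ratio depends only on bytes per element, while the absolute saving grows linearly with context length.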

The pedagogical lesson is that an end-to-end precision envelope is structurally different from a chain of point quantizations. Each format swap inside an LLM pipeline introduces an FP16 staging zone — and inside that zone the bandwidth advantages of low precision evaporate. LongLive-2.0's contribution is less about the per-element format (NVFP4 itself is a public NVIDIA primitive) and more about closing the last FP16 gap: the KV cache, where most prior recipes stop short. Whether the same envelope generalizes from long-form video diffusion to other long-context architectures is an open question; the paper's empirical demonstration is the 5B video model, not LLM decoder-only inference.
