The news. In April 2026, vLLM v0.20.0 shipped with FlashAttention 4 as the new default attention backend, alongside CUDA 13.0, PyTorch 2.11, and day-0 support for DeepSeek V4 and Hunyuan v3. FA4's headline addition is packed variable-length attention: a single fused kernel that processes a whole batch of mixed-length sequences without padding them to the longest one. Read the release →
Picture eight checkout lines. Customers keep arriving, and each line ends up a different length. The old way of running attention pads every line out to the longest one with empty slots — the cashier still walks past every slot, doing nothing at the empty ones, just to keep the schedule uniform. That uniform-schedule trick is what makes batched GPU attention easy to pipeline, but it means one long sequence in a batch forces every short sequence to pretend it has the same length.
FA4 packing rewires the floor. All customers from all lines merge into one continuous express lane, ordered back-to-back, with block-diagonal dividers marking where one group ends and the next begins. The cashier — the attention kernel — walks the merged lane once, processes only real customers, and uses the dividers to make sure no group's tokens accidentally attend to another group's tokens. The fused kernel handles the whole batch in one launch.
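In code, the merged lane is just every real token concatenated, with a cumulative-length vector marking the dividers. The sketch below (PyTorch) builds both layouts for a toy batch; the cu_seqlens name and the block-diagonal mask construction follow the convention of the existing flash-attn variable-length kernels, and whether FA4 exposes exactly this interface is an assumption here.

```python
# Minimal sketch (PyTorch): padded vs. packed layouts for a mixed-length batch.
# cu_seqlens follows the flash-attn varlen convention; FA4's exact API may differ.
import torch

lengths = torch.tensor([3, 5, 2])                  # three sequences of different lengths
seqs = [torch.randn(n, 64) for n in lengths]       # per-sequence token embeddings (dim 64)

# Padded layout: every sequence stretched to the longest one (5 slots each).
padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)    # shape (3, 5, 64)

# Packed layout: all real tokens back-to-back, no padding slots.
packed = torch.cat(seqs, dim=0)                    # shape (10, 64): 3 + 5 + 2 tokens
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long),
                        lengths.cumsum(0)])        # tensor([0, 3, 8, 10]): the dividers

# The block-diagonal rule the kernel enforces: token i may attend to token j
# only when both fall between the same pair of dividers (same sequence).
seq_id = torch.repeat_interleave(torch.arange(len(lengths)), lengths)  # [0,0,0,1,1,1,1,1,2,2]
block_diag_mask = seq_id[:, None] == seq_id[None, :]                   # (10, 10) boolean

print(padded.shape, packed.shape, cu_seqlens.tolist())
```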
The underlying engine is still classic FlashAttention: compute attention tile by tile on-chip, never materialize the full N×N score matrix in HBM, fold each tile into a running result with the online softmax trick. FA4 keeps all of that and adds two things — tighter pipelining of K/V loads with the matmul (closing FA3's bubbles on Hopper) and first-class support for packed layouts so KV cache reads line up with the merged lane.
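The loop below is a minimal single-head PyTorch sketch of that tile-by-tile pass: it walks K/V in chunks, keeps only a running max, a running softmax denominator, and a running weighted sum of V, and never holds the full N×N score matrix. It is a reference illustration of the online softmax, not the fused kernel itself; the tile size of 128 is arbitrary.

```python
# Sketch (PyTorch, single head): streaming attention with the online softmax.
# The real kernel does this per tile in on-chip SRAM with fused matmuls.
import torch

def streaming_attention(q, k, v, tile=128):
    """q: (Nq, d), k/v: (Nk, d). Numerically matches softmax(q @ k.T / sqrt(d)) @ v."""
    d = q.shape[-1]
    scale = d ** -0.5
    out = torch.zeros_like(q)                              # running weighted sum of V
    row_max = torch.full((q.shape[0], 1), float("-inf"))   # running max per query row
    row_sum = torch.zeros(q.shape[0], 1)                   # running softmax denominator

    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        scores = (q @ k_t.T) * scale                       # only one tile of scores at a time
        tile_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, tile_max)
        # Rescale the previous accumulators to the new max, then fold in this tile.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_t
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1000, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(streaming_attention(q, k, v), ref, atol=1e-4)
```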
[Figure: tiling, load once into shared memory, reuse many times. A 16×16 tile of A plus a 16×16 tile of B are staged in shared memory (SRAM); 256 threads then read from SRAM, so HBM reads per step are 2 tiles (512 floats) rather than 256 full rows.]
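The payoff of that staging step is plain arithmetic. The sketch below counts HBM reads for one 16×16 output tile of C = A @ B, comparing the naive per-thread access pattern with the tiled one from the diagram; the inner dimension K = 1024 is an assumption chosen only to make the numbers concrete.

```python
# Sketch (Python): counting HBM traffic for one 16x16 output tile of C = A @ B.
# K (the inner dimension) is assumed to be 1024 for illustration.
TILE = 16
K = 1024

# Naive: each of the 256 output elements reads its full row of A and column of B from HBM.
naive_reads = (TILE * TILE) * (K + K)            # 256 threads x (row + column)

# Tiled: the block steps through K in 16-wide chunks; per step it loads one 16x16
# tile of A and one of B into shared memory (512 floats), and all 256 threads
# reuse those values from SRAM.
steps = K // TILE
tiled_reads = steps * (2 * TILE * TILE)          # 64 steps x 512 floats

print(naive_reads, tiled_reads, naive_reads / tiled_reads)   # 524288 32768 16.0
```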
Each FlashAttention version has pushed the same core idea further:
| Version | Year | Key contribution | Hardware target |
| --- | --- | --- | --- |
| FA1 | 2022 | Tile + on-chip SRAM + online softmax. No N×N HBM round-trip; recompute on backward pass. | A100 / general |
| FA2 | 2023 | Better warp-level scheduling. Keeps tensor cores busy across the tile loop instead of idling between tiles. | A100 / H100 |
| FA3 | 2024 | Asynchronous TMA + warp specialization. Overlaps K/V loads with the matmul to hide memory latency. | H100 (Hopper) |
| FA4 | 2026 | Tighter K/V-load pipelining (closes FA3's bubbles) and first-class packed variable-length attention for batched serving. | H100 / B200 |
Consider a batch of 8 sequences with lengths ranging from 500 to 4000 tokens, averaging 2500, inside an 8K context window. In padded mode, every sequence is padded out to the full 8K window: 64K slots processed, only 20K of them real tokens. In packed mode, FA4 packs all 20K real tokens into one kernel call and processes exactly that many slots. The compute saved scales with the padding ratio: roughly 31% slot utilization in padded mode becomes 100% in packed mode, a meaningful throughput gain at no quality cost.
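As a sanity check on those numbers, the snippet below redoes the arithmetic; the individual sequence lengths are assumed for illustration (the text only gives the 500–4000 range and the 2500-token average).

```python
# Sketch (Python): padded-vs-packed utilization for the example batch.
lengths = [500, 1000, 1500, 2500, 3000, 3500, 4000, 4000]   # assumed lengths averaging 2500
window = 8192                                                # 8K context window

real_tokens = sum(lengths)                   # 20,000 real tokens
padded_slots = len(lengths) * window         # ~64K slots when every sequence is padded to 8K
packed_slots = real_tokens                   # packed mode processes exactly the real tokens

print(f"utilization padded: {real_tokens / padded_slots:.0%}")   # ~31%
print(f"utilization packed: {real_tokens / packed_slots:.0%}")   # 100%
```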