The news. In April 2026, vLLM v0.20.0 shipped with FlashAttention 4 as the new default attention backend, alongside CUDA 13.0, PyTorch 2.11, and day-0 support for DeepSeek V4 and Hunyuan v3. FA4's headline addition is packed variable-length attention: a single fused kernel that processes a whole batch of mixed-length sequences without padding them to the longest one. Read the release →
Picture eight checkout lines. Customers keep arriving, and each line ends up a different length. The old way of running attention pads every line out to the longest one with empty slots — the cashier still walks past every slot, doing nothing at the empty ones, just to keep the schedule uniform. That uniform-schedule trick is what makes batched GPU attention easy to pipeline, but it means one long sequence in a batch forces every short sequence to pretend it has the same length.
FA4 packing rewires the floor. All customers from all lines merge into one continuous express lane, ordered back-to-back, with block-diagonal dividers marking where one group ends and the next begins. The cashier — the attention kernel — walks the merged lane once, processes only real customers, and uses the dividers to make sure no group's tokens accidentally attend to another group's tokens. The fused kernel handles the whole batch in one launch.
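In code, the merged lane is just every real token concatenated, with a cumulative-length vector marking the dividers. The sketch below (PyTorch) builds both layouts for a toy batch; the cu_seqlens name and the block-diagonal mask construction follow the convention of the existing flash-attn variable-length kernels, and whether FA4 exposes exactly this interface is an assumption here.

```python
# Minimal sketch (PyTorch): padded vs. packed layouts for a mixed-length batch.
# cu_seqlens follows the flash-attn varlen convention; FA4's exact API may differ.
import torch

lengths = torch.tensor([3, 5, 2])                  # three sequences of different lengths
seqs = [torch.randn(n, 64) for n in lengths]       # per-sequence token embeddings (dim 64)

# Padded layout: every sequence stretched to the longest one (5 slots each).
padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)    # shape (3, 5, 64)

# Packed layout: all real tokens back-to-back, no padding slots.
packed = torch.cat(seqs, dim=0)                    # shape (10, 64): 3 + 5 + 2 tokens
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long),
                        lengths.cumsum(0)])        # tensor([0, 3, 8, 10]): the dividers

# The block-diagonal rule the kernel enforces: token i may attend to token j
# only when both fall between the same pair of dividers (same sequence).
seq_id = torch.repeat_interleave(torch.arange(len(lengths)), lengths)  # [0,0,0,1,1,1,1,1,2,2]
block_diag_mask = seq_id[:, None] == seq_id[None, :]                   # (10, 10) boolean

print(padded.shape, packed.shape, cu_seqlens.tolist())
```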
The underlying engine is still classic FlashAttention: compute attention tile by tile on-chip, never materialize the full N×N score matrix in HBM, fold each tile into a running result with the online softmax trick. FA4 keeps all of that and adds two things — tighter pipelining of K/V loads with the matmul (closing FA3's bubbles on Hopper) and first-class support for packed layouts so KV cache reads line up with the merged lane.
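The loop below is a minimal single-head PyTorch sketch of that tile-by-tile pass: it walks K/V in chunks, keeps only a running max, a running softmax denominator, and a running weighted sum of V, and never holds the full N×N score matrix. It is a reference illustration of the online softmax, not the fused kernel itself; the tile size of 128 is arbitrary.

```python
# Sketch (PyTorch, single head): streaming attention with the online softmax.
# The real kernel does this per tile in on-chip SRAM with fused matmuls.
import torch

def streaming_attention(q, k, v, tile=128):
    """q: (Nq, d), k/v: (Nk, d). Numerically matches softmax(q @ k.T / sqrt(d)) @ v."""
    d = q.shape[-1]
    scale = d ** -0.5
    out = torch.zeros_like(q)                              # running weighted sum of V
    row_max = torch.full((q.shape[0], 1), float("-inf"))   # running max per query row
    row_sum = torch.zeros(q.shape[0], 1)                   # running softmax denominator

    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        scores = (q @ k_t.T) * scale                       # only one tile of scores at a time
        tile_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, tile_max)
        # Rescale the previous accumulators to the new max, then fold in this tile.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_t
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1000, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(streaming_attention(q, k, v), ref, atol=1e-4)
```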
[Figure: tiling, load once into shared memory, reuse many times. A 16×16 tile of A plus a 16×16 tile of B are staged in shared memory (SRAM); 256 threads then read from SRAM, so HBM reads per step are 2 tiles (512 floats) rather than 256 full rows.]
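The payoff of that staging step is plain arithmetic. The sketch below counts HBM reads for one 16×16 output tile of C = A @ B, comparing the naive per-thread access pattern with the tiled one from the diagram; the inner dimension K = 1024 is an assumption chosen only to make the numbers concrete.

```python
# Sketch (Python): counting HBM traffic for one 16x16 output tile of C = A @ B.
# K (the inner dimension) is assumed to be 1024 for illustration.
TILE = 16
K = 1024

# Naive: each of the 256 output elements reads its full row of A and column of B from HBM.
naive_reads = (TILE * TILE) * (K + K)            # 256 threads x (row + column)

# Tiled: the block steps through K in 16-wide chunks; per step it loads one 16x16
# tile of A and one of B into shared memory (512 floats), and all 256 threads
# reuse those values from SRAM.
steps = K // TILE
tiled_reads = steps * (2 * TILE * TILE)          # 64 steps x 512 floats

print(naive_reads, tiled_reads, naive_reads / tiled_reads)   # 524288 32768 16.0
```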
Each FlashAttention version has pushed the same core idea further:
| Version | Year | Key contribution | Hardware target |
| --- | --- | --- | --- |
| FA1 | 2022 | Tile + on-chip SRAM + online softmax. No N×N HBM round-trip; recompute on backward pass. | A100 / general |
| FA2 | 2023 | Better warp-level scheduling. Keeps tensor cores busy across the tile loop instead of idling between tiles. | A100 / H100 |
| FA3 | 2024 | Asynchronous TMA + warp specialization. Overlaps K/V loads with the matmul to hide memory latency. | H100 (Hopper) |
| FA4 | 2026 | Tighter K/V-load pipelining (closes FA3's bubbles) and first-class packed variable-length attention for batched serving. | H100 / B200 |
Consider a batch of 8 sequences with lengths ranging from 500 to 4000 tokens, averaging 2500, inside an 8K context window. In padded mode, every sequence is padded out to the full 8K window: 64K slots processed, only 20K of them real tokens. In packed mode, FA4 packs all 20K real tokens into one kernel call and processes exactly that many slots. The compute saved scales with the padding ratio: roughly 31% slot utilization in padded mode becomes 100% in packed mode, a meaningful throughput gain at no quality cost.
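As a sanity check on those numbers, the snippet below redoes the arithmetic; the individual sequence lengths are assumed for illustration (the text only gives the 500–4000 range and the 2500-token average).

```python
# Sketch (Python): padded-vs-packed utilization for the example batch.
lengths = [500, 1000, 1500, 2500, 3000, 3500, 4000, 4000]   # assumed lengths averaging 2500
window = 8192                                                # 8K context window

real_tokens = sum(lengths)                   # 20,000 real tokens
padded_slots = len(lengths) * window         # ~64K slots when every sequence is padded to 8K
packed_slots = real_tokens                   # packed mode processes exactly the real tokens

print(f"utilization padded: {real_tokens / padded_slots:.0%}")   # ~31%
print(f"utilization packed: {real_tokens / packed_slots:.0%}")   # 100%
```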