ThriftAttention — top-5% in FP16, rest in FP4

[Figure: ThriftAttention, top 5% of QK blocks in FP16, rest in FP4. Queries Q and keys K (long context) form an 8×8 grid of 64 attention blocks. A cheap importance heuristic scores each QK block: the top 5% go to FP16 (high precision) and the remaining ~95% go to FP4 (low precision). The FP16 and FP4 partial results are combined by an online-softmax merge into the attention output. Quality-gap bar (FP4 → FP16 ceiling): all-FP4 baseline at 0%, ThriftAttention at 89.1% of the gap recovered with only 5% of blocks in FP16.]
learnaivisually.com/ai-explained/thriftattention-importance-aware-fp4

The news. On May 23, 2026, the ThriftAttention paper landed on arXiv. It targets long-context inference on Blackwell GPUs and reports that, on average across the evaluated benchmarks, the method recovers 89.1% of the FP4 → FP16 long-context quality gap while computing only ~5% of QK blocks in FP16. The headline claim is that the advantage grows with sequence length — the regime where pure-FP4 attention falls off hardest.

Picture a copy editor handed a manuscript and asked to do two passes. They could grind every sentence with the same expensive gold pen — accurate, but slow and costly. They could blast through everything with a cheap black ballpoint — fast and cheap, but the headline lines that carry the meaning come out smudged. Or — the ThriftAttention move — they could glance at the page, mark the ~5% of lines that carry the most weight with the gold pen, and ballpoint the rest. The final page has both pens visible on it, but almost all the speed is the ballpoint's and almost all the precision is the gold pen's.

The mechanism translates directly. ThriftAttention runs in two stages inside a single fused FlashAttention-style kernel. Stage one is a cheap importance ranking: a coarse summary of the QK product picks the top ~5% of QK blocks, the blocks that would hurt the final attention output the most if computed sloppily. Stage two runs the actual attention math, computing each block once at the precision it was assigned. The marked top 5% gets full FP16 treatment; the remaining ~95% runs in FP4 on the Blackwell tensor cores. Both sets of partial outputs then flow into an online-softmax accumulator (the same streaming pattern FlashAttention uses), which merges them into a single attention output. The merge itself adds no approximation: the result equals a single softmax over the full row of scores, with each block's scores at whatever precision that block was computed in, and the full N×N score matrix is never materialized.
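The merge step can be illustrated with a minimal pure-Python sketch for one query row. This is not the paper's kernel: `fake_quantize` is a hypothetical stand-in for FP4 rounding (it snaps scores to a coarse grid, whereas a real kernel quantizes the Q and K operands), each key carries a single scalar value instead of a vector, and all function names are illustrative. What it does show is that merging per-block partials gives the same answer as one softmax over the whole mixed-precision row.

```python
import math

def fake_quantize(x, step=0.5):
    """Hypothetical stand-in for FP4 rounding: snap a score to a coarse grid."""
    return round(x / step) * step

def block_partial(scores, values):
    """Softmax partial for one block: (block max, exp-sum, weighted value sum)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return m, sum(weights), sum(w * v for w, v in zip(weights, values))

def merge(a, b):
    """Online-softmax merge: rescale both partials to a shared max, then add."""
    m = max(a[0], b[0])
    scale_a, scale_b = math.exp(a[0] - m), math.exp(b[0] - m)
    return m, a[1] * scale_a + b[1] * scale_b, a[2] * scale_a + b[2] * scale_b

def mixed_precision_row(score_blocks, value_blocks, fp16_blocks):
    """Attention output for one query row; blocks not in fp16_blocks are quantized."""
    acc = None
    for i, (scores, values) in enumerate(zip(score_blocks, value_blocks)):
        if i not in fp16_blocks:
            scores = [fake_quantize(s) for s in scores]
        part = block_partial(scores, values)
        acc = part if acc is None else merge(acc, part)
    return acc[2] / acc[1]

def reference_row(scores, values):
    """Plain softmax-weighted average over the whole row at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

score_blocks = [[0.1, 0.3], [2.7, 3.1], [-0.4, 0.2], [0.9, 1.1]]
value_blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
fp16_blocks = {1}  # pretend the ranking marked block 1 as important

mixed = mixed_precision_row(score_blocks, value_blocks, fp16_blocks)
flat_scores = [s if i in fp16_blocks else fake_quantize(s)
               for i, blk in enumerate(score_blocks) for s in blk]
flat_values = [v for blk in value_blocks for v in blk]
print(abs(mixed - reference_row(flat_scores, flat_values)) < 1e-12)  # True
```

The only error in the output comes from the quantized blocks themselves; the streaming merge contributes nothing beyond floating-point rounding.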

What's genuinely new is the axis of mixed precision. Classical mixed-precision schemes split precision across layers ("FFN in FP8, attention in FP16") or across operations ("weights in FP4, activations in BF16" — see the LongLive 2.0 W4A4 explainer for that pattern, and Mix-Quant's prefill BF16 / decode NVFP4 for splitting precision across the prefill/decode phase). ThriftAttention splits precision along a third axis: along the sequence dimension, block by block. Two QK blocks in the same attention call, on the same layer, on the same head run at different precisions. The importance heuristic decides the split per-token-chunk-pair, not per-tensor.

The cost of the heuristic itself is the load-bearing detail. If picking the top-5% blocks costs more than the savings from running 95% of them in FP4, the scheme loses. The paper's claim is that the importance score is a coarse FP4-or-cheaper summary of the QK product — cheaper than computing a single FP16 block — so the ranking pays for itself once per attention call. The same FlashAttention I/O-aware design (one HBM pass, blocks loaded into SRAM, online softmax streaming) holds; the extra work is the ranking pass plus a small bookkeeping overhead to route blocks to the right precision pipeline. FP4 dominates the arithmetic budget for 95% of blocks, so kernel-level throughput stays close to a pure-FP4 baseline.
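To make the cost argument concrete, here is a minimal sketch of one way a coarse ranking could work. The text does not spell out the paper's actual heuristic, so everything here is an assumption: the sketch mean-pools each block of queries and keys into a single summary vector and scores a block pair by the dot product of its two summaries. The names `mean_pool`, `split_blocks`, and `rank_blocks` are hypothetical.

```python
def mean_pool(rows):
    """Average a block of vectors into one summary vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def split_blocks(rows, block):
    """Cut a list of vectors into consecutive blocks of `block` rows."""
    return [rows[i:i + block] for i in range(0, len(rows), block)]

def rank_blocks(q, k, block, budget=0.05):
    """Return the (q_block, k_block) index pairs selected for FP16.

    Assumed heuristic: score each pair by pooled_q . pooled_k and keep
    the highest-scoring `budget` fraction (at least one pair).
    """
    q_summaries = [mean_pool(b) for b in split_blocks(q, block)]
    k_summaries = [mean_pool(b) for b in split_blocks(k, block)]
    scored = [
        (sum(a * b for a, b in zip(qs, ks)), i, j)
        for i, qs in enumerate(q_summaries)
        for j, ks in enumerate(k_summaries)
    ]
    keep = max(1, round(budget * len(scored)))
    scored.sort(reverse=True)
    return {(i, j) for _, i, j in scored[:keep]}

# Toy input: 4 queries and 4 keys of dimension 2, block size 2.
q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
k = [[5.0, 0.0], [5.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(rank_blocks(q, k, block=2, budget=0.25))  # {(0, 0)}
```

Under this assumed scheme the ranking does one dot product per block pair rather than one per query-key pair, so with 64×64 blocks it performs 64² = 4,096 times fewer score dot products than the full QK product, not counting the pooling pass.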

Where ThriftAttention earns its keep

Hold three variables fixed and walk the arithmetic: one model, one Blackwell B200 processing a 32K-token prompt at prefill, and one attention head. The attention call processes roughly (32K / 64)² = 512² ≈ 262,000 QK blocks at a 64×64 block size (illustrative: the exact count depends on the kernel's tiling configuration and on causal masking).

Pure-FP16 attention runs all 262K blocks at full precision. Call that quality 100% and cost 1.00× a normalized FP16 unit. Pure-FP4 attention runs all 262K blocks in FP4: bandwidth and tensor-core throughput go up by roughly 4× (illustrative: depends on the kernel's memory-vs-compute mix), but long-context quality sits at the bottom of the scale. By definition the all-FP4 baseline is 0% of the gap recovered, the "noticeably degraded" end.

With ThriftAttention, the top ~13,100 blocks (5% of 262K) run in FP16; the remaining ~249,000 run in FP4. Cost is dominated by the FP4 95%, so throughput stays near pure FP4: roughly 3.6×–3.8× FP16-equivalent at the kernel level (setup-dependent). Quality lands at 89.1% of the gap recovered, which leaves about 11% of the gap to the pure-FP16 ceiling instead of the full gap the all-FP4 baseline leaves. The headline win is most of the FP4 speed at most of the FP16 quality, with a knob (the 5% budget) to slide.
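The block counts above can be reproduced with a few lines of Python. The speedup function is a deliberately naive cost model and an assumption of this sketch, not something taken from the paper: it treats every block as compute-bound, charges an FP16 block one unit and an FP4 block `1 / fp4_speedup` units, and lets the ranking overhead default to zero.

```python
def block_budget(seq_len=32 * 1024, block=64, fp16_fraction=0.05):
    """Return (total, fp16, fp4) QK block counts for a square attention call."""
    per_side = seq_len // block
    total = per_side * per_side
    fp16 = round(total * fp16_fraction)
    return total, fp16, total - fp16

def blended_speedup(fp16_fraction, fp4_speedup, ranking_overhead=0.0):
    """Speedup over all-FP16, with cost measured in FP16-block units."""
    cost = fp16_fraction + (1.0 - fp16_fraction) / fp4_speedup + ranking_overhead
    return 1.0 / cost

total, fp16, fp4 = block_budget()
print(total, fp16, fp4)                       # 262144 13107 249037
print(round(blended_speedup(0.05, 4.0), 2))   # 3.48
```

Under this simple model a 4× FP4 speedup with a 5% FP16 budget gives about 3.5×, a little below the illustrative 3.6×–3.8× range quoted above. The model ignores memory-bound effects, so both figures are best read as rough, setup-dependent estimates.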

[Figure: standard attention vs FlashAttention. Standard attention loads all data from slow HBM into fast SRAM and writes it back, making multiple round trips for all data. FlashAttention streams one K/V tile at a time (K₁V₁ through K₄V₄) from HBM into SRAM, computes, and moves to the next tile: same result, fewer memory round trips.]
Setting | Quality (gap recovered) | Throughput (vs FP16) | Comment
--- | --- | --- | ---
Pure FP16 attention | 100% (the ceiling) | 1.0× | baseline; long-context-clean; expensive (per Peng et al.)
Pure FP4 attention | 0% (the baseline the gap is measured from) | ~3.8–4.2× (illustrative) | fast but long-context quality falls off (setup-dependent)
ThriftAttention (top-5% FP16, rest FP4) | 89.1% recovered (on average) | ~3.6–3.8× (illustrative) | most of the FP4 speed at most of the FP16 quality (reported)
Top-k FP16 budget (k slider) | monotonic in k (higher k, higher quality) | monotonic in k (higher k, lower throughput) | ~5% is the sweet spot the paper reports; k is a knob (reported)

A small caveat. ThriftAttention is an inference-side claim — the paper evaluates it as a drop-in attention kernel replacement at inference, not as a training recipe. Whether the same importance-aware split survives the gradient pass is an open question; the authors are careful to scope the claim to inference. The other open question is how the win scales below long context. The headline 89.1% number is averaged across the evaluated benchmarks; at short context the FP4 gap is small to begin with, so the absolute recovery from the FP16 top-5% is smaller too — the technique earns its keep specifically where pure-FP4 attention hurts most. The deeper lesson is that the "axis" of mixed precision is no longer fixed at the layer or operation boundary. Once a kernel can route precision per-block, importance — the cheap heuristic that picks which blocks are which — becomes a first-class kernel-design knob next to block size and tiling.
