What is ThriftAttention?

ThriftAttention is a long-context attention kernel from Peng et al. (May 2026) that runs the top ~5% of QK attention blocks (selected by a cheap importance heuristic) in FP16, the remaining ~95% in FP4, then merges both partials through an online-softmax accumulator. It claims to recover, on average, 89.1% of the FP4→FP16 long-context quality gap — close to FP16 quality at close to FP4 throughput. It targets the Blackwell tensor-core FP4 pipeline.

Why is mixed-precision attention along the sequence dimension new?

Classical mixed-precision schemes split precision across layers ("attention in FP16, FFN in FP8") or across operations ("weights in FP4, activations in BF16"). ThriftAttention splits precision per-QK-block — two blocks in the same attention call, on the same head, can run at different precisions depending on a cheap importance score. That makes "importance" a first-class kernel-design knob alongside block size and tiling, and lets the speed/quality tradeoff slide on the FP16 budget (the "5%") rather than on a coarser boundary.

How does ThriftAttention compare to block-sparse attention like BigBird or Longformer?

Block-sparse approaches skip low-importance blocks entirely — the model sees a sparse subset of the attention pattern, and the dropped blocks contribute zero to the output. ThriftAttention computes every QK block, but in FP4 for the unimportant ones, so the full attention pattern is preserved and only the precision is uneven. That makes it a softer intervention than sparsity (which can change model behavior on adversarially-placed long-range dependencies) and complementary in principle: a sparse selection could be combined with a per-block FP4/FP16 split inside the kept blocks.
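The "complementary in principle" idea can be pictured as a three-way plan per block. The sketch below is a hypothetical illustration, not something the paper describes: `plan_blocks`, `keep_fraction`, and the score list are made-up names, and the scores stand in for whatever importance heuristic a kernel would use.

```python
def plan_blocks(scores, keep_fraction=0.5, fp16_fraction=0.05):
    """Assign each block 'skip', 'fp4', or 'fp16' from a list of importance scores.

    Hypothetical helper: keep_fraction < 1.0 models block sparsity,
    keep_fraction = 1.0 models a ThriftAttention-style plan with no skipping.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_keep = max(1, round(keep_fraction * len(scores)))
    n_fp16 = max(1, round(fp16_fraction * len(scores)))
    plan = ["skip"] * len(scores)
    for rank, idx in enumerate(order[:n_keep]):
        # The highest-ranked kept blocks get FP16; the rest of the kept set gets FP4.
        plan[idx] = "fp16" if rank < n_fp16 else "fp4"
    return plan

# Made-up importance scores for 20 blocks.
scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.3, 0.7, 0.2, 0.6, 0.5,
          0.15, 0.95, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.02]
combined = plan_blocks(scores)                         # sparsity plus mixed precision
thrift_style = plan_blocks(scores, keep_fraction=1.0)  # every block is computed
print(combined.count("fp16"), combined.count("fp4"), combined.count("skip"))
print(thrift_style.count("skip"))
```

With `keep_fraction=1.0` no block is dropped, which is the property that separates the mixed-precision approach from sparsity: every block still contributes to the output, only at different precision.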

ThriftAttention paper — Importance-aware FP16/FP4 mixed-precision attention

ThriftAttention — top-5% in FP16, rest in FP4


learnaivisually.com/ai-explained/thriftattention-importance-aware-fp4

TL;DR

What is it: The ThriftAttention paper proposes a long-context attention kernel that runs the top ~5% of QK attention blocks (selected by a cheap importance heuristic) in FP16, the remaining ~95% in FP4, then recombines both partials with an online-softmax merge. Effectively a per-block mixed-precision attention.
Why it’s needed: FP4 attention is the bandwidth-and-energy prize on Blackwell tensor cores, but pure-FP4 attention degrades fast at long context, which is exactly the regime teams want to push into. A scheme that keeps most of the FP4 efficiency while paying for FP16 only on the few blocks that matter directly attacks the worst case of long-context inference.
vs previous: Plain FP4 attention applies one precision uniformly across all QK blocks and loses accuracy at long context. Block-sparse approaches (e.g. BigBird, Longformer) skip low-importance blocks; ThriftAttention computes every block, but in FP4 for the unimportant ones, so the model still sees the full attention pattern, just at uneven precision rather than at uneven sparsity.

Jargon

FP4 / FP16: 4-bit floating point and 16-bit floating point, the number formats used to store and multiply attention weights and activations. FP4 uses 4× fewer bits per number than FP16 (4 vs. 16), so more numbers fit in the same amount of memory and tensor-core throughput is much higher; the cost is much coarser precision. See Tensor Cores — Precision Formats.
QK block: A small rectangle of the Q × K attention score matrix — typically 64×64 or 128×128. FlashAttention-style kernels compute attention block-by-block so they never materialize the full N×N matrix in HBM; ThriftAttention works at the same block granularity.
FP4 quality gap: The accuracy drop you see when an FP4 attention kernel replaces an FP16 one, evaluated on a downstream task. The gap is small for short context but widens at long context — this is the gap ThriftAttention claims to close 89.1% of.
Importance score / heuristic: A cheap-to-compute proxy that ranks QK blocks by how much they're likely to contribute to the final attention output, run before the precision decision. The paper uses a coarse summary of the QK product itself — cheaper than computing the full block at FP16 — to pick the top ~5%.
Online softmax: A streaming reformulation of softmax that lets a kernel accumulate partial attention outputs one block at a time without ever holding the full score matrix in memory. ThriftAttention's two streams (FP16 partials + FP4 partials) are recombined through this same accumulator. The merge itself is mathematically exact: it rescales and sums the partials, so the only approximation in the output is the rounding already incurred inside each block.
Mixed precision: Running different parts of a computation at different numerical precisions for speed/quality tradeoffs. The classical version varies precision across layers or across operations; ThriftAttention's twist is varying precision within a single attention call, block by block, along the sequence dimension. See Mixed Precision on Tensor Cores.
Blackwell: NVIDIA's current-generation GPU architecture (B100/B200, GB200 NVL72), which adds first-class FP4 tensor-core support. Most "FP4 attention" papers — ThriftAttention included — target the Blackwell tensor-core pipeline.
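To make the coarseness of FP4 concrete, here is a minimal sketch of rounding a handful of values onto a 4-bit float grid. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and a single absmax scale per block; the source does not say which FP4 variant or scaling scheme the paper uses, and `quantize_fp4` is a hypothetical helper name.

```python
# Magnitudes representable by an E2M1 4-bit float (assumed format).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(values):
    """Fake-quantize a list of floats to the E2M1 grid using one absmax scale."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / FP4_E2M1_MAGNITUDES[-1]  # largest input maps to the top grid point
    out = []
    for v in values:
        # Round the scaled magnitude to the nearest representable grid point.
        mag = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(abs(v) / scale - g))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out

block = [0.03, -0.41, 0.77, 1.20, -2.95, 0.10]
quantized = quantize_fp4(block)
print(quantized)
```

Every output is one of only eight magnitudes times the shared scale, so small values next to a large one collapse to zero. That loss is what the FP16 path is meant to avoid on the blocks that matter.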

The news. On May 23, 2026, the ThriftAttention paper landed on arXiv. It targets long-context inference on Blackwell GPUs and reports that, on average across the evaluated benchmarks, the method recovers 89.1% of the FP4 → FP16 long-context quality gap while computing only ~5% of QK blocks in FP16. The headline claim is that the advantage grows with sequence length — the regime where pure-FP4 attention falls off hardest.

Picture a copy editor handed a manuscript and asked to do two passes. They could grind every sentence with the same expensive gold pen — accurate, but slow and costly. They could blast through everything with a cheap black ballpoint — fast and cheap, but the headline lines that carry the meaning come out smudged. Or — the ThriftAttention move — they could glance at the page, mark the ~5% of lines that carry the most weight with the gold pen, and ballpoint the rest. The final page has both pens visible on it, but almost all the speed is the ballpoint's and almost all the precision is the gold pen's.

The mechanism translates directly. ThriftAttention runs in two stages inside a single fused FlashAttention-style kernel. Stage one is a cheap importance ranking: a coarse summary of the QK product picks the top ~5% of QK blocks, the ones that would hurt the final attention output most if computed sloppily. Stage two runs the actual attention math once per block, at one of two precisions: the marked top 5% gets full FP16 treatment, and the remaining ~95% runs in FP4 on the Blackwell tensor cores. Both sets of partial outputs then flow into an online-softmax accumulator (the same streaming pattern FlashAttention uses), which merges them into a single attention output. The merge is mathematically exact, so the only approximation in the result is the FP4 rounding inside the low-importance blocks, and the full N×N score matrix is never materialized.
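The merge step can be sketched in plain Python for a single query row. This is an illustration under stated assumptions, not the paper's kernel: `fake_quant` is a stand-in for FP4 rounding (an 8-magnitude grid with one absmax scale), only Q and K are quantized on the low-precision path, and all function names are hypothetical.

```python
import math
import random

def fake_quant(vec, levels=(0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)):
    """Stand-in for FP4: snap each value to a scaled 8-magnitude grid."""
    peak = max(abs(x) for x in vec) or 1.0
    scale = peak / levels[-1]
    return [math.copysign(min(levels, key=lambda g: abs(abs(x) / scale - g)) * scale, x)
            for x in vec]

def block_partial(q, keys, values, low_precision):
    """Return (block max, softmax denominator, unnormalized output) for one key block."""
    if low_precision:
        q = fake_quant(q)
        keys = [fake_quant(k) for k in keys]
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return m, sum(weights), out

def merge(a, b):
    """Online-softmax merge of two partials: rescale both to a shared max, then add."""
    m = max(a[0], b[0])
    ca, cb = math.exp(a[0] - m), math.exp(b[0] - m)
    return m, ca * a[1] + cb * b[1], [ca * x + cb * y for x, y in zip(a[2], b[2])]

def attention_row(q, key_blocks, value_blocks, fp16_blocks):
    """Attend one query row over all key blocks; indices in fp16_blocks skip quantization."""
    acc = None
    for i, (kb, vb) in enumerate(zip(key_blocks, value_blocks)):
        part = block_partial(q, kb, vb, low_precision=i not in fp16_blocks)
        acc = part if acc is None else merge(acc, part)
    return [x / acc[1] for x in acc[2]]

random.seed(0)
dim, block, n_blocks = 8, 4, 6
q = [random.gauss(0, 1) for _ in range(dim)]
key_blocks = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(block)]
              for _ in range(n_blocks)]
value_blocks = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(block)]
                for _ in range(n_blocks)]

full = attention_row(q, key_blocks, value_blocks, fp16_blocks=set(range(n_blocks)))
mixed = attention_row(q, key_blocks, value_blocks, fp16_blocks={0})
print(max(abs(a - b) for a, b in zip(full, mixed)))  # error from the quantized blocks only
```

When every block takes the full-precision path, `attention_row` reproduces an ordinary softmax over all keys, which is the sense in which the merge adds no error of its own. Any gap between `full` and `mixed` comes from the quantized blocks.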

What's genuinely new is the axis of mixed precision. Classical mixed-precision schemes split precision across layers ("FFN in FP8, attention in FP16") or across operations ("weights in FP4, activations in BF16"; see the LongLive 2.0 W4A4 explainer for that pattern, and Mix-Quant's prefill BF16 / decode NVFP4 for splitting precision across the prefill/decode phase). ThriftAttention splits precision along a third axis: the sequence dimension, block by block. Two QK blocks in the same attention call, on the same layer and the same head, can run at different precisions. The importance heuristic decides the split per pair of token chunks (one query chunk, one key chunk), not per tensor.

The cost of the heuristic itself is the load-bearing detail. If picking the top 5% of blocks costs more than running the other 95% in FP4 saves, the scheme loses. The paper's claim is that the importance score is a coarse FP4-or-cheaper summary of the QK product, cheaper per block than computing that block in FP16, so the ranking pass pays for itself within each attention call. The FlashAttention I/O-aware design (one HBM pass, blocks loaded into SRAM, online softmax streaming) still holds; the extra work is the ranking pass plus a small bookkeeping overhead to route blocks to the right precision pipeline. FP4 handles the arithmetic for 95% of blocks, so kernel-level throughput stays close to a pure-FP4 baseline.
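One way a "coarse summary of the QK product" could look is to mean-pool each query block and each key block into a single vector and score the pair with one dot product. This is an assumed stand-in for illustration only; the source does not spell out the paper's actual heuristic, and `mean_pool` and `rank_blocks` are hypothetical names.

```python
import random

def mean_pool(rows):
    """Average a block of vectors into one summary vector."""
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]

def rank_blocks(q_blocks, k_blocks, fp16_fraction=0.05):
    """Score every (query block, key block) pair from pooled summaries and
    return the set of pairs routed to the high-precision path."""
    q_summ = [mean_pool(b) for b in q_blocks]
    k_summ = [mean_pool(b) for b in k_blocks]
    scored = [
        (sum(a * b for a, b in zip(qs, ks)), (i, j))
        for i, qs in enumerate(q_summ)
        for j, ks in enumerate(k_summ)
    ]
    budget = max(1, round(fp16_fraction * len(scored)))
    scored.sort(reverse=True)  # highest pooled score first
    return {pair for _, pair in scored[:budget]}

random.seed(0)
dim, block, n = 16, 8, 20
q_blocks = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(block)]
            for _ in range(n)]
k_blocks = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(block)]
            for _ in range(n)]
fp16_pairs = rank_blocks(q_blocks, k_blocks)
print(len(fp16_pairs), "of", n * n, "block pairs routed to FP16")
```

Under this assumption the ranking costs one dot product per block pair, against 64 × 64 dot products for a full 64×64 tile, which is the kind of margin the scheme needs for the ranking pass to stay cheaper than the work it saves.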

Where ThriftAttention earns its keep

Hold three variables fixed and walk the arithmetic: one model, one Blackwell B200, and one attention head processing a 32K-token prompt during prefill. At a 64×64 block size the attention call covers (32,768 / 64)² = 512² = 262,144 QK blocks (illustrative: the count ignores causal masking, and the exact figure depends on the kernel's tiling configuration). Pure-FP16 attention runs all ~262K blocks at full precision; call that quality 100% and cost 1.00× a normalized FP16 unit. Pure-FP4 attention runs all ~262K blocks in FP4: bandwidth and tensor-core throughput rise by roughly 4× (illustrative, depending on the kernel's memory-vs-compute mix), but long-context quality falls to the bottom of the FP4 → FP16 gap, which is 0% recovered by definition. With ThriftAttention, the top ~13,100 blocks (5% of 262K) run in FP16 and the remaining ~249,000 run in FP4. Cost is dominated by the FP4 95%, so throughput stays near pure FP4: weighting the two shares gives 1 / (0.05 + 0.95 / 4) ≈ 3.5× FP16-equivalent at the kernel level, before ranking overhead (setup-dependent). Quality lands at 89.1% of the gap recovered on average, leaving about 11% of the gap unrecovered instead of the full gap the all-FP4 baseline gives up. The headline win is most of the FP4 speed at most of the FP16 quality, with a knob (the 5% budget) to slide.

Setting	Quality (gap recovered)	Throughput (vs FP16)	Comment
Pure FP16 attention	100% (the ceiling)	1.0×	baseline; clean at long context; expensive (per Peng et al.)
Pure FP4 attention	0% (the floor, by definition)	~3.8–4.2× (illustrative)	fast but long-context quality falls off (setup-dependent)
ThriftAttention (top-5% FP16, rest FP4)	89.1% recovered (on average)	~3.3–3.6× (illustrative, before ranking overhead)	most of the FP4 speed at most of the FP16 quality (reported)
Top-k FP16 budget (k slider)	rises with k	falls with k	~5% is the sweet spot the paper reports; k is a knob (reported)
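The block counts and the throughput estimate can be reproduced with a few lines of arithmetic. The sketch assumes a simple weighted-cost model in which an FP4 block costs 1/`fp4_speedup` of an FP16 block and ranking and routing overhead are ignored; `block_budget` and its parameters are illustrative names, and the 4× speedup is an assumed input rather than a measured figure.

```python
def block_budget(seq_len=32 * 1024, block=64, fp16_fraction=0.05, fp4_speedup=4.0):
    """Count QK blocks and estimate speedup over pure FP16 under a weighted-cost model."""
    per_side = seq_len // block
    total = per_side * per_side          # non-causal count, for simplicity
    fp16_blocks = round(fp16_fraction * total)
    fp4_blocks = total - fp16_blocks
    # Cost in units of "one FP16 block"; ranking and routing overhead are ignored.
    cost = fp16_blocks + fp4_blocks / fp4_speedup
    return total, fp16_blocks, fp4_blocks, total / cost

total, fp16_blocks, fp4_blocks, speedup = block_budget()
print(total, fp16_blocks, fp4_blocks, round(speedup, 2))
```

Raising `fp16_fraction` moves the estimate toward 1.0× and lowering it moves toward the pure-FP4 figure, which is the slider the last table row describes. Real kernels are partly memory-bound, so treat the output as a rough estimate.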

A small caveat. ThriftAttention is an inference-side claim — the paper evaluates it as a drop-in attention kernel replacement at inference, not as a training recipe. Whether the same importance-aware split survives the gradient pass is an open question; the authors are careful to scope the claim to inference. The other open question is how the win scales below long context. The headline 89.1% number is averaged across the evaluated benchmarks; at short context the FP4 gap is small to begin with, so the absolute recovery from the FP16 top-5% is smaller too — the technique earns its keep specifically where pure-FP4 attention hurts most. The deeper lesson is that the "axis" of mixed precision is no longer fixed at the layer or operation boundary. Once a kernel can route precision per-block, importance — the cheap heuristic that picks which blocks are which — becomes a first-class kernel-design knob next to block size and tiling.

Goes deeper in: LLM Internals → Attention → Output

Related explainers

vLLM v0.20 — FlashAttention 4 packing — the FlashAttention-4 substrate ThriftAttention sits on top of; FA4 made variable-length attention I/O-efficient, ThriftAttention adds the precision split.
Mix-Quant — NVFP4 prefill, BF16 decode — splits precision across the prefill/decode phase boundary; ThriftAttention splits along the sequence dimension instead.
LongLive 2.0 — NVFP4 W4A4 training + inference — pushes FP4 into training; complementary to ThriftAttention's inference-only block-level scheme.
I/O-optimal approximate attention — another "long-context, but cheaper" line of work, attacking the I/O axis rather than the precision axis.

