The news. On June 6, 2026, the Chiaroscuro Attention paper (arXiv:2606.08327) introduced CHIAR-Former, a 4-layer hybrid transformer that routes each token to one of three operators (DCT spectral mixing, RBF kernel mixing, or full self-attention) based on its spectral entropy. On WikiText-103 it reports a validation perplexity of 36.54, against 66.62 for the full-attention baseline, with 62.5% fewer attention FLOPs and a 45% speedup. The paper also notes that full attention still wins on small or synthetic datasets.

Picture the security line at an airport. Every traveler walks up to the same checkpoint, but they don't all need the same screening. A quick scan decides each person's lane: most are cleared into the PreCheck fast lane and through in seconds, and only a flagged few are pulled aside for the slow, thorough manual screening. The line moves fast because the expensive screening is rationed to the few who get flagged, not spent on everyone.

That rationing is exactly what Chiaroscuro does to self-attention. Ordinary attention is the slow manual screening applied to every token: each one is compared against every other, so the work grows with the square of the sequence length. Chiaroscuro adds a quick scan — each token's spectral entropy, a cheap read of how its signal spreads across frequencies — and uses that score to route. The paper finds that most natural-language tokens suit the cheap DCT spectral mixing lane, which blends features in the frequency domain (the same transform behind JPEG) with no token-to-token comparison; only a minority fall through to full attention scores. Because the cheap lane catches the bulk of the sequence, most tokens never pay the n² cost.
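To make the quick scan concrete, here is a minimal sketch of a spectral-entropy router. It is not the paper's implementation: this article does not give the exact definition, so the sketch assumes the entropy is taken over the DCT-II power spectrum of a token's feature vector and normalized to the range 0 to 1. The `threshold` value, the function names, and the rule that low entropy goes to the cheap lane are all assumptions made for illustration.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a list of floats (O(d^2), fine for a sketch)."""
    d = len(x)
    return [
        sum(x[n] * math.cos(math.pi * (n + 0.5) * k / d) for n in range(d))
        for k in range(d)
    ]

def spectral_entropy(x):
    """Shannon entropy of the normalized DCT power spectrum, scaled to [0, 1]."""
    if len(x) < 2:
        return 0.0
    power = [c * c for c in dct2(x)]
    total = sum(power)
    if total == 0.0:
        return 0.0
    probs = [p / total for p in power]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(x))  # divide by the maximum possible entropy

def route(x, threshold=0.5):
    # ASSUMPTION: low entropy -> cheap DCT lane, high entropy -> full attention.
    # Both the direction and the threshold are hypothetical, not from the paper.
    return "dct" if spectral_entropy(x) < threshold else "attention"

smooth = [1.0] * 8            # energy concentrated in one frequency
spiky = [1.0] + [0.0] * 7     # energy spread across many frequencies
print(route(smooth), route(spiky))
```

The smooth vector concentrates its energy in a single DCT coefficient, so its entropy is near zero; the impulse spreads energy across the spectrum, so its entropy is high. A trained router would learn its decision boundary rather than use a fixed cutoff.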

[Diagram: the token embedding vector is multiplied by W_Q, W_K, and W_V to produce the query Q, the key K, and the value V.]
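The projections in the diagram, and the all-pairs comparison that makes attention expensive, can be sketched in a few lines. This is a toy illustration with hand-picked 2×2 weight matrices (all values are made up), and it stops at the raw Q·K scores: the usual scaling and softmax steps are omitted.

```python
def vecmat(x, W):
    """Row vector x times matrix W (the 'x × W' of the diagram)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Three toy token embeddings and made-up projection weights.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]

Q = [vecmat(t, W_Q) for t in tokens]
K = [vecmat(t, W_K) for t in tokens]

# One score per (query, key) pair: n tokens produce an n x n grid.
scores = [[dot(q, k) for k in K] for q in Q]
n_scores = sum(len(row) for row in scores)
print(n_scores)  # 9 scores for 3 tokens
```

Three tokens need 9 scores and six tokens would need 36, which is the quadratic growth the routing is designed to avoid.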

There's an honest wrinkle the paper doesn't hide. The router started with three lanes — DCT, an RBF kernel path, and full attention — but during training it quietly stopped sending anything to the RBF lane. The authors call this routing collapse: a learned router abandoning an option it decides it doesn't need, leaving a leaner DCT-plus-attention model. It's a useful reminder that a router is trained, not hand-wired — and sometimes it prunes your design for you. The result is still a mixed-operator transformer: different tokens, different operators, same layer.
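One simple way to notice routing collapse is to track what share of tokens each lane receives. The sketch below is an assumption about how one might monitor this, not the authors' tooling: the lane names, the `min_share` cutoff, and the sample assignments are all invented for illustration.

```python
from collections import Counter

LANES = ("dct", "rbf", "attention")

def lane_usage(assignments):
    """Fraction of tokens routed to each lane."""
    total = len(assignments)
    if total == 0:
        return {lane: 0.0 for lane in LANES}
    counts = Counter(assignments)
    return {lane: counts[lane] / total for lane in LANES}

def collapsed_lanes(assignments, min_share=0.01):
    """Lanes whose share of tokens falls below a hypothetical cutoff."""
    usage = lane_usage(assignments)
    return [lane for lane, share in usage.items() if share < min_share]

# Made-up routing decisions for 100 tokens; not data from the paper.
assignments = ["dct"] * 70 + ["attention"] * 30
print(lane_usage(assignments))
print(collapsed_lanes(assignments))  # ['rbf']
```

Logging these shares during training would show a lane's usage draining toward zero well before the final model is evaluated.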

Where the FLOPs actually go

Hold one attention layer fixed and count. Full attention pays the expensive Q·K score for every token, so call that 100% of the attention work. Chiaroscuro's reported 62.5% cut means only 37.5% of that attention work survives (100% − 62.5%). Roughly speaking, only a minority of tokens still take the slow lane and the rest are rerouted to the cheap DCT path; the exact token split depends on the setup, so read that picture as illustrative. The DCT lane isn't free, but it has no n² term: it costs a fixed amount per token instead of growing with the sequence, so moving the bulk of tokens off attention is where the 45% wall-clock speedup comes from. The win isn't a cheaper attention kernel; it's doing full attention far less often.
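The counting above can be worked through as a back-of-the-envelope calculation. The sketch assumes that a routed token still scores against every key in the sequence, and it uses hypothetical sizes (`n = 1024` tokens, `d = 64` dimensions). The paper's own FLOP accounting may differ; only the 62.5% figure comes from the reported results.

```python
def score_flops(n_queries, n_keys, d):
    """Approximate FLOPs for Q·K scores: one length-d dot product
    (d multiplies + d adds) per (query, key) pair."""
    return 2 * n_queries * n_keys * d

n, d = 1024, 64                      # hypothetical sequence length and head size
reported_cut = 0.625                 # reported reduction in attention FLOPs

full = score_flops(n, n, d)          # every token queries every key

# ASSUMPTION: the cut maps directly onto the share of tokens that skip attention.
attention_tokens = int(n * (1 - reported_cut))
hybrid = score_flops(attention_tokens, n, d)

saving = 1 - hybrid / full
print(attention_tokens, saving)      # 384 0.625

# Doubling the sequence length quadruples the full-attention cost.
print(score_flops(2 * n, 2 * n, d) // full)  # 4
```

Under this assumption, 384 of 1024 tokens still pay for scores and the saving matches the reported 62.5%. The last line shows the quadratic growth that the cheap lane sidesteps.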

Three operators, one layer

| Operator | Cost per token | What it does | Fate in CHIAR-Former |
| --- | --- | --- | --- |
| DCT spectral mixing | cheap, no n² term | blends features in the frequency domain (JPEG-style) | the default lane for most tokens |
| RBF kernel mixing | mid (setup-dependent, illustrative) | mixes through a radial-basis-function kernel | abandoned by the router (routing collapse) |
| Full self-attention | expensive, grows with n² | compares each token against every other (attention scores) | reserved for the minority |

This isn't a claim that attention is wasteful everywhere — the paper is explicit that full attention still wins on small or synthetic datasets, where there's little to skim onto the cheap lane. It's that on real text like WikiText-103, a large fraction of tokens are captured well enough by a frequency-domain blend, and a quick spectral-entropy scan is enough to tell which is which. Routing on that signal is what lets a hybrid transformer block spend its attention budget where it earns its keep.
