What is spectral-entropy token routing?

It is a way to make attention cheaper by deciding, per token, which operator should process it. Chiaroscuro's CHIAR-Former computes each token's spectral entropy — a cheap measure of how spread out its signal is across frequencies — and uses that score to route. The paper reports that most natural-language tokens suit a cheap DCT spectral-mixing operator, so they skip the expensive attention computation, and only a minority are sent to full self-attention.

Why does Chiaroscuro cut attention FLOPs by 62.5%?

Full self-attention compares every token against every other, so it dominates the layer's cost and scales with the square of the sequence length. By rerouting the bulk of tokens to a DCT mixer that has no n² term, Chiaroscuro pays for full attention on only a minority of tokens. On WikiText-103 the paper reports 62.5% fewer attention FLOPs and a 45% speedup over the full-attention baseline, while reaching 36.54 validation perplexity.

What is routing collapse, and why did the RBF lane disappear?

Routing collapse is when a learned router stops using one of its options and sends everything to the others. CHIAR-Former started with three operators — DCT spectral mixing, an RBF kernel path, and full attention — but during training the router stopped routing tokens to the RBF lane, so the authors dropped it, leaving a DCT-plus-attention model. It is a reminder that the router is trained, not hand-wired, and can prune a design choice on its own.

Chiaroscuro Attention cuts attention FLOPs 62% — Spectral-entropy token routing

Jargon

Self-attention: The transformer's core operation: every token compares itself against every earlier token, so the work scales with the square of the length (O(n²)). Primer: Attention → Computing Attention Scores.
Spectral entropy: A cheap per-token score of how spread out a token's signal is across frequencies. A flat, predictable token has low entropy; a busy, high-detail one has high entropy. Chiaroscuro uses it as the routing signal.
DCT spectral mixing: The cheap operator. A Discrete Cosine Transform (the same maths behind JPEG) mixes a token's features in the frequency domain instead of comparing it against every other token — far fewer FLOPs, no n² term.
RBF kernel mixing: A third, mid-cost operator the authors included — mixing through a radial-basis-function kernel. In training the router stopped sending tokens to it, so it was dropped (see routing collapse).
Routing collapse: When a learned router quietly abandons one of its options and routes everything to the others. Here the RBF path collapsed, leaving a DCT-plus-attention model.
Mixed-operator transformer: A block where different tokens are processed by different operators in the same layer, rather than one fixed operator for the whole sequence. Chiaroscuro's CHIAR-Former is one.
FLOPs: Floating-point operations — a count of the raw arithmetic a layer does. Attention FLOPs are the work spent on the all-to-all token comparison; cutting them is what makes Chiaroscuro faster.
Perplexity (PPL): How surprised a language model is by held-out text — lower is better. The paper reports it on the WikiText-103 benchmark.

The news. On June 6, 2026, the Chiaroscuro Attention paper (arXiv:2606.08327) introduced CHIAR-Former, a 4-layer hybrid transformer that routes each token to one of three operators (DCT spectral mixing, RBF kernel mixing, or full self-attention) based on its spectral entropy. On WikiText-103 it reports 36.54 validation perplexity with 62.5% fewer attention FLOPs and a 45% speedup over the full-attention baseline (PPL 66.62), while noting full attention still wins on small or synthetic datasets.

Picture the security line at an airport. Every traveler walks up to the same checkpoint, but they don't all need the same screening. A quick scan decides each person's lane: most are cleared into the PreCheck fast lane and through in seconds, and only a flagged few are pulled aside for the slow, thorough manual screening. The line moves fast because the expensive screening is rationed to the few who get flagged, not spent on everyone.

That rationing is exactly what Chiaroscuro does to self-attention. Ordinary attention is the slow manual screening applied to every token: each one is compared against every other, so the work grows with the square of the sequence length. Chiaroscuro adds a quick scan (each token's spectral entropy, a cheap read of how its signal spreads across frequencies) and uses that score to route. The paper finds that most natural-language tokens suit the cheap DCT spectral mixing lane, which blends features in the frequency domain (the same transform behind JPEG) with no token-to-token comparison; only a minority fall through to full attention. Because the cheap lane catches the bulk of the sequence, most tokens never pay the n² cost.
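To make the quick scan concrete, here is a minimal sketch of one way a spectral-entropy score could be computed and used to pick a lane. It is not the paper's implementation. It assumes the token's signal is its feature vector, that the spectrum comes from a DCT-II, that the score is the Shannon entropy of the normalized power spectrum, and that low-entropy tokens go to the DCT lane. The function names and the `threshold` value are hypothetical.

```python
import math

def dct_ii(x):
    """Unnormalized DCT-II of a list of floats (direct O(d^2) form)."""
    d = len(x)
    return [
        sum(x[n] * math.cos(math.pi * (n + 0.5) * k / d) for n in range(d))
        for k in range(d)
    ]

def spectral_entropy(x, eps=1e-12):
    """Shannon entropy of the normalized DCT power spectrum, scaled to [0, 1]."""
    power = [c * c for c in dct_ii(x)]
    total = sum(power) + eps          # eps avoids division by zero
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(x))  # divide by the maximum possible entropy

def route(x, threshold=0.5):
    """Assumed rule: low spectral entropy -> cheap DCT lane, else attention."""
    return "dct" if spectral_entropy(x) < threshold else "attention"

flat_token = [1.0] * 8             # all energy in one frequency
busy_token = [1.0] + [0.0] * 7     # energy spread across many frequencies
print(route(flat_token))           # dct
print(route(busy_token))           # attention
```

In a trained model the routing decision is learned rather than set by a fixed threshold; the threshold here only stands in for that learned decision.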

There's an honest wrinkle the paper doesn't hide. The router started with three lanes — DCT, an RBF kernel path, and full attention — but during training it quietly stopped sending anything to the RBF lane. The authors call this routing collapse: a learned router abandoning an option it decides it doesn't need, leaving a leaner DCT-plus-attention model. It's a useful reminder that a router is trained, not hand-wired — and sometimes it prunes your design for you. The result is still a mixed-operator transformer: different tokens, different operators, same layer.
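Routing collapse is easy to spot if you log which lane each token takes. The sketch below is a hypothetical monitoring helper, not something described in the paper. It tallies the share of tokens per lane and flags any lane whose share falls below an assumed `min_share` cutoff. The lane names and the cutoff are illustrative.

```python
from collections import Counter

LANES = ("dct", "rbf", "attention")

def lane_usage(assignments, lanes=LANES):
    """Fraction of tokens routed to each lane."""
    counts = Counter(assignments)
    total = len(assignments)
    return {lane: (counts[lane] / total if total else 0.0) for lane in lanes}

def collapsed_lanes(assignments, min_share=0.01, lanes=LANES):
    """Lanes whose share of tokens is below min_share (assumed cutoff)."""
    usage = lane_usage(assignments, lanes)
    return [lane for lane in lanes if usage[lane] < min_share]

# Illustrative routing log: no token is ever sent to the RBF lane.
log = ["dct"] * 70 + ["attention"] * 30
print(lane_usage(log))
print(collapsed_lanes(log))
```

Tracking these shares over training steps would show a lane fading out gradually rather than only after the fact.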

Where the FLOPs actually go

Hold one attention layer fixed and count. Full attention pays the expensive Q·K score for every token, so call that 100% of the attention work. Chiaroscuro's reported 62.5% cut means only about 37.5% of that attention work survives. Roughly speaking, only a minority of tokens still take the slow lane and the rest are rerouted to the cheap DCT path; the exact token split depends on the setup, so treat this reading as illustrative. The DCT lane isn't free, but it has no n² term: it costs a fixed amount per token instead of growing with the sequence. Moving the bulk of tokens off attention is where the 45% wall-clock speedup comes from. The win isn't a cheaper attention kernel; it's doing full attention far less often.
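The arithmetic above can be checked with a toy cost model. This is a sketch under stated assumptions, not the paper's accounting. It assumes attention cost is two matrix multiplies (scores and the weighted sum), that tokens routed to attention still attend over all n positions, and that cost therefore scales linearly with the fraction of tokens taking the attention lane. The function names and the `query_fraction` parameter are hypothetical.

```python
def attention_flops(n, d, query_fraction=1.0):
    """Approximate FLOPs for the Q.K score and weighted-sum matmuls.

    n: sequence length, d: head dimension,
    query_fraction: assumed share of tokens routed to full attention.
    """
    queries = query_fraction * n
    per_matmul = 2 * queries * n * d   # 2 FLOPs per multiply-add
    return 2 * per_matmul              # scores + weighted sum

def attention_reduction(n, d, query_fraction):
    """Fractional cut in attention FLOPs relative to full attention."""
    full = attention_flops(n, d)
    routed = attention_flops(n, d, query_fraction)
    return 1.0 - routed / full

# If 37.5% of tokens still take the attention lane, 62.5% of the work is cut.
print(attention_reduction(1024, 64, 0.375))   # 0.625

# Doubling the sequence length quadruples full-attention cost (the n² term).
print(attention_flops(2048, 64) / attention_flops(1024, 64))   # 4.0
```

Under this model the reduction depends only on the routed fraction, not on n or d, which is why the figure can be read as a share of tokens.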

Three operators, one layer

Operator	Cost per token	What it does	Fate in CHIAR-Former
DCT spectral mixing	cheap, no `n²` term	blends features in the frequency domain (JPEG-style)	the default lane for most tokens
RBF kernel mixing	mid-range (illustrative; depends on setup)	mixes through a radial-basis-function kernel	abandoned by the router (routing collapse)
Full self-attention	expensive, grows with `n²`	compares each token against every other	reserved for the minority

This isn't a claim that attention is wasteful everywhere — the paper is explicit that full attention still wins on small or synthetic datasets, where there's little to skim onto the cheap lane. It's that on real text like WikiText-103, a large fraction of tokens are captured well enough by a frequency-domain blend, and a quick spectral-entropy scan is enough to tell which is which. Routing on that signal is what lets a hybrid transformer block spend its attention budget where it earns its keep.

Goes deeper in: LLM Internals → Self-Attention → Computing Attention Scores

Related explainers

MiniMax M3 — MiniMax Sparse Attention — a different way to skip attention work: select a few KV blocks per query, instead of routing tokens to a cheaper operator
Parallax — local-linear attention — the linear-attention route: change what attention estimates rather than which tokens pay for it
I/O-optimal approximate attention — cut attention's memory traffic rather than its FLOPs

