Compute Where It Counts — Per-token compute controller

[Hero animation — "Per-token compute controller — every token gets only the budget it needs." A 12-token sequence ("The answer to ∫₀¹ x² dx equals one third since n=2.") is shown with a per-token compute-budget bar from 1 to 5 units, actions labeled easy (1–2), medium (3), and hard (4–5). Baseline = 100% (full attention + full MLP + full precision per token); controlled ≈ 53% average FLOPs per token across the sequence (illustrative).]

The news. On May 12, 2026, the “Compute Where It Counts” paper was posted to arXiv with an ICML'26 accept. The authors frame it as Self-Optimizing Language Models (SOL): a frozen base LLM paired with a lightweight policy network that, at each decode step, reads the hidden state and picks one discrete efficiency action for the next token (attention sparsity, MLP pruning, or activation bit-width). The reported result on reasoning benchmarks is comparable or higher accuracy at lower average FLOPs per token, with the largest savings on long contexts that are mostly easy filler punctuated by a few hard pivots.

Picture the chess master at the board. On an opening move they already know — pawn to e4 — the hand snaps out and the clock barely ticks. On a tactical pivot two hours later, where one move flips a winning position to a draw, they sit and stare for forty-five seconds. The total game-clock budget is the same either way; the master allocates it to where it changes the answer. A patzer who burns thirty seconds on every move, routine and tactical alike, runs out of clock without ever thinking harder where it matters.

A standard LLM is the patzer. Every output token is one forward pass with the same dense attention, the same full MLP, and the same activation precision — the same FLOPs whether the next token is “the” or the load-bearing numeric answer at the end of a derivation. SOL leaves the base model weights untouched and adds a policy network alongside it. Before each new token the policy looks at the LLM's current hidden state and picks one of three knobs to dial back: how many K/V pairs the attention attends to, how many of the FFN's neurons fire, and the bit-width of the activations. Easy tokens get an aggressively cheap action; hard tokens get the full dense-attention / full-MLP / full-precision action. Only the policy network is trained — the base LLM itself never changes weights.
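The controller described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the action values, names (`ACTIONS`, `pick_action`), and the linear scoring head are all assumptions; the point is only that a small trained head reads a frozen model's hidden state and emits one discrete action per token.

```python
import random

# Each discrete action fixes all three knobs (attention sparsity, MLP
# pruning, activation bit-width) for the NEXT token. The specific
# values below are illustrative assumptions.
ACTIONS = [
    {"kv_keep": 0.25, "mlp_keep": 0.25, "act_bits": 4},   # cheap: easy tokens
    {"kv_keep": 0.50, "mlp_keep": 0.50, "act_bits": 8},   # medium
    {"kv_keep": 1.00, "mlp_keep": 1.00, "act_bits": 16},  # full budget: hard tokens
]

def pick_action(hidden_state, policy_weights):
    """Score each action with a linear head over the frozen LLM's hidden
    state and take the argmax. Only policy_weights would be trained; the
    base model that produced hidden_state never changes."""
    scores = [sum(w * h for w, h in zip(row, hidden_state))
              for row in policy_weights]
    return max(range(len(ACTIONS)), key=scores.__getitem__)

random.seed(0)
d = 8  # toy hidden size
policy_weights = [[random.gauss(0, 1) for _ in range(d)] for _ in ACTIONS]
hidden_state = [random.gauss(0, 1) for _ in range(d)]

choice = pick_action(hidden_state, policy_weights)
print(ACTIONS[choice])
```

In the real system the head would run once per decode step, adding a cost that is tiny next to even the cheapest action.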

The hero animation reads the three efficiency knobs as one compute-budget axis for visual clarity (illustrative abstraction; the underlying actions are attention sparsity, MLP pruning, and activation bit-width, not transformer depth). In the PROBLEM beat every token consumes the full 5-unit baseline budget — uniform blue bars at 100% average FLOPs. The MORPH beat fades in the ⚙ Controller chip representing the policy network. In the SOLUTION beat each token's bar interpolates to the budget the policy picked: filler tokens like “the”, “to”, the trailing period sit at 1–2 units in green (sparse attention + pruned MLP + low-bit activations); the integral symbol “∫₀¹”, “x²”, and the load-bearing “n=2” sit at 5 units in amber (full attention + full MLP + full precision). Average FLOPs per token drops to ~53% of baseline on this illustrative twelve-token sequence.
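The animation's ~53% figure is easy to reproduce. The per-token budget assignments below are assumptions chosen to match the figure's labels (easy 1–2, medium 3, hard 4–5 out of a 5-unit full budget); only the token sequence itself comes from the animation.

```python
# Illustrative budgets per token, out of a 5-unit full budget.
tokens  = ["The", "answer", "to", "∫₀¹", "x²", "dx",
           "equals", "one", "third", "since", "n=2", "."]
budgets = [1, 2, 1, 5, 5, 3, 2, 3, 3, 1, 5, 1]  # assumed assignments

avg_flops = sum(budgets) / (5 * len(budgets))  # fraction of baseline
print(f"{avg_flops:.0%}")  # 53%
```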

A worked numeric example (illustrative — exact savings are setup- and benchmark-dependent). Picture a model generating a 1,000-token reasoning trace where each token nominally costs 1 unit of full-budget compute, for 1,000 units baseline. Suppose the policy picks the cheap action (≈0.25 units) on 700 routine tokens (700 × 0.25 = 175 units) and the full action (1.0 units) on 300 pivotal tokens (300 × 1.0 = 300 units). Total: 475 units, ~48% of the baseline, at the same final answer. The savings concentrate exactly where you'd hope — long-but-mostly-easy contexts like agent loops, summarization, and structured outputs with lots of bookkeeping tokens.
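The arithmetic above, checked directly:

```python
baseline_units = 1000 * 1.0         # 1,000 tokens at 1 unit each
cheap_units    = 700 * 0.25         # routine tokens at the cheap action
full_units     = 300 * 1.0          # pivotal tokens at the full action
total_units    = cheap_units + full_units

print(total_units)                   # 475.0
print(total_units / baseline_units)  # 0.475, i.e. ~48% of baseline
```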

Where it sits next to other per-token-compute levers

| Approach | What varies per token | How it's decided | Where it earns its keep |
| --- | --- | --- | --- |
| Static dense transformer | Nothing — every token is full attention + full MLP + full precision | n/a — fixed at training time | Baseline; simple, easy to reason about |
| Static quantization / pruning / sparse attention | Compute level — but uniformly for every token | Set once at deploy time | Blanket FLOP cuts, regardless of which tokens needed full precision |
| Mixture-of-Experts routing | Which experts fire — total FLOPs roughly constant per token | Learned router per MoE layer (top-k experts) | Scaling model capacity without scaling per-token FLOPs (see Nemotron 3 MoE) |
| SOL — per-token compute controller (this paper) | Attention sparsity, MLP pruning, and activation bit-width — picked per token | Lightweight policy network over the frozen LLM's hidden state | Long reasoning traces with mixed-difficulty tokens (savings setup-dependent, illustrative) |
| PPOW speculative-decode window | Window size — how many tokens the drafter proposes ahead | RL-trained drafter, KL-divergence reward | Cuts wall-clock per decoded token, separate from per-token FLOPs |

Because the base LLM is frozen, the architecture and KV cache layout can stay familiar — but the runtime has to honor mixed per-token actions: each token may request a different attention sparsity, a different MLP-pruning ratio, or a different activation bit-width than the token next to it in the batch. In principle SOL composes with serving levers like paged attention and continuous batching, but the practical batching story is implementation-dependent — irregular per-token actions make the model executor's kernels less uniform, so engines need code paths for mixed-action batches.
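One plausible way an engine could cope with those irregular per-token actions is to bucket tokens by their chosen action, so each kernel launch sees a uniform (sparsity, pruning, bit-width) configuration. This is a hypothetical sketch, not how any particular serving engine does it:

```python
from collections import defaultdict

def bucket_by_action(token_actions):
    """Group token indices in a decode batch by their policy-chosen
    action, so each group can run through one uniform kernel config."""
    buckets = defaultdict(list)
    for token_idx, action in enumerate(token_actions):
        buckets[action].append(token_idx)
    return dict(buckets)

# Six tokens in one decode batch, each with its policy-chosen action:
batch = ["cheap", "full", "cheap", "medium", "full", "cheap"]
print(bucket_by_action(batch))
# {'cheap': [0, 2, 5], 'full': [1, 4], 'medium': [3]}
```

The trade-off is classic batching-vs-latency: more action buckets means smaller, less efficient kernel launches, which is why the practical story is implementation-dependent.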

The deeper lesson is a mental-model shift. Inference cost has historically been treated as fixed per token at model-design time: a 70B model costs 70B-worth of FLOPs per token, period. SOL breaks that — average FLOPs per token becomes a learned dial the policy network adjusts as the model generates, without retraining the base weights.

Goes deeper in: LLM Internals → Transformer Block → Why every layer pays the same per token
