Compute Where It Counts — Per-token compute controller
The news. On May 12, 2026, the “Compute Where It Counts” paper was posted to arXiv with an ICML'26 accept. The authors frame it as Self-Optimizing Language Models (SOL): a frozen base LLM paired with a lightweight policy network that, at each decode step, reads the hidden state and picks one discrete efficiency action for the next token (attention sparsity, MLP pruning, or activation bit-width). The reported result on reasoning benchmarks is comparable or higher accuracy at lower average FLOPs per token, with the largest savings on long contexts that are mostly easy filler punctuated by a few hard pivots.
Picture the chess master at the board. On an opening move they already know — pawn to e4 — the hand snaps out and the clock barely ticks. On a tactical pivot two hours later, where one move flips a winning position to a draw, they sit and stare for forty-five seconds. The total game-clock budget is the same either way; the master allocates it to where it changes the answer. A patzer who burns thirty seconds on every move, routine and tactical alike, runs out of clock without ever thinking harder where it matters.
A standard LLM is the patzer. Every output token is one forward pass with the same dense attention, the same full MLP, and the same activation precision — the same FLOPs whether the next token is “the” or the load-bearing numeric answer at the end of a derivation. SOL leaves the base model weights untouched and adds a policy network alongside it. Before each new token the policy looks at the LLM's current hidden state and decides how far to dial back three knobs: how many K/V pairs the attention attends to, how many of the FFN's neurons fire, and the bit-width of the activations. Easy tokens get an aggressively cheap action; hard tokens get the full dense-attention / full-MLP / full-precision action. Only the policy network is trained — the base LLM itself never changes weights.
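To make the control loop concrete, here is a minimal Python sketch of a per-token controller. The action space, the two knob presets, and the "difficulty" heuristic standing in for the trained policy network are all illustrative assumptions, not the paper's configuration:

```python
from dataclasses import dataclass

# Hypothetical discrete action: one setting per efficiency knob.
@dataclass(frozen=True)
class EfficiencyAction:
    kv_keep_frac: float   # fraction of K/V pairs attention attends to
    mlp_keep_frac: float  # fraction of FFN neurons allowed to fire
    act_bits: int         # activation bit-width

CHEAP = EfficiencyAction(kv_keep_frac=0.25, mlp_keep_frac=0.25, act_bits=4)
FULL  = EfficiencyAction(kv_keep_frac=1.0,  mlp_keep_frac=1.0,  act_bits=16)

def policy(hidden_state):
    """Stand-in for the trained policy network: reads the frozen LLM's
    current hidden state and returns one discrete efficiency action.
    'Difficulty' is faked here as the mean magnitude of the state."""
    difficulty = sum(abs(x) for x in hidden_state) / len(hidden_state)
    return FULL if difficulty > 0.5 else CHEAP

def decode_step(hidden_state):
    # The runtime would configure the next forward pass with this action
    # before sampling the token; the base LLM's weights never change.
    return policy(hidden_state)

easy_state = [0.1, -0.2, 0.05, 0.15]   # routine filler token
hard_state = [1.2, -0.9, 1.5, -1.1]    # load-bearing pivot token
assert decode_step(easy_state) is CHEAP
assert decode_step(hard_state) is FULL
```

The key structural point the sketch preserves: the policy sits outside the base model, sees only the hidden state, and emits one action per decode step.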
The hero animation collapses the three efficiency knobs into one compute-budget axis for visual clarity (illustrative abstraction; the underlying actions are attention sparsity, MLP pruning, and activation bit-width, not transformer depth). In the PROBLEM beat every token consumes the full 5-unit baseline budget — uniform blue bars at 100% average FLOPs. The MORPH beat fades in the ⚙ Controller chip representing the policy network. In the SOLUTION beat each token's bar interpolates to the budget the policy picked: filler tokens like “the”, “to”, the trailing period sit at 1–2 units in green (sparse attention + pruned MLP + low-bit activations); the integral symbol “∫₀¹”, “x²”, and the load-bearing “n=2” sit at 5 units in amber (full attention + full MLP + full precision). Average FLOPs per token drops to ~53% of baseline on this illustrative twelve-token sequence.
A worked numeric example (illustrative — exact savings are setup- and benchmark-dependent). Picture a model generating a 1,000-token reasoning trace where each token nominally costs 1 unit of full-budget compute, for 1,000 units baseline. Suppose the policy picks the cheap action (≈0.25 units) on 700 routine tokens (700 × 0.25 = 175 units) and the full action (1.0 units) on 300 pivotal tokens (300 × 1.0 = 300 units). Total: 475 units, ~48% of the baseline, at the same final answer. The savings concentrate exactly where you'd hope — long-but-mostly-easy contexts like agent loops, summarization, and structured outputs with lots of bookkeeping tokens.
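The arithmetic above is easy to replay in a few lines (same illustrative numbers as the paragraph; nothing here comes from the paper's measurements):

```python
# Reproduce the illustrative budget arithmetic from the paragraph above.
BASELINE_COST = 1.0   # units per token under the full-budget action
CHEAP_COST = 0.25     # units per token under the cheap action
n_cheap, n_full = 700, 300

baseline = (n_cheap + n_full) * BASELINE_COST             # 1000.0 units
adaptive = n_cheap * CHEAP_COST + n_full * BASELINE_COST  # 175.0 + 300.0
print(adaptive)             # 475.0
print(adaptive / baseline)  # 0.475, i.e. ~48% of baseline
```

Note how the savings are driven almost entirely by the cheap-action fraction: the 300 full-budget tokens alone account for 300 of the 475 units.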
Where it sits next to other per-token-compute levers
| Approach | What varies per token | How it's decided | Where it earns its keep |
|---|---|---|---|
| Static dense transformer | Nothing — every token is full attention + full MLP + full precision | n/a — fixed at training time | baseline, simple, easy to reason about |
| Static quantization / pruning / sparse attention | Compute level — but uniformly for every token | Set once at deploy time | blanket FLOP cuts, regardless of which tokens needed full precision |
| Mixture-of-Experts routing | Which experts fire — total FLOPs roughly constant per token | Learned router per MoE layer (top-k experts) | scaling model capacity without scaling per-token FLOPs (see Nemotron 3 MoE) |
| SOL — per-token compute controller (this paper) | Attention sparsity, MLP pruning, and activation bit-width — picked per token | Lightweight policy network over the frozen LLM's hidden state | long reasoning traces with mixed-difficulty tokens (savings setup-dependent, illustrative) |
| PPOW speculative-decode window | Window size — how many tokens the drafter proposes ahead | RL-trained drafter, KL-divergence reward | cuts wall-clock per decoded token, separate from per-token FLOPs |
Because the base LLM is frozen, the architecture and KV cache layout can stay familiar — but the runtime has to honor mixed per-token actions: each token may request a different attention sparsity, a different MLP-pruning ratio, or a different activation bit-width than the token next to it in the batch. In principle SOL composes with serving levers like paged attention and continuous batching, but the practical batching story is implementation-dependent — irregular per-token actions make the model executor's kernels less uniform, so engines need code paths for mixed-action batches.
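One common way to tame irregular per-token work is to bucket requests by action so each bucket gets a uniform kernel launch. This is a hypothetical sketch of that grouping step, not SOL's or any engine's actual scheduler; the action labels are placeholders:

```python
from collections import defaultdict

# Hypothetical mixed-action decode batch: (token_id, action) per slot.
# A real engine would carry full knob settings, not a string label.
batch = [
    (101, "cheap"), (102, "full"), (103, "cheap"),
    (104, "cheap"), (105, "full"), (106, "cheap"),
]

# Group batch slots by action so the executor can make one uniform
# kernel launch per bucket instead of an irregular per-token dispatch.
buckets = defaultdict(list)
for slot, (_token_id, action) in enumerate(batch):
    buckets[action].append(slot)

print(dict(buckets))  # {'cheap': [0, 2, 3, 5], 'full': [1, 4]}
```

The trade-off is classic batching economics: more distinct actions means smaller buckets and worse kernel utilization, which is one reason a small discrete action set is friendlier to serving than a continuous knob.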
The deeper lesson is a mental-model shift. Inference cost has historically been treated as fixed per token at model-design time: a 70B model costs 70B-worth of FLOPs per token, period. SOL breaks that — average FLOPs per token becomes a learned dial the policy network adjusts as the model generates, without retraining the base weights.
Goes deeper in: LLM Internals → Transformer Block → Why every layer pays the same per token
Related explainers
- Nemotron 3 Nano Omni — 30B-A3B multimodal MoE — a different per-token compute lever: vary which experts fire while total FLOPs per token stay roughly constant
- PPOW — window-level RL for speculative drafters — varies tokens-per-step at the drafter, complementary to varying FLOPs-per-token at the target
- DeepSeek V4 — long-context cost cut to a fraction — architectural per-token cost cuts, applied uniformly across all tokens