Compute Where It Counts — Per-token compute controller
The news. On May 12, 2026, the “Compute Where It Counts” paper was posted to arXiv with an ICML'26 accept. The authors frame it as Self-Optimizing Language Models (SOL): a frozen base LLM paired with a lightweight policy network that, at each decode step, reads the hidden state and picks one discrete efficiency action for the next token (attention sparsity, MLP pruning, or activation bit-width). The reported result on reasoning benchmarks is comparable or higher accuracy at lower average FLOPs per token, with the largest savings on long contexts that are mostly easy filler punctuated by a few hard pivots.
Picture the chess master at the board. On an opening move they already know — pawn to e4 — the hand snaps out and the clock barely ticks. On a tactical pivot two hours later, where one move flips a winning position to a draw, they sit and stare for forty-five seconds. The total game-clock budget is the same either way; the master allocates it to where it changes the answer. A patzer who burns thirty seconds on every move, routine and tactical alike, runs out of clock without ever thinking harder where it matters.
A standard LLM is the patzer. Every output token is one forward pass with the same dense attention, the same full MLP, and the same activation precision — the same FLOPs whether the next token is “the” or the load-bearing numeric answer at the end of a derivation. SOL leaves the base model weights untouched and adds a policy network alongside it. Before each new token the policy looks at the LLM's current hidden state and decides how far to dial back three knobs: how many K/V pairs the attention attends to, how many of the FFN's neurons fire, and the bit-width of the activations. Easy tokens get an aggressively cheap action; hard tokens get the full dense-attention / full-MLP / full-precision action. Only the policy network is trained — the base LLM itself never changes weights.
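To make the control loop concrete, here is a minimal Python sketch of a per-token controller. The action space, the two knob presets, and the "difficulty" heuristic standing in for the trained policy network are all illustrative assumptions, not the paper's configuration:

```python
from dataclasses import dataclass

# Hypothetical discrete action: one setting per efficiency knob.
@dataclass(frozen=True)
class EfficiencyAction:
    kv_keep_frac: float   # fraction of K/V pairs attention attends to
    mlp_keep_frac: float  # fraction of FFN neurons allowed to fire
    act_bits: int         # activation bit-width

CHEAP = EfficiencyAction(kv_keep_frac=0.25, mlp_keep_frac=0.25, act_bits=4)
FULL  = EfficiencyAction(kv_keep_frac=1.0,  mlp_keep_frac=1.0,  act_bits=16)

def policy(hidden_state):
    """Stand-in for the trained policy network: reads the frozen LLM's
    current hidden state and returns one discrete efficiency action.
    'Difficulty' is faked here as the mean magnitude of the state."""
    difficulty = sum(abs(x) for x in hidden_state) / len(hidden_state)
    return FULL if difficulty > 0.5 else CHEAP

def decode_step(hidden_state):
    # The runtime would configure the next forward pass with this action
    # before sampling the token; the base LLM's weights never change.
    return policy(hidden_state)

easy_state = [0.1, -0.2, 0.05, 0.15]   # routine filler token
hard_state = [1.2, -0.9, 1.5, -1.1]    # load-bearing pivot token
assert decode_step(easy_state) is CHEAP
assert decode_step(hard_state) is FULL
```

The key structural point the sketch preserves: the policy sits outside the base model, sees only the hidden state, and emits one action per decode step.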
The hero animation collapses the three efficiency knobs into one compute-budget axis for visual clarity (illustrative abstraction; the underlying actions are attention sparsity, MLP pruning, and activation bit-width, not transformer depth). In the PROBLEM beat every token consumes the full 5-unit baseline budget — uniform blue bars at 100% average FLOPs. The MORPH beat fades in the ⚙ Controller chip representing the policy network. In the SOLUTION beat each token's bar interpolates to the budget the policy picked: filler tokens like “the”, “to”, the trailing period sit at 1–2 units in green (sparse attention + pruned MLP + low-bit activations); the integral symbol “∫₀¹”, “x²”, and the load-bearing “n=2” sit at 5 units in amber (full attention + full MLP + full precision). Average FLOPs per token drops to ~53% of baseline on this illustrative twelve-token sequence.
A worked numeric example (illustrative — exact savings are setup- and benchmark-dependent). Picture a model generating a 1,000-token reasoning trace where each token nominally costs 1 unit of full-budget compute, for 1,000 units baseline. Suppose the policy picks the cheap action (≈0.25 units) on 700 routine tokens (700 × 0.25 = 175 units) and the full action (1.0 units) on 300 pivotal tokens (300 × 1.0 = 300 units). Total: 475 units, ~48% of the baseline, at the same final answer. The savings concentrate exactly where you'd hope — long-but-mostly-easy contexts like agent loops, summarization, and structured outputs with lots of bookkeeping tokens.
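The arithmetic above is easy to replay in a few lines (same illustrative numbers as the paragraph; nothing here comes from the paper's measurements):

```python
# Reproduce the illustrative budget arithmetic from the paragraph above.
BASELINE_COST = 1.0   # units per token under the full-budget action
CHEAP_COST = 0.25     # units per token under the cheap action
n_cheap, n_full = 700, 300

baseline = (n_cheap + n_full) * BASELINE_COST             # 1000.0 units
adaptive = n_cheap * CHEAP_COST + n_full * BASELINE_COST  # 175.0 + 300.0
print(adaptive)             # 475.0
print(adaptive / baseline)  # 0.475, i.e. ~48% of baseline
```

Note how the savings are driven almost entirely by the cheap-action fraction: the 300 full-budget tokens alone account for 300 of the 475 units.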
Where it sits next to other per-token-compute levers
| Approach | What varies per token | How it's decided | Where it earns its keep |
|---|---|---|---|
| Static dense transformer | Nothing — every token is full attention + full MLP + full precision | n/a — fixed at training time | baseline, simple, easy to reason about |
| Static quantization / pruning / sparse attention | Compute level — but uniformly for every token | Set once at deploy time | blanket FLOP cuts, regardless of which tokens needed full precision |
| Mixture-of-Experts routing | Which experts fire — total FLOPs roughly constant per token | Learned router per MoE layer (top-k experts) | scaling model capacity without scaling per-token FLOPs (see Nemotron 3 MoE) |
| SOL — per-token compute controller (this paper) | Attention sparsity, MLP pruning, and activation bit-width — picked per token | Lightweight policy network over the frozen LLM's hidden state | long reasoning traces with mixed-difficulty tokens (savings setup-dependent, illustrative) |
| PPOW speculative-decode window | Window size — how many tokens the drafter proposes ahead | RL-trained drafter, KL-divergence reward | cuts wall-clock per decoded token, separate from per-token FLOPs |
Because the base LLM is frozen, the architecture and KV cache layout can stay familiar — but the runtime has to honor mixed per-token actions: each token may request a different attention sparsity, a different MLP-pruning ratio, or a different activation bit-width than the token next to it in the batch. In principle SOL composes with serving levers like paged attention and continuous batching, but the practical batching story is implementation-dependent — irregular per-token actions make the model executor's kernels less uniform, so engines need code paths for mixed-action batches.
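One common way to tame irregular per-token work is to bucket requests by action so each bucket gets a uniform kernel launch. This is a hypothetical sketch of that grouping step, not SOL's or any engine's actual scheduler; the action labels are placeholders:

```python
from collections import defaultdict

# Hypothetical mixed-action decode batch: (token_id, action) per slot.
# A real engine would carry full knob settings, not a string label.
batch = [
    (101, "cheap"), (102, "full"), (103, "cheap"),
    (104, "cheap"), (105, "full"), (106, "cheap"),
]

# Group batch slots by action so the executor can make one uniform
# kernel launch per bucket instead of an irregular per-token dispatch.
buckets = defaultdict(list)
for slot, (_token_id, action) in enumerate(batch):
    buckets[action].append(slot)

print(dict(buckets))  # {'cheap': [0, 2, 3, 5], 'full': [1, 4]}
```

The trade-off is classic batching economics: more distinct actions means smaller buckets and worse kernel utilization, which is one reason a small discrete action set is friendlier to serving than a continuous knob.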
The deeper lesson is a mental-model shift. Inference cost has historically been treated as fixed per token at model-design time: a 70B model costs 70B-worth of FLOPs per token, period. SOL breaks that — average FLOPs per token becomes a learned dial the policy network adjusts as the model generates, without retraining the base weights.
Goes deeper in: LLM Internals → Transformer Block → Why every layer pays the same per token
Related explainers
- Nemotron 3 Nano Omni — 30B-A3B multimodal MoE — a different per-token compute lever: vary which experts fire while total FLOPs per token stay roughly constant
- PPOW — window-level RL for speculative drafters — varies tokens-per-step at the drafter, complementary to varying FLOPs-per-token at the target
- DeepSeek V4 — long-context cost cut to a fraction — architectural per-token cost cuts, applied uniformly across all tokens