Token Budgets — Affine-typed budget ownership
The news. On June 2, 2026, the Token Budgets paper landed: an empirical catalog of 63 production cost-overrun incidents in LLM-agent systems, pulled from a review of 21 orchestration frameworks spanning 2023–2026 and clustered into an 8-category failure taxonomy (inter-rater Cohen's kappa 0.837). As a mitigation, the authors ship a 1,180-line Rust crate that uses affine-type ownership to turn budget violations into compile-time errors. In controlled tests, single-agent runs never overshot (0/30) while multi-agent asyncio delegation overshot every time (30/30); the mitigated runs then logged 0 cap violations across 160 live-API tests. Read the paper →
Picture the group dinner. There's one prepaid gift card with $1,000 on it, and four friends who all want to order. The cheap, lazy move is for everyone to photocopy the card and assume they each have the full balance — four copies, four people each cheerfully spending $350, and a $1,400 bill arrives against a card that only ever held $1,000. The card was never debited as people spent, so nothing stopped the overshoot until the bill came. Affine-typed budget ownership is the opposite rule: there is exactly one card, and the only legal operation is to split it into prepaid sub-cards — the money physically moves out of the original, and a photocopy simply isn't allowed.
In an agent system the "photocopy" bug is a delegation fan-out: an orchestrator spawns parallel sub-agents, and each one reserves a chunk of the token budget against a cap that no single owner is decrementing. The paper's headline number is that this pattern overshot 30 out of 30 runs, while a single agent — which spends against one running total — overshot 0 of 30. The fix is to make the budget an affine value: the Rust compiler tracks it as use-at-most-once, so a code path where two sub-agents could both hold the same budget fails to type-check. The cap is enforced by construction rather than by an assert that fires after the tokens are already gone — the same shift from runtime to compile-time that separates a retry loop that quietly re-bills you from one that can't.
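To make the "no photocopies" rule concrete, here is a minimal sketch of an affine budget in Rust. It is not the paper's crate: the `Budget` type, its `split` method, and `run_sub_agent` are hypothetical names chosen for illustration. The only assumption is the standard Rust rule that a value without `Clone` or `Copy` is moved when passed by value.

```rust
/// A token budget that can be moved but never copied.
/// Deliberately no `#[derive(Clone, Copy)]`: that omission is what makes it affine.
#[derive(Debug)]
pub struct Budget {
    tokens: u64,
}

impl Budget {
    pub fn new(tokens: u64) -> Self {
        Budget { tokens }
    }

    pub fn remaining(&self) -> u64 {
        self.tokens
    }

    /// Consume this budget and return `(child, rest)`.
    /// The child gets at most `want` tokens; the original value is gone afterwards.
    pub fn split(self, want: u64) -> (Budget, Budget) {
        let granted = want.min(self.tokens);
        (
            Budget { tokens: granted },
            Budget { tokens: self.tokens - granted },
        )
    }
}

/// Hypothetical sub-agent: takes ownership of its slice and reports what it spent.
fn run_sub_agent(slice: Budget) -> u64 {
    slice.remaining() // pretend it spends everything it was given
}

fn main() {
    let cap = Budget::new(1_000);
    let (a, rest) = cap.split(350);
    let (b, rest) = rest.split(350);

    // Uncommenting the next line is the "photocopy" and does not compile:
    // let (c, _) = cap.split(350); // error[E0382]: use of moved value: `cap`

    let spent = run_sub_agent(a) + run_sub_agent(b);
    println!("spent {spent}, left {}", rest.remaining()); // prints "spent 700, left 300"
}
```

Because `split` takes `self` by value, the compiler rejects any second use of the parent budget, so the fan-out can only hand out what it still owns.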
Where the budget actually goes
A back-of-envelope walk-through (illustrative cap and slice sizes; the overshoot and over-reservation counts are the paper's). Say the shared cap is 1,000 tokens and the orchestrator fans out to four sub-agents. Under static reservation each child grabs a fixed 350, and because the reservations are effectively copies, the total claimed is 4 × 350 = 1,400 — a 400-token (40%) overshoot that nothing rejects until the spend lands. Make the budget affine and the same 1,000 is split into owned slices — say 300 + 220 + 260 + 220 = 1,000 — where the fourth claim can only take what the first three left behind. The sum is bounded to the cap by construction, which is the property the paper's Rust crate enforces: across 160 live-API tests it logged 0 cap violations, where unbounded multi-agent delegation had overshot all 30 runs. Static reservation's habit of grabbing 4–6× the budget it needs (adaptive trims that to 2.11×) is the same waste, viewed from the other side.
| Approach | When the cap is checked | Multi-agent overshoot | Over-reservation |
|---|---|---|---|
| Runtime budget guard | at spend time — after tokens commit | possible (the default failure) | — |
| Static reservation | up front, no shared cap | 30/30 runs (Token Budgets paper) | ~4–6× (paper) |
| Adaptive reservation | re-estimated per call | not reported (paper) | ~2.11× (paper) |
| Affine-typed ownership | compile time — won't type-check | 0 violations / 160 tests (paper) | bounded to the cap |
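The walk-through arithmetic can be checked with a few lines of Rust. This sketch uses the same illustrative cap and slice sizes as the text; `static_claims` and `owned_slices` are hypothetical helper names, and the fourth request is set to 350 to show it being trimmed to the 220 tokens left over.

```rust
/// Static reservation modelled as copies: every child claims `per_child`,
/// and nothing debits the cap as the claims are made.
fn static_claims(children: u64, per_child: u64) -> u64 {
    children * per_child
}

/// Owned slices: each request is granted from what is left, so the running
/// total can never exceed `cap`.
fn owned_slices(cap: u64, requests: &[u64]) -> Vec<u64> {
    let mut left = cap;
    requests
        .iter()
        .map(|&want| {
            let granted = want.min(left);
            left -= granted;
            granted
        })
        .collect()
}

fn main() {
    let cap = 1_000;

    let claimed = static_claims(4, 350);
    // prints "static: claimed 1400, overshoot 400"
    println!("static: claimed {claimed}, overshoot {}", claimed - cap);

    let slices = owned_slices(cap, &[300, 220, 260, 350]);
    let total: u64 = slices.iter().sum();
    // prints "owned: [300, 220, 260, 220], total 1000"
    println!("owned: {slices:?}, total {total}");
}
```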
The catch is that this only buys you safety where you can express ownership in the type system: a Rust crate gets it for free, while a Python orchestrator built on asyncio.gather does not, which is exactly where the paper's 30/30 overshoots came from. But the lesson generalizes past the language. In a multi-agent team the budget is a shared resource, and who is allowed to hold it, and whether they can copy it, is a design decision, not something to discover when the bill arrives.
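Where ownership cannot be expressed in types, the nearest runtime stand-in is a single ledger that is debited at reservation time rather than after the spend. The sketch below is an assumption of this explainer, not something the paper describes: `Ledger` and `reserve` are hypothetical names, and the check is enforced at run time, so it is weaker than the compile-time guarantee. It is kept in Rust for consistency with the other examples.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical shared ledger: one counter that every reservation debits.
pub struct Ledger {
    left: AtomicU64,
}

impl Ledger {
    pub fn new(cap: u64) -> Self {
        Ledger { left: AtomicU64::new(cap) }
    }

    /// Try to reserve `want` tokens. Succeeds only if the full amount is
    /// still available, and debits it in the same atomic step.
    pub fn reserve(&self, want: u64) -> bool {
        self.left
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |left| {
                left.checked_sub(want)
            })
            .is_ok()
    }

    pub fn remaining(&self) -> u64 {
        self.left.load(Ordering::SeqCst)
    }
}

fn main() {
    let ledger = Ledger::new(1_000);

    // Four concurrent sub-agents each ask for 350 tokens.
    let wins = thread::scope(|s| {
        let handles: Vec<_> = (0..4)
            .map(|_| s.spawn(|| ledger.reserve(350)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .filter(|&ok| ok)
            .count()
    });

    // Only two reservations fit under the cap, whichever threads get there first.
    println!("granted {wins} of 4, left {}", ledger.remaining()); // prints "granted 2 of 4, left 300"
}
```

The design choice is the same one the gift-card analogy makes: the balance moves when the claim is made, so a late claimant is refused instead of billed.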
Related explainers
- StreamMA — Streaming inter-agent reasoning — a different multi-agent cost: wall-clock latency from serial handoffs, cut by pipelining rather than by bounding tokens
- Maestro — RL orchestrator over frozen experts — the orchestrator-over-sub-agents topology where this fan-out budget problem lives
- EFC — feedback-quality scaling law — what actually predicts agent-harness success, the other half of "spend the budget well"