AsyncFC paper — Symbolic futures in the decode stream

AsyncFC hero animation — a single Gantt timeline with three rows (decode, tool 1, tool 2) and a sweeping now-cursor. When the decoder emits a tool-call token, a symbolic-future badge appears above the decode row; the tool bar begins running in parallel, and the badge flips to a checkmark when the bar finishes. Decode never stops. The animation closes with an 11 s vs 6.5 s wall-clock comparison against the synchronous alternative.
learnaivisually.com/ai-explained/asyncfc-symbolic-futures

The news. On May 14, 2026, a research paper introduced AsyncFC — a futures-based async function-calling framework that overlaps LLM decoding with tool execution. The authors show that current LLMs already handle symbolic future placeholders without any retraining: the harness inserts a typed referent into the context, dispatches the tool in the background, and substitutes the real value before the next forward pass that depends on it. End-to-end task time drops while task accuracy holds. Read the paper →

Picture the coffee counter. The slow path is standing in front of the cashier waiting for your custom drink to be made before you order anything else. You stare at the espresso machine for two minutes. The line backs up. Nothing else happens. That is exactly what a synchronous tool-call loop looks like inside an LLM agent: the model emits a tool call, the harness blocks, the tool runs, the result comes back, the model resumes. While the tool is running the decoder is idle. The most expensive accelerator in your stack is doing nothing.

The fast path is walking away with a buzzer. You place your order, the cashier hands you a small puck that will vibrate when your drink is ready, and you move on — you can order something for a colleague, you can find a table, you can plan what you'll do once the drink arrives. The buzzer is the symbolic future. It's not the drink. It stands in for the drink. You can hold it, talk about it ("I'll grab the coffee in a minute"), even commit to actions that depend on it ("once the coffee's here we'll head out"). The actual drink only matters when you reach the moment that physically requires it.

AsyncFC does exactly this for the decoder. When the model emits a tool call, the harness immediately yields a typed placeholder token — written something like ⟨fut1⟩ — back into the decoding stream and dispatches the real tool call asynchronously. The model keeps decoding. It can issue more tool calls, reason over the future by name, plan what it will do once the result arrives. When the tool resolves, the harness substitutes the real value for the placeholder before the next forward pass that actually depends on reading it. No weight update, no fine-tune, no special token vocabulary. The model just treats the placeholder as a regular token it can plan around.
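A minimal sketch of what such a harness loop could look like, written with Python's asyncio. Everything here is an assumption for illustration: decode_step (an async callable that returns either a plain token or a tool-call dict), run_tool, and the field names are stand-ins, not the paper's implementation. One simplification: the sketch splices a result in as soon as its tool resolves, rather than detecting exactly which forward pass first depends on it.

```python
import asyncio
from itertools import count

# Hypothetical harness loop: dispatch tools in the background, keep decoding
# past typed placeholders like ⟨fut1⟩, splice real values back in later.

async def run_tool(call: dict) -> str:
    """Stand-in for a real tool (search API, code sandbox, MCP server)."""
    await asyncio.sleep(call["latency_s"])
    return f"<result of {call['name']}>"

async def agent_loop(decode_step, max_tokens: int = 64) -> list:
    context, pending, ids = [], {}, count(1)     # pending: placeholder -> Task
    for _ in range(max_tokens):
        # Splice in any futures that resolved while we kept decoding.
        for fut_id, task in list(pending.items()):
            if task.done():
                context[context.index(fut_id)] = task.result()
                del pending[fut_id]

        token = await decode_step(context)       # one forward pass; never blocks on a tool
        if isinstance(token, dict) and token.get("type") == "tool_call":
            fut_id = f"⟨fut{next(ids)}⟩"
            pending[fut_id] = asyncio.create_task(run_tool(token))  # async dispatch
            context.append(fut_id)               # placeholder enters the decode stream
        elif token == "<eos>":
            # The final answer is the step that truly needs every value:
            # block here, and only here, on whatever is still unresolved.
            for fut_id, task in pending.items():
                context[context.index(fut_id)] = await task
            break
        else:
            context.append(token)
    return context
```

The property that matters is the ordering: the dispatch happens in the background, decoding never waits on it, and the loop blocks only at the point where a still-unresolved value is actually required.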

The catch — and this is where AsyncFC earns its name — is that the model must be able to reason over a not-yet-resolved future without crashing. The paper's empirical claim is that current LLMs already do this. The placeholder is typed (it's known to be, say, a search-results list or a numeric answer), and that type is enough for the model to keep generating plans that condition on the future without trying to materialize it. The savings show up at the agent layer: when several tool calls are independent of each other, the agent emits the whole chain of dispatches up front and decodes through them while the eligible tools run in parallel. The dependency structure of the task — which future is needed when — sets the only true critical path.
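One way to picture the typed placeholder, as a hedged sketch rather than anything from the paper: the context carries a name and a result type, never the value, and that pair is what the model plans around.

```python
from dataclasses import dataclass

# Hypothetical rendering of a typed symbolic future. The model sees only the
# name and the declared type; the harness owns the eventual value.

@dataclass
class SymbolicFuture:
    name: str            # e.g. "fut1"
    result_type: str     # e.g. "list[SearchResult]" or "float"

    def as_token_text(self) -> str:
        # What the decoder reads in place of the not-yet-resolved value.
        return f"⟨{self.name}: {self.result_type}⟩"

# Independent dispatches can be emitted up front and decoded through while
# the eligible tools run in parallel.
plan = [
    SymbolicFuture("fut1", "list[SearchResult]"),
    SymbolicFuture("fut2", "float"),
]
print(" ".join(f.as_token_text() for f in plan))
# ⟨fut1: list[SearchResult]⟩ ⟨fut2: float⟩
```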

Where the wall-clock time actually goes

The cost-and-latency profile of an agent — covered in the Agent Engineering track's Cost & Latency module — is dominated by two big buckets: token decode and tool execution. As a typical industry rule of thumb (not a paper claim), a single decode step runs in the tens of milliseconds while a tool call (a search API, a code-execution sandbox, an MCP server) runs in the hundreds of milliseconds to several seconds. A synchronous loop adds the two buckets. AsyncFC overlaps them.

Picture a toy 12-token response that emits two tool calls at positions 4 and 8 — say, a search and a calculation. Hold token decode at 500 ms each and tool execution at 2.5 s each for illustration. The sync loop walks like this: 4 tokens (2 s), tool 1 (2.5 s), 4 tokens (2 s), tool 2 (2.5 s), 4 tokens (2 s) — 11 seconds total, with the GPU idle for 5 of them. AsyncFC walks like this: 12 tokens continuously (6 s), with tool 1 running in parallel from t=2 s to t=4.5 s and tool 2 running from t=4 s to t=6.5 s — 6.5 seconds total, with the GPU never stopping. Same model, same tools, same final answer; the wall-clock difference is the time saved.
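The arithmetic is easy to check. A few lines under the same assumed numbers (500 ms per token, 2.5 s per tool, calls after tokens 4 and 8) reproduce both totals:

```python
# All numbers are the illustrative assumptions above, not measurements.
TOKEN_S, TOOL_S = 0.5, 2.5
CALL_POSITIONS = [4, 8]          # tool calls emitted after tokens 4 and 8
TOTAL_TOKENS = 12

# Synchronous loop: decode time and tool time are strictly additive.
sync_total = TOTAL_TOKENS * TOKEN_S + len(CALL_POSITIONS) * TOOL_S   # 11.0 s
gpu_idle   = len(CALL_POSITIONS) * TOOL_S                            # 5.0 s of idle decoder

# AsyncFC: decode never pauses; each tool starts when its call token lands
# and ends TOOL_S later; wall clock is whichever finishes last.
decode_end  = TOTAL_TOKENS * TOKEN_S                                 # 6.0 s
tool_ends   = [p * TOKEN_S + TOOL_S for p in CALL_POSITIONS]         # [4.5, 6.5]
async_total = max(decode_end, *tool_ends)                            # 6.5 s

print(sync_total, gpu_idle, async_total)                             # 11.0 5.0 6.5
```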

This pattern is not unique to AsyncFC. Anthropic's Model Context Protocol shipped a structurally similar idea on May 15, 2026 with SEP-2663: MCP servers can return a Task handle instead of a blocking result, and the client polls it via tasks/get. AsyncFC is the model-side counterpart — what the decoder does while it's waiting. Both pieces want the same thing: stop letting tool latency dictate agent latency.

The boundary of what AsyncFC can speed up is the true data dependency between tool calls. If every tool call's input depends on the previous tool's output, you can't overlap them — the buzzer for drink #2 can't fire until you know what drink #1 was. But in practice agentic workflows are full of parallelizable tool dispatches: search this, search that, look up the user, fetch the schema. Those are the calls AsyncFC compresses, leaving a critical path only as long as the slowest tool.
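To make that boundary concrete, here is a toy asyncio comparison with made-up tools and latencies: independent dispatches finish in roughly the time of the slowest one, while a chain whose every input is the previous output can only run end to end.

```python
import asyncio
import time

# Illustrative only: search() stands in for any tool, and the latencies are invented.

async def search(query: str, latency_s: float = 1.0) -> str:
    await asyncio.sleep(latency_s)
    return f"results for {query!r}"

async def independent() -> float:
    t0 = time.perf_counter()
    # No call needs another's output: dispatch all at once.
    await asyncio.gather(search("this"), search("that"), search("user profile"))
    return time.perf_counter() - t0      # ~1 s: critical path = slowest tool

async def dependent() -> float:
    t0 = time.perf_counter()
    # Each input is the previous output: nothing can overlap.
    a = await search("drink #1")
    b = await search(a)
    await search(b)
    return time.perf_counter() - t0      # ~3 s: latencies add up

print(asyncio.run(independent()), asyncio.run(dependent()))
```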

Goes deeper in: Agent Engineering → Cost & Latency → Parallelizing Tool Calls

Frequently Asked Questions