Cost & Latency Engineering for AI Agents
The Cost Profile of an Agent
Where does an agent's cost actually go?
Almost entirely into input tokens — because every tick replays the full conversation. Output tokens are a smaller share. Tool wall-clock adds a constant per tick that compounds across the loop. Module 1 made the harness durable. Module 2 made the runs legible. Module 3 made the failures safe. This module is about making the bill defensible — knowing where the dollars and the milliseconds go before you start optimizing, because the wrong fix on the wrong tick is just clever waste.
The shape this module argues for: profile first, optimize one bottleneck at a time, and re-profile. The four levers (prompt caching, result caching, parallel tool calls, batching) are each a hammer that fits one of three nails — long inputs, too many ticks, or one slow tool. The diagnosis decides the lever, not the other way around.
Per-tick cost = inputs + outputs + tools
Each model call in the agent loop is one billable round-trip with a fixed shape:
```
per_tick_cost    = input_tokens  × input_price
                 + output_tokens × output_price
                 + tool_cost

per_tick_latency = model_latency
                 + tool_latency
                 + harness_overhead
```
The model is stateless. Every tick sends the entire prior conversation back to the provider, plus the system prompt, plus the tool definitions, plus the latest tool result. There is no "session" the provider keeps for you — there is only the request body, and the request body grows monotonically.
At Anthropic Claude Sonnet 4 pricing as of early 2026 — input around $3 per million tokens, output around $15 per million — a single tick with a 3,000-token system prompt, 1,500 tokens of tool definitions, 800 tokens of conversation history, and a 200-token model response costs roughly:
```
input  = (3000 + 1500 + 800) × $3  / 1,000,000 ≈ $0.0159
output =                200  × $15 / 1,000,000 ≈ $0.0030
total  ≈ $0.0189
```
About four-fifths of the per-tick bill is input. That ratio is the structural fact this module exploits. Other providers price differently — OpenAI's pricing page lists similar magnitudes for the GPT-4 family — but the ratio holds for any agent that drags a meaningful system prompt and tool set behind it.
The replay tax: cost grows superlinearly in ticks
Now run the same agent for ten ticks. Conversation history at tick 10 includes every prior message and every prior tool result — call it 1,800 tokens of accumulated turns. The system prompt and tool defs (4,500 tokens) re-ship on every single call. Total input tokens across the run:
| Tick | System + tools | History | Per-tick input | Cumulative input |
|---|---|---|---|---|
| 1 | 4,500 | 200 | 4,700 | 4,700 |
| 5 | 4,500 | 1,000 | 5,500 | ~26,000 |
| 10 | 4,500 | 1,800 | 6,300 | ~58,000 |
| 20 | 4,500 | 3,400 | 7,900 | ~143,000 |
Cumulative cost does not scale linearly. The system prompt is sent ten times at tick 10 — the same 4,500 tokens leaving your wallet again and again. By tick 20 you have spent more on resending the system prompt (90,000 tokens) than on every per-turn token combined. That is the lever Step 2 (prompt caching) targets directly.
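A few lines make the replay-tax arithmetic runnable; the token counts, the linear history growth, and the Sonnet-class prices are illustrative assumptions that roughly reproduce the table above:

```ts
// Illustrative assumptions: a 4,500-token stable prefix, ~180 new history
// tokens per tick, a 200-token response per tick, Sonnet-class prices.
const STABLE_PREFIX = 4_500;
const INPUT_PRICE = 3 / 1e6;   // $ per input token
const OUTPUT_PRICE = 15 / 1e6; // $ per output token

let cumulativeInput = 0;
let cumulativeCost = 0;
for (let tick = 1; tick <= 20; tick++) {
  const history = 200 + (tick - 1) * 180;      // grows every tick
  const inputTokens = STABLE_PREFIX + history; // the whole prefix replays
  cumulativeInput += inputTokens;
  cumulativeCost += inputTokens * INPUT_PRICE + 200 * OUTPUT_PRICE;
  if ([1, 5, 10, 20].includes(tick)) {
    console.log({ tick, inputTokens, cumulativeInput, cost: cumulativeCost.toFixed(4) });
  }
}
```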
The pattern recurs in every production trace: the largest line item on a long agent run is "system prompt, billed thirty-one times."
Latency has the same shape, on a different axis
Cost is one column of the spreadsheet. Latency is the other, and it has the same compounding logic:
```
wall_clock(task) = Σ (model_latency_tick_i + tool_latency_tick_i) + harness_overhead
```
A 700 ms model call and a 500 ms average tool call is 1.2 seconds per tick before the harness's own scheduling, retries, and policy checks (Module 3) add anything. A six-tick task is seven seconds even when everything works. The user is sitting there. The p95 task duration metric from Module 2's golden metrics is the production-side measurement of this number, and it is the one product complaints arrive about.
Three optimization moves change the latency math differently:
- Prompt caching (Step 2) cuts time-to-first-token on every tick, because the provider has already done most of the prefill.
- Parallel tool calls (Step 4) collapse a tick's tool wall-clock from sum(latencies) to max(latencies) when the tools are independent.
- Result caching (Step 3) and batching (Step 5) cut tool latency itself.
Each lever is a different column of the cost-and-latency matrix. Picking the wrong one wastes engineering time on a problem the agent does not have.
The bottleneck rule: diagnose before you optimize
Profile the agent's actual cost per task by stage and the bottleneck is almost always one of three patterns. Each has a different fix.
| Pattern | What you see in the profile | Primary fix |
|---|---|---|
| Long inputs | Input slice is >70% of the bar on every tick; system prompt + tool defs dominate | Prompt caching (Step 2); trim tool defs you do not use |
| Too many ticks | Bar count is large; per-tick cost is reasonable; agent loops 12 times to do a 3-tick job | Better stop conditions; planning step; reflection from Foundations M6 |
| One slow tool | One stage of one tick dominates tool_latency; cost stays low but wall-clock balloons | Result caching (Step 3); parallel tool calls (Step 4); fix the tool |
The diagnostic step matters because the optimizations are not free. Prompt caching adds a write-once cost penalty (Step 2). Result caching introduces staleness bugs (Step 3). Parallel tool calls fan out concurrency at the cost of rate-limit risk (Step 4). Batching adds a window-of-delay tax and a non-trivial dataloader layer (Step 5). Every fix you ship is a complexity loan you take out against the agent's reliability budget. Pay it down on the lever that matches the bottleneck, not the one that was easiest to land in a sprint.
Module 1's harness gave you the per-tick view for free
A short callback to anchor where the profile data even comes from. Module 1's harness wrote one durable record per tick — the model call, the tool calls it issued, the result. Module 2's spans attached a structured trace to each of those records, with input_tokens, output_tokens, tool_latency_ms, and a stage label. The cost-by-stage chart Stage 1 of the sim draws is exactly what you get from grouping those spans by tick and stage. You already have the data; the question is whether you are looking at it.
The five golden metrics include cost_per_task for the same reason. If the metric is moving and you cannot say which stage's slice grew, the dashboard is information without insight. Profiling is what turns the number into a fix.
Where this sits relative to Foundations
Track A's Capstone plotted total cost across three designs of the same task — a deterministic workflow, a single retrieval-augmented call, and a tool-using agent. The point of that exercise was to show that the choice of design is the largest cost lever; the agent column was the most expensive of the three for a reason. This module zooms inside that agent column. Once you have decided "yes, this should be an agent," the four levers in Steps 2 through 5 are what keep the column from being five times more expensive than it has to be.
The four production agents I have seen with the lowest cost per task all did the same two things before any optimization: they shipped the per-tick cost profile to a dashboard, and they put a number on each bar of it for a typical task. Everything in this module follows from being able to look at that chart and say "this slice is the bottleneck, that slice isn't."
Try it: Stage 1 of the right pane is a per-tick stacked-bar profile of an agent run. Drag the tick count slider from 3 to 20 and watch the input slice (system prompt + tools, replayed every tick) eat the whole chart — that is the replay tax in graphical form. Then drag the system prompt size slider down to 500 tokens and watch the ratio flatten. The point is not to memorize numbers; it is to feel which slice grows under which knob, because the next four steps each attack a different slice.
Prompt Caching
What is prompt caching?
Prompt caching is provider-side caching of the prefix of your model input, so repeated calls with the same prefix pay a small fraction of the input price for the cached portion. You mark a cache breakpoint after the long stable section of your prompt; everything before it gets cached on the first call and re-used on subsequent calls; everything after it is billed at the full input rate every time. For an agent — where the same system prompt and tool definitions ship on every tick — this is the single largest lever on the cost profile from Step 1.
The mechanics differ by provider but the shape is the same. Anthropic's prompt caching is opt-in via cache_control blocks at the breakpoint; first write costs roughly 25% more than uncached input, subsequent hits cost roughly 10% of base input — Anthropic markets it as "up to 90% cost reduction and 85% latency reduction" on the cacheable portion. OpenAI's prompt caching is automatic for prompts at or above 1,024 tokens and applies a flat ~50% discount on the cached prefix with no marker required. Exact percentages drift — check the linked docs before pricing a budget — but the structural fact is stable: the cache is much cheaper than the original write, and the cached portion is whatever is byte-identical to a recent prior call.
The mechanics: a breakpoint splits "stable" from "per-turn"
A typical agent prompt has four logical segments, in order:
```
┌────────────────────────────────────────────┐
│ 1. System prompt (~3,500 tokens)           │  stable across the whole task
├────────────────────────────────────────────┤
│ 2. Tool definitions (~1,500 tokens)        │  stable across the whole task
├────────────────────────────────────────────┤
│ 3. Conversation history (grows each tick)  │  stable for prior turns
├────────────────────────────────────────────┤
│ 4. Latest user/tool turn (~200 tokens)     │  fresh every tick
└────────────────────────────────────────────┘
                      ▲
                      │ cache breakpoint goes here-ish
```
The cache key is the byte-exact prefix up to the breakpoint. On tick 1, the provider writes the entire prefix into its cache and bills the write. On tick 2, if the prefix is identical, the provider charges only the cache-hit rate for those tokens and the full input rate for the new suffix. The arithmetic that makes the lever so large:
- Without caching, a 5,000-token stable prefix at $3/M tokens (Anthropic Sonnet 4) costs $0.015 per tick. Across ten ticks, that is $0.15.
- With caching — one write at 1.25× and nine hits at 0.1× — the same prefix costs roughly (5,000 × $3 × 1.25 + 9 × 5,000 × $3 × 0.1) / 1,000,000 = $0.0188 + $0.0135 ≈ $0.032. About one-fifth of the uncached cost.
The 80% reduction is reasonable across realistic agent runs. The latency reduction is the same arithmetic from a different angle: the provider already has the cached prefix's intermediate state ready, so prefill is short-circuited and time-to-first-token drops accordingly.
Where to place the breakpoint
The rule is simple and the failures are loud when you get it wrong. Put the breakpoint after the long stable prefix and before the per-turn content. For most agents that means: after the system prompt, after the tool definitions, before the conversation history (or, if you are sure the history of prior turns is stable in this task, after the older turns and before the latest one — Anthropic allows up to four breakpoints).
| Marker position | Cached tokens per tick | Notes |
|---|---|---|
| None (no marker) | 0 (Anthropic) / auto if ≥1024 (OpenAI) | Anthropic: full price every tick. OpenAI: auto, 50% discount kicks in for long prefixes. |
| After system prompt only | ~3,500 | Misses the 1,500-token tool definitions; one tier of savings. |
| After system + tools | ~5,000 | The standard placement. Captures the longest stable block. |
| After system + tools + older history | ~5,000 + history | In theory caches the previous tick's full conversation; in practice each new tick grows the history and the marker bytes shift, so a single breakpoint here is re-paid every call. Anthropic supports up to four cache_control blocks, which lets disciplined teams layer breakpoints — most stop at "after tools" because it captures most of the win at the lowest complexity. |
The mistake to avoid is placing the breakpoint inside the per-turn content. The model API treats the cache key as a strict prefix — if a single token before the breakpoint differs from the previous call, the entire cache is invalidated and you pay the write penalty all over again. A marker placed after the latest user message is a marker that hits on zero subsequent ticks.
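A minimal sketch of the "after system + tools" placement against the Anthropic TypeScript SDK. Per the caching docs the prefix is serialized tools first, then system, then messages, so one marker on the system block covers both the tool definitions and the system prompt. LONG_SYSTEM_PROMPT, STABLE_TOOLS, and conversationSoFar are placeholders, and exact field shapes should be checked against the current docs:

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// The cache_control marker sits on the last stable block; everything up to and
// including it is the cached prefix, and the per-turn messages after it are
// billed at the full input rate.
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",          // assumption: any cache-capable model
  max_tokens: 1024,
  tools: STABLE_TOOLS,                        // stable, serialized before system
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,               // stable across the whole task
      cache_control: { type: "ephemeral" },   // breakpoint: after system + tools
    },
  ],
  messages: conversationSoFar,                // fresh every tick
});

// The usage block reports how much of the prefix was written vs read.
console.log(
  response.usage.cache_creation_input_tokens,
  response.usage.cache_read_input_tokens,
);
```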
What invalidates the cache
Anything that changes the bytes before the breakpoint. The provider does not do fuzzy matching, semantic matching, or "close enough" — it does an exact prefix hash. The common ways to silently lose your cache:
- The system prompt changes. Even a one-word edit invalidates the prefix; the cache is keyed on the full byte stream.
- A new tool is added or an existing one's schema is edited. Tool definitions are part of the prefix.
- The model version changes. Different model = different cache namespace.
- The tool definitions are emitted in a different order. This is the most common own-goal. If your harness iterates a Set or a dict whose iteration order varies between calls, the JSON for your tool definitions changes byte-for-byte and the cache misses every time. Pin tool order in code — sort by name at the boundary, every time.
- Whitespace differences. Re-formatting the system prompt to "look nicer" invalidates every cached prefix in production.
A useful diagnostic: log the cache_creation_input_tokens and cache_read_input_tokens fields from the Anthropic response (OpenAI surfaces a similar cached_tokens field). The ratio of read-to-write across an agent's tick stream is the cache-hit rate; if it is not climbing into the 80–95% range across a normal task, something is invalidating your prefix and Module 2's traces are where you find what.
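A sketch of that aggregation, assuming the harness records the usage block from every tick's response (field names follow Anthropic's API; OpenAI exposes a similar cached-token count on its usage object):

```ts
// Running cache-hit rate across a task's ticks, computed from the usage fields
// the provider returns on every response.
type TickUsage = {
  cache_creation_input_tokens: number; // prefix tokens written this tick
  cache_read_input_tokens: number;     // prefix tokens served from cache
  input_tokens: number;                // uncached input (the per-turn suffix)
};

function cacheHitRate(ticks: TickUsage[]): number {
  const read = ticks.reduce((s, t) => s + t.cache_read_input_tokens, 0);
  const written = ticks.reduce((s, t) => s + t.cache_creation_input_tokens, 0);
  // Of all prefix tokens the provider touched, what fraction were reads?
  return read / Math.max(1, read + written);
}

// Healthy agent shape: one write on tick 1, reads on every later tick, so the
// ratio climbs toward 0.9+ across a normal task. If it does not, something is
// invalidating the prefix.
```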
TTL: the agent that idles past the cache
Provider caches have a TTL. Anthropic's default cache TTL is around 5 minutes, refreshed each time the prefix is read (with a longer-TTL variant available at higher write cost); OpenAI's prompt cache is automatic and typically persists for several minutes of inactivity, up to about an hour. Exact numbers drift; check the linked docs before sizing a budget.
The TTL matters because agents idle. An agent that handles a customer interactively might pause for the user to think, for a webhook to fire, or for a long-running tool to complete. If the idle exceeds the cache TTL, the next tick re-writes the cache at full price even though the prefix is identical. For tasks that routinely cross the TTL boundary, the extended-TTL variant is often worth the extra write cost — do the arithmetic on your own distribution.
The corollary: a high-throughput route with short gaps between ticks is the best case for caching, because the cache stays warm across the whole conversation. A low-throughput route with long pauses (an email-triage agent that fires once per hour per inbox) gets much less out of the lever and might need a different design entirely — pre-emptive cache warming, or a redesign that keeps the per-call prefix smaller.
Anti-patterns that quietly defeat the cache
The four most common ways production teams ship prompt caching and then discover, three weeks later, that the cost line did not move:
- Re-emitting tool definitions in non-deterministic order. Sort tools by name at the boundary between the harness and the API call. If you cannot reproduce the exact byte string of your prompt prefix from a fresh process, your cache hit rate is a coin flip.
- Per-user data in the system prompt. Injecting the user's name, account ID, or session ID into the system prompt means every user has their own cache key. Move per-user data after the breakpoint.
- Forgetting to mark the breakpoint at all (Anthropic). The cache is opt-in; without the cache_control marker, you get nothing. OpenAI does this automatically, which is convenient but harder to debug when it does not fire.
- Cache write counted as "the optimization." A pipeline that writes the cache on every conversation but never reads it is a pure cost regression — you are paying the 1.25× write penalty without amortizing it. The metric to track is read-to-write ratio, not write count.
The cache becomes load-bearing in the cost model fast. A team that turns prompt caching on without auditing for these anti-patterns often sees a few percent improvement and concludes "caching does not help much for our shape," when the actual situation is that caching is invalidated on 95% of ticks for one of the reasons above.
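The first anti-pattern above is also the cheapest to fix: make the tool-definition byte stream deterministic at the serialization boundary. A minimal sketch (the ToolDef shape is an assumption, not any particular SDK's type):

```ts
// Pin the tool-definition byte stream: sort by name before serializing, so a
// fresh process emits an identical prefix on every call.
type ToolDef = { name: string; description: string; input_schema: object };

function stableTools(tools: Iterable<ToolDef>): ToolDef[] {
  return [...tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}
```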
Where this fits relative to the loop
Prompt caching is the optimization that targets Step 1's "long inputs" bottleneck. It does nothing for "too many ticks" — each tick still pays the cache-hit rate per stable token plus the full rate per fresh token, and a runaway agent paying the cache-hit rate eighty times is still spending eighty ticks of money. It also does nothing for "one slow tool" — the tool's wall-clock is not the model's prefill, and caching cuts model latency, not tool latency.
The order of operations in the next three steps follows from that observation: cache the input you have to send (this step), cache the tool results you would otherwise re-compute (Step 3), parallelize the tool calls inside one tick (Step 4), and batch across concurrent agents (Step 5). Each one is a different column.
Try it: Stage 2 of the right pane is a horizontal prompt diagram with the four segments laid out — system (3,500 tokens), tools (1,500 tokens), history, and the latest turn. Click to place the cache marker at each of the four positions and watch the side-by-side $ panels update — the "after system + tools" position is the sweet spot for most agent shapes. Then flip the "tools change between calls" toggle: notice that the "after system" marker still helps (tools sit past it, so the cached bytes are unaffected), but "after tools" collapses to near-uncached. That is the tool-order anti-pattern in graphical form — a marker can only protect bytes that are both before it AND stable.
Result Caching
What is result caching?
Result caching is a harness-level cache keyed on (tool_name, args) that returns a prior tool result without re-running the tool, for as long as a TTL says the result is still fresh. Step 2's prompt cache is provider-side and lives outside your code. This cache is yours — it sits inside the harness, you write it, you own its bugs, and the bugs are the kind that ship silently wrong answers to users. The two caches solve different problems: prompt caching cuts what you pay to send the request, result caching cuts what you pay to compute the answer.
The lever fits the "one slow tool" bottleneck from Step 1. An agent that calls lookup_order(id=A-1234) three times in one conversation should not hit your order database three times. The first call pays the wall-clock and the dollars; the next two should be free reads from a short-lived in-process or shared cache.
The mechanics: hash the call, store the result, gate on TTL
The data structure is small and the algorithm is one screen of code:
```
key = hash(tool_name + canonical_json(args))
cached = cache.get(key)
if cached and now - cached.ts < tool_ttl(tool_name):
    span.set("cache", "hit")
    return cached.value

result = await tool.run(args)
cache.set(key, {value: result, ts: now})
span.set("cache", "miss")
return result
```
Two things to notice in that snippet. First, the cache key is the canonical JSON of the arguments — sorted keys, no incidental whitespace, no undefined vs missing-field ambiguity. The classic foot-gun is {"a":1,"b":2} and {"b":2,"a":1} hashing differently and missing obvious hits. JavaScript's JSON.stringify does not sort keys; you have to do it explicitly. Second, the TTL is per tool, not global. A 30-second TTL is wrong for lookup_order and right for weather; a 6-hour TTL is right for lookup_company_profile and catastrophic for current_balance.
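A minimal canonical-key sketch in TypeScript; the function names are illustrative, and the hash and serialization choices are one reasonable option among several:

```ts
import { createHash } from "node:crypto";

// Canonical JSON: recursively sort object keys and drop undefined values so
// {"a":1,"b":2} and {"b":2,"a":1} produce the same byte string.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJson).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .filter(([, v]) => v !== undefined)
      .sort(([a], [b]) => (a < b ? -1 : 1))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalJson(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function cacheKey(toolName: string, args: unknown): string {
  return createHash("sha256")
    .update(`${toolName}:${canonicalJson(args)}`)
    .digest("hex");
}
```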
Module 2's spans are how this becomes legible — every tool-call span gets a cache=hit|miss|stale attribute, and the golden metrics dashboard can plot hit-rate per tool over time. Without that visibility you have a cache whose effects you cannot measure, and the next time the cost metric moves you will have no way to attribute the change.
When it pays: expensive tools called with repeated args
Agent loops are repetitive. The same conversation will look up the same user, the same order, the same product catalog entry over and over again as the model re-references context it has half-forgotten. The shapes where result caching pays for itself in cost and latency:
- Database lookups by ID. lookup_order, lookup_user, lookup_product. The result is stable across many seconds; the cost of the DB round-trip is real.
- Vector-store retrieval. Embedding the query is cheap, but the top-K read against your vector DB is not — and if the agent re-issues the same semantic query (paraphrased identically by the model) twice, the cache shape covers it.
- External HTTP calls with stable answers. Currency rates that change once an hour, weather that changes once every fifteen minutes, geocoding a constant address.
- Computed answers that are deterministic. Anything pure — a regex match against a fixed corpus, a price calculator with no clock dependence.
The cost case is straightforward (one round-trip beats N). The latency case can be larger: a 300 ms DB lookup avoided on three subsequent ticks saves nearly a second of wall-clock from a five-tick task — visible in the user's perception, not just on the bill.
When it lies: time-sensitive tools and silent staleness
The defining hazard of result caching is that stale answers do not announce themselves. The agent gets a number back from the cache, plugs it into its reasoning, and proceeds with full confidence that the number is current. If current_balance(uid=A-9) returned $42 thirty seconds ago and the user just made a $100 withdrawal, the cache will cheerfully serve $42 again and the agent will conclude the user can afford a transaction they cannot. There is no signal in the model's context that the value is stale. There is no exception, no warning, no field labeled "this is a guess." There is just a wrong number.
The categories that should default to "do not cache" or "very short TTL":
| Tool category | Suggested TTL | Why |
|---|---|---|
| current_balance, account_status | 0 — do not cache | Mutates on every user action; stale = wrong number, downstream destructive. |
| current_time, now() | 0 — do not cache | The whole point is freshness. Cache invalidates the tool's purpose. |
| weather, currency_rate | ~5 min | Changes on a known cadence; small TTL is safe. |
| lookup_order (after creation) | ~1 hour | State changes are infrequent and you can invalidate on write. |
| lookup_company_profile | ~24 hours | Effectively static for the conversation; a stale company name is rare and recoverable. |
| Vector retrieval (stable index) | ~hours | Until the index re-indexes; tie the TTL to the indexer cadence. |
The numbers in that table are starting points, not law. The right TTL is whatever your data freshness contract says it is — if Finance commits to "balances are current within 60 seconds," your current_balance cache TTL is at most 60 seconds, and probably zero. Per-tool TTL beats a single global TTL because the cost of being wrong is per-tool, not per-cache.
The cache key is the contract
A cache that misses obvious hits is annoying. A cache that hits obvious misses is a bug. Both come from the cache key.
- Sort the JSON keys. canonical_json({a:1, b:2}) and canonical_json({b:2, a:1}) must produce the same byte string. Use a library — every language has one — and do it at the cache boundary, not at the call site.
- Include every argument that affects the answer. Missing a currency argument from the key means EUR and USD requests collide. The hash should fingerprint the request fully.
- Exclude arguments that do not affect the answer. A request-ID, a correlation token, a per-call timestamp — these inflate the key space and tank the hit rate.
- Include the tool version when it matters. If you ship a v2 of compute_shipping(...) that returns a different shape, you do not want v1's cached results served to v2 callers. Add the tool version to the key prefix.
- Include the user / tenant when answers are user-scoped. lookup_order(A-1234) had better not return tenant A's order to tenant B because the cache forgot whose query it was. This is the worst possible failure mode in a multi-tenant system; the key must include the tenant boundary even if the tool's arg list does not.
The last point is where multi-tenancy meets caching and the system catches fire. A cache that conflates tenants is a data leak waiting for a regulator. Either scope the cache per tenant at the storage layer, or include the tenant ID in every key — explicitly, in code, with a test that asserts cross-tenant requests hash differently.
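That test is cheap to write. A sketch, reusing the hypothetical cacheKey helper from the earlier snippet and folding the tenant and tool version into the hashed string:

```ts
import assert from "node:assert";

// Assumption: cacheKey is the helper sketched earlier in this step.
declare function cacheKey(toolName: string, args: unknown): string;

function scopedCacheKey(tenant: string, tool: string, version: string, args: unknown) {
  return cacheKey(`${tenant}:${tool}:${version}`, args);
}

// Identical calls from two tenants must not collide, even though the tool's
// own argument list never mentions the tenant.
assert.notStrictEqual(
  scopedCacheKey("tenant-A", "lookup_order", "v2", { id: "A-1234" }),
  scopedCacheKey("tenant-B", "lookup_order", "v2", { id: "A-1234" }),
);
```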
Invalidation strategies, from lazy to disciplined
There are three patterns for invalidating the result cache, and they are ordered by effort and quality:
- TTL (the lazy default). Pick a number, accept that the cache may be wrong for up to that number of seconds, monitor staleness incidents. Easy to implement, easy to reason about, and the only practical option when the harness does not control the downstream system.
- Event-driven invalidation (write-through). When the harness performs a write that mutates a cached value, it explicitly clears the affected keys. issue_refund(order=A-1234) invalidates lookup_order(A-1234). Requires the tool layer to know what it touched.
- Push-based invalidation (downstream notifies). The downstream system (the database, the inventory service) publishes change events; the harness subscribes and evicts matching keys. The most accurate, the most operationally complex, and overkill for most agents.
Most teams start with TTL alone, escalate to event-driven for the two or three tools that are obviously mutating, and never get to push-based. That is fine. The discipline is not "use the most sophisticated pattern"; it is "pick a pattern per tool and document it in the same file as the TTL," so the next engineer who touches the agent can see why lookup_order is 1 hour and current_balance is zero.
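A sketch of the event-driven (write-through) option, assuming a small mutation-to-lookup map maintained alongside the tool definitions; tools, cache, and cacheKey are placeholders for the harness's own objects:

```ts
// Placeholders: the harness's tool registry, its result cache, and the
// cacheKey helper sketched earlier.
declare const tools: Record<string, { run: (args: any) => Promise<unknown> }>;
declare const cache: { delete: (key: string) => void };
declare function cacheKey(toolName: string, args: unknown): string;

// Each mutating tool declares which cached lookups it makes stale.
const INVALIDATES: Record<string, (args: any) => Array<[string, unknown]>> = {
  issue_refund: (args) => [["lookup_order", { id: args.order }]],
  update_address: (args) => [["lookup_user", { id: args.user_id }]],
};

async function runMutatingTool(name: string, args: any): Promise<unknown> {
  const result = await tools[name].run(args);
  // Write-through invalidation: evict the lookups this write dirtied.
  for (const [tool, lookupArgs] of INVALIDATES[name]?.(args) ?? []) {
    cache.delete(cacheKey(tool, lookupArgs));
  }
  return result;
}
```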
Operational concerns: size, eviction, and the dashboard
A few practical defaults worth knowing before this becomes a production incident:
- Size cap. An in-process LRU with a few thousand entries is usually plenty per pod; an out-of-process shared cache (Redis, Memcached) scales further but adds a network hop that has to be cheaper than the tool it replaces.
- Failure mode. If the cache layer itself is unavailable, the harness should fall back to running the tool — not to returning an error. The cache is an optimization, never a load-bearing dependency.
- Observability. Per-tool hit rate, p50 cache-read latency, count of evictions, count of staleness incidents reported downstream. Without these, you cannot tune TTLs from data; you tune them from vibes. Module 2's traces attach cache and tool_name attributes to every tool span, which is what makes the hit-rate-per-tool chart possible.
- Cost accounting. Hit rate × per-call cost is the dollars saved. Plot it next to the cost-per-task metric and you can defend the cache's existence in code review.
The dashboard is the part teams skip and regret. A cache without a hit-rate chart drifts toward "we think it works"; with the chart it drifts toward "here is the TTL we should tighten on weather because the hit rate on lookup_order is fine."
When result caching does not help
A short check before reaching for the lever. Result caching does not help when:
- Tool arguments rarely repeat within a session. Highly personalized queries, free-text searches, unique IDs per call.
- The tool is already cheap. A 5 ms in-memory lookup with no I/O is not worth caching; the cache lookup itself is comparable cost.
- The result is enormous. Caching a 50 MB response in a process-local LRU costs more memory than the tool costs to re-run.
- The agent does not loop. A single-tick workflow has nothing to cache across.
Steps 4 and 5 cover the levers for the cases this one does not. The four levers in this module are deliberately complementary: a tool that is too slow but unique gets parallelism (Step 4); a tool that is fast but called from many concurrent agents gets batching (Step 5).
Try it: Stage 3 of the right pane is a 12-call stream across five tools — lookup_order (×3, stable), weather (×3, slow churn — Tokyo and NYC), current_balance (×3, fast churn), current_time (×2, always fresh), and search (×1) — with a single global TTL slider. Set TTL to 1 hour and watch the current_time calls flip to STALE — the agent now believes the wrong second. Drop TTL to 10 seconds and the lookup_order hit rate craters even though the order data has not actually changed. The teaching is what one dial cannot do: five freshness contracts, one knob. The fix in production is per-tool TTL — the sim shows the naive single-TTL design failing on purpose so you can feel why the safe design has to differ per tool.
Parallelizing Tool Calls
When can you parallelize tool calls?
When the model emits multiple tool calls in a single tick and none of them depend on each other's outputs. Modern provider APIs — Anthropic's tool use, OpenAI's parallel function calling — let the model produce two or three tool_use blocks in one assistant response. The harness's job is to dispatch them. Serial dispatch awaits each call before starting the next; parallel dispatch fires them concurrently. For independent calls, parallel collapses the tick's tool wall-clock from sum(latencies) to max(latencies), and the user feels it.
This is the lever for Step 1's "one slow tool" bottleneck when the slow tool happens alongside other tools the model wanted to call anyway. It does not help when the slow tool is the only call in a tick — that is what result caching (Step 3) or fixing the tool is for. It also does not help across ticks, because each tick's tool depends on the model's next decision, which has not happened yet.
The wall-clock math
The arithmetic that makes parallelism a clear win on the right shape of tick:
```
serial_wall_clock   = latency_A + latency_B + latency_C
parallel_wall_clock = max(latency_A, latency_B, latency_C)
```
For three calls of 300 ms, 500 ms, and 700 ms — a realistic mix for lookup_user, lookup_history, lookup_balance against the same record — that is 1,500 ms serial vs 700 ms parallel. The saved 800 ms is more than half the tick's tool wall-clock, and on a six-tick task it compounds. The user-perceived latency from Module 2's golden metrics — p95 task duration — usually has one fat tail dominated by exactly these multi-call ticks.
Cost is unchanged. You pay for the tokens regardless of dispatch order; you pay for each tool call regardless of dispatch order; the API charge is identical. Parallel tool calling is a pure latency lever, not a cost lever. (Step 5's batching is the lever that moves cost via wall-clock.)
The independence check
The decision "can these two calls run in parallel?" reduces to "do tool B's arguments depend on tool A's output?" If yes, the calls form a chain and the chain has to be serial — B cannot begin until A returns. If no, the calls are independent and both can fire at once.
The model usually surfaces independent calls organically when it needs related lookups about a single entity. Asked "what is the status of user A-9's most recent order?", the model often emits three parallel calls — lookup_user(A-9), lookup_recent_orders(A-9), lookup_account_status(A-9) — because they share the same input. Asked "find the user whose email is jane@example.com and tell me their balance", it has to chain: lookup_user_by_email("jane@…") first, then lookup_balance(returned_user_id).
A useful pattern in the system prompt: encourage the model to emit independent calls together when it can. Something like "When you need multiple lookups that do not depend on each other's results, request them in the same tool call response so they can run in parallel." The model is good at this with a single sentence of guidance; it is mediocre at it with no guidance, leaving wall-clock on the floor.
Mixed cases are the interesting ones. A tick might emit four calls where three are independent and one chains on the first. The right shape is partial parallelism: fan out the three, await all, then issue the dependent fourth.
```
tick begins
 ├─ lookup_user(A-9) ────┐
 ├─ lookup_history(A-9)  │   all three fire at once
 └─ lookup_orders(A-9) ──┘
            │
            ▼
 recommend_action(user, history, orders)   # waits for the fan-out
            │
            ▼
tick ends
```
The harness can recognize the dependency by inspecting the tool arguments — if recommend_action references the return value of lookup_user, it is downstream. In practice most provider APIs do not auto-detect this; the harness either fires everything the model emitted (and accepts that some calls will fan out trivially) or implements an argument-graph that orders calls by data dependency. Most teams start with the simpler version and add the graph only if profiling shows it matters.
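In harness code the partial-parallel shape is a fan-out followed by an awaited dependent call. A sketch with callTool standing in for the harness's dispatch function (happy path only; per-call error handling is covered in the next section):

```ts
// Placeholder for the harness's dispatch function.
declare function callTool(name: string, args: Record<string, unknown>): Promise<any>;

// Fan out the three independent lookups, then issue the dependent call with
// their results.
const [user, history, orders] = await Promise.all([
  callTool("lookup_user", { id: "A-9" }),
  callTool("lookup_history", { id: "A-9" }),
  callTool("lookup_orders", { id: "A-9" }),
]);

// Downstream of the fan-out: its arguments depend on the results above.
const action = await callTool("recommend_action", { user, history, orders });
```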
Failure modes
Parallelism does not come free. Three classes of failure to design around.
One tool errors mid-flight. In a serial dispatch, an error from call A short-circuits B and C — sometimes the right thing (B and C were going to be wasted work), sometimes wrong (B and C were independent and would have succeeded). In a parallel dispatch using something like Promise.all (which rejects as soon as any call rejects), one error discards the other calls' results even though they keep running. Use Promise.allSettled (or your language's equivalent) so each call's outcome is handled independently and a single failure does not invalidate the work of the others.
```js
const results = await Promise.allSettled([
  callTool('lookup_user', args1),
  callTool('lookup_balance', args2),
  callTool('lookup_history', args3),
]);
// each result is {status: 'fulfilled', value} or {status: 'rejected', reason}
// report failures back to the model per-call, not as a whole-tick failure
```
The model handles per-call failures gracefully when the harness reports them in the same shape as successful results — tool_use_id: ..., is_error: true, content: "..." is enough for most providers. The model will then either retry, route to a different tool, or apologize to the user, and the rest of the tick's work has been preserved.
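A sketch of that reporting step against the Anthropic tool_result shape, continuing from the allSettled snippet above; toolUses stands for the tool_use blocks the model emitted this tick, and the exact field names should be checked against the current docs:

```ts
// Turn the settled outcomes into per-call tool_result blocks for the next
// user message, so one failed branch does not discard the successes.
const toolResults = results.map((r, i) => ({
  type: "tool_result" as const,
  tool_use_id: toolUses[i].id,
  is_error: r.status === "rejected",
  content:
    r.status === "fulfilled" ? JSON.stringify(r.value) : String(r.reason),
}));

// Next request carries every result; the model decides how to react to the
// failed one: messages.push({ role: "user", content: toolResults });
```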
Rate limits arrive faster. A parallel fan-out of N calls hits the downstream rate limiter N times in the same instant. A serial dispatch of the same N calls spreads them across sum(latencies). For internal databases this is rarely a problem; for third-party APIs with strict per-second quotas, the rate-limit headroom can be the constraint that decides how much parallelism is safe. Cap concurrency at the harness — a semaphore of, say, 8 — so a model that emits 12 calls in one tick does not stampede.
Resource contention. Even when the downstream allows the load, the calling pod might not. Parallel HTTP requests share the local connection pool; parallel DB queries share the connection pool to the DB. A fan-out of 30 calls against a 10-connection pool serializes anyway, just opaquely. Match the harness's concurrency cap to the pool size below it.
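A concurrency cap is a few lines with a semaphore-style helper such as the p-limit package; the limit of 8 below is an illustrative number to be matched to the smallest pool or quota underneath, and toolCalls / callTool are placeholders for the model's emitted calls and the harness dispatch:

```ts
import pLimit from "p-limit";

// Cap fan-out concurrency so a 12-call tick cannot stampede a rate-limited
// downstream or exhaust the local connection pool.
const limit = pLimit(8);

const results = await Promise.allSettled(
  toolCalls.map((call) => limit(() => callTool(call.name, call.args))),
);
```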
Anti-patterns
A few patterns that look like parallelization but actually buy nothing or buy regression:
- Parallelizing across ticks. "We will run tick 5 in parallel with tick 4" is incoherent — tick 5's input is tick 4's output. The parallelism is inside a tick, never across them. Foundations Module 1's agent loop was explicit about this: each tick is one model decision and the next decision needs the prior result. Engineering teams that try to cheat this constraint usually end up issuing duplicate tool calls and confusing the model.
- Speculative parallelism. "Run all five plausible tools in parallel and use whichever the model would have chosen." Costs five times the tool spend on every tick, defeats result caching, and turns a small bill into a large one. The model decides; the harness dispatches. Do not invert that.
- Parallelizing a single call. If a tick has one tool call, there is no fan-out shape. Some teams add asynchronous wrappers and worker pools and discover, after the migration, that the only thing they changed was the complexity.
- Ignoring the dependency check. Firing two calls in parallel when the second needs the first's output produces a null argument or a model-visible error. The harness should detect the dependency (or trust the model to serialize calls when it sees the dependency) before fanning out.
Provider differences worth knowing
Both major providers support parallel tool calls but the surface differs slightly:
- Anthropic. The model emits multiple tool_use blocks in one assistant message. The harness returns each tool's result as a separate tool_result block in the next user message, addressing each tool_use_id. See Anthropic's tool use docs.
- OpenAI. The model emits a tool_calls array on the assistant message. The harness responds with multiple tool-role messages, each carrying a tool_call_id. See OpenAI's function-calling docs.
Exact JSON shapes drift — both APIs version their tool-use surfaces fairly often. The pattern (multi-call response, multi-result reply, addressed by ID) is stable across versions.
A second cross-provider note: model behavior on parallel emission varies. The same prompt asked of two different models can produce a single serialized chain in one and a parallel fan-out in the other. The system-prompt encouragement noted above moves the needle, but the most reliable signal is to log the number of tool_use blocks per assistant message in Module 2's per-tick spans and watch the distribution. If 90% of ticks emit exactly one tool call, parallelism is not your bottleneck and you should not invest the harness work; if 30% emit two or more, the lever is real.
Where this fits relative to the harness
Parallel dispatch is a property of the harness's tool-call site, not of the model API. The model produces a list of calls; what the harness does with that list is a code decision. That means parallelism interacts with everything Module 1 and Module 3 built around the tool-call site: idempotency (each parallel call still has its own idempotency key), the policy gate from Module 3 (the gate runs per call, in parallel), the result cache from Step 3 (cache reads happen before dispatch; misses fan out, hits short-circuit). Adding parallelism to a clean harness is a small code change; adding it to one that conflates the call site with the loop is the kind of refactor that takes a week.
The wall-clock saving compounds with caching in the right direction. A cache hit on one of three parallel calls collapses that branch to nearly zero; the parallel tick's wall-clock drops to max(remaining latencies). Caching and parallelism are not alternatives — they are complementary levers, and the sim's Stage 4 lets you feel the interaction.
Try it: Stage 4 of the right pane is one tick with three tools — lookup_user (300 ms), lookup_balance (700 ms), lookup_history (500 ms) — drawn as Gantt bars under "Serial" and "Parallel" timelines, with a wall-clock-saved number above. Drag a tool's latency slider and watch which timeline moves. Then check the "depends on tool above" box on the middle tool and watch the parallel timeline collapse back to a chain — the wall-clock saved drops to whatever parallelism remains across the unaffected branches. The point is to feel that independence is the precondition for the lever, not concurrency itself.
Batching at the Agent Layer
What is agent-layer batching?
Agent-layer batching is combining the same operation across many concurrent agent runs into one downstream call. Sixteen agents each handling a different customer all hit lookup_order(...) within the same hundred milliseconds; instead of sixteen round-trips to the database, a batcher in the harness collects the calls, fires one SELECT ... WHERE id IN (...) query, and fans the results back out to each agent. The cost moves because the per-call network and overhead amortize across a batch; the wall-clock moves because the downstream system services one query for the price of one. The complexity moves up too — batching is a non-trivial piece of harness — and the question this step is about is when the bill justifies the engineering.
This is the lever for high-throughput agents where the cost profile from Step 1 shows a flat per-tool cost slice that nonetheless dominates the bill because the call count is enormous. Steps 2 and 3 cut what each call costs; this step cuts how many calls there are.
Two flavors: tool-call batching and model-call batching
The two flavors are structurally similar — collect, flush, fan out — but they live at different layers.
Tool-call batching is the classic dataloader pattern, borrowed from the GraphQL world. The harness wraps a tool that takes a single ID and exposes the same interface to the agent (lookup_order(id)) while internally accumulating IDs across concurrent calls, flushing them as one batched downstream call, and routing each result back to the right caller. The agent does not know its call was batched. The downstream system sees fewer, larger requests.
Model-call batching is the other flavor. Several providers expose batch APIs for non-interactive workloads — OpenAI's Batch API and Anthropic's Message Batches both offer roughly a 50% discount on input and output tokens for requests submitted in bulk with relaxed latency SLAs (the batch typically completes within 24 hours). The shape is fundamentally different from interactive batching: you are giving up real-time response in exchange for the discount. The fit is offline workloads — overnight evals, bulk classification, document processing pipelines — not user-facing agents.
The rest of this step focuses on tool-call batching, which is the lever for live agents. Model-call batching is worth knowing about but the use case is narrower.
A primer on why batching works at all
A primer for readers who have not been through Track C's Module 8 (Batching). Batching helps when the downstream system has high fixed overhead per request and low marginal cost per added item. The classic case is GPU inference: loading the model weights into the SMs is expensive; running them across 32 user prompts is barely more expensive than running them across 1, because the weights are loaded once and the matmuls amortize across the batch. Memory bandwidth is the bottleneck, not arithmetic — the roofline model from the GPU track. Batching pushes a memory-bound kernel toward its compute-bound regime, where it becomes vastly more efficient per item.
Agent-layer tool batching is the same idea, one layer up. Issuing a database query has fixed overhead (network round-trip, connection acquisition, query parsing, plan caching) and low marginal cost per row returned. SELECT * FROM orders WHERE id IN (a, b, c, …) pays the round-trip once for any number of IDs in the list, up to the database's batch limits. The shape is identical to GPU batching; the medium is different.
The corollary is that batching only helps when the downstream actually has this shape. A tool that calls a third-party API with no bulk endpoint cannot be batched at the network layer — you would just be holding the calls in a buffer before issuing them serially. Batching belongs where the downstream offers a bulk interface.
The three batching windows
A batcher needs a flush trigger. Three patterns recur:
- Window-based. Open a batch for W milliseconds; everything submitted in that window flushes together. Simple, predictable, and adds up to W ms of latency to every call. Typical W is 5–50 ms — long enough to collect a useful batch under load, short enough not to feel like a stall.
- Size-based. Flush when the batch hits N items. Bursty, latency-variable, but maximally efficient when traffic is high. Typical N is 16–128 depending on the downstream's preferred batch size.
- Hybrid. Whichever trips first — flush when either W ms have passed OR N items have accumulated. This is the production default for most dataloaders, including DataLoader.js and similar libraries, because it adapts to both low-traffic and high-traffic regimes.
A typical hybrid configuration for a customer-service agent's order-lookup tool: W = 10 ms, N = 64. At 50 requests per second the window usually closes with only a call or two in it and flushes on time; at 5,000 requests per second roughly 50 calls accumulate per window, brushing the size cap, and the flush fires on whichever trigger trips first. In both regimes the added latency is bounded by W.
```
t=0ms   ┌──────────────────────────────────┐
        │  batch opens                     │
t=2ms   │  ← lookup_order(A-9)             │
t=5ms   │  ← lookup_order(A-12)            │
t=7ms   │  ← lookup_order(A-14)            │
t=10ms  │  window expires → flush          │
        │  SELECT ... WHERE id IN (...)    │
t=14ms  │  ← results return                │
        │  fan out to 3 callers            │
        └──────────────────────────────────┘
```
The fan-out is the part that quietly matters. Each waiting caller has its own future/promise; the batcher's flush has to resolve them all in one pass, mapping result rows back to caller IDs. Lose one mapping and you get a deadlock waiting for a result that never arrives. Most batching libraries handle this for you; rolling your own batcher needs a unit test that fires N concurrent calls and asserts each gets its own answer back.
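A minimal hybrid batcher sketch, written out rather than pulled from a library so the caller-to-result mapping stays visible; fetchOrders and the Order type are placeholders for the bulk downstream call, and production code should usually reach for DataLoader.js or an equivalent:

```ts
// Placeholders: the order row type and the bulk downstream call,
// e.g. SELECT ... WHERE id IN (...).
type Order = { id: string };
declare function fetchOrders(ids: string[]): Promise<Order[]>;

type Pending = {
  id: string;
  resolve: (value: Order) => void;
  reject: (reason: unknown) => void;
};

const WINDOW_MS = 10; // flush on time...
const MAX_BATCH = 64; // ...or on size, whichever trips first

let pending: Pending[] = [];
let timer: ReturnType<typeof setTimeout> | null = null;

// The agent-facing tool keeps its single-ID signature; batching is invisible.
export function lookupOrder(id: string): Promise<Order> {
  return new Promise<Order>((resolve, reject) => {
    pending.push({ id, resolve, reject });
    if (pending.length >= MAX_BATCH) {
      void flush();                                      // size trigger
    } else if (!timer) {
      timer = setTimeout(() => void flush(), WINDOW_MS); // window trigger
    }
  });
}

async function flush(): Promise<void> {
  if (timer) {
    clearTimeout(timer);
    timer = null;
  }
  const batch = pending;
  pending = [];
  if (batch.length === 0) return;
  try {
    const rows = await fetchOrders(batch.map((p) => p.id)); // one bulk query
    const byId = new Map(rows.map((r) => [r.id, r]));
    // Fan-out: every waiting caller gets exactly its own answer back.
    for (const p of batch) {
      const row = byId.get(p.id);
      if (row) p.resolve(row);
      else p.reject(new Error(`no result for ${p.id}`));
    }
  } catch (err) {
    // Whole-batch failure still resolves every caller, as a rejection.
    for (const p of batch) p.reject(err);
  }
}
```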
The latency tax and where it disappears
Every batched call now pays up to W ms of wait for the window to close. At low traffic — one call per window — that is pure overhead, no amortization. A single agent under low load runs slower with batching turned on. This is the same lesson LLM serving's batching module teaches for inference: a request with no neighbors in the batch waits for company that never arrives.
At high traffic the tax inverts. With 50 concurrent calls in a 10 ms window, each call pays ~10 ms of wait but saves ~50 round-trips worth of overhead at the downstream — the tax is paid once and the savings are paid out 50 times. The cross-over throughput is the design question: below it, batching hurts; above it, batching helps.
A simple heuristic: batch when the downstream's per-request overhead (round-trip + connection + parse) is larger than W. For a same-data-center Postgres query at sub-millisecond round-trip, W = 10 ms is too long; for an inter-region API call at 50 ms round-trip, W = 10 ms is short enough to be invisible. Tune to the downstream's shape, not to a default copied from a blog post.
When to do it, when not to
A short decision table for the lever:
| Situation | Batch? | Reasoning |
|---|---|---|
| Many agents, same downstream, bulk-capable | Yes | The canonical fit. Overhead amortizes, downstream prefers it. |
| One agent, low traffic, latency-sensitive end user | No | Pure latency tax. Batch of one is just a delay. |
| Offline eval / overnight pipeline | Yes — including model calls | Use the provider's batch API for the ~50% discount. |
| Tool calls hitting a third-party with no bulk endpoint | No | Nothing to amortize. Buffering serial calls just adds latency. |
| Mixed-arg calls (same tool, different shapes) | Sometimes | Only if the downstream can accept a heterogeneous batch. Often a separate batcher per arg shape is cleaner. |
| Bill is dominated by something else (long inputs, too many ticks) | No — fix that first | Steps 1, 2, and 3 are the right levers. Batching does not help. |
The repeated mistake — across teams that have batched many things and across teams batching their first thing — is reaching for the lever before profiling. A team that adds batching because "the bill is high" without checking whether tool-call count is actually the dominant cost line item ends up with a more complex harness and no measurable improvement. The Step 1 cost-by-stage chart is the precondition; the batcher follows from it, not the other way around.
The complexity cost is real
A working batcher is not a hundred lines. The pieces it has to handle:
- Concurrent submission from many call sites (one per agent, possibly across threads or async tasks)
- A flush timer and a size threshold, both racing
- Caller-to-result mapping for fan-out
- Failure isolation: one bad ID in a batch should not poison the others — degrade per-item if the downstream supports it, or split-and-retry if it does not
- Backpressure: what happens when the batcher is full and a new call arrives during a flush?
- Cancellation: a caller's context gets cancelled mid-batch; the batcher has to handle the orphaned future cleanly
- Observability: per-batch size, flush trigger, and downstream latency — Module 2's traces need a batch_size and a batch_flush_reason attribute on the spans the batcher creates
That is non-trivial harness code. Libraries exist (DataLoader.js for the JS world, aiodataloader for Python, batching middleware for most service meshes) and adopting one is almost always cheaper than rolling your own. The reason to roll your own is when the downstream's batch shape is exotic enough that no library fits — and that is rare enough to be worth treating as a yellow flag.
Where this hands off
Module 4 is the cost-and-latency view of the agent: per-tick profile (Step 1), prompt caching (Step 2), result caching (Step 3), parallel tool calls (Step 4), and batching (this step). Each one is a lever fitted to a different bottleneck in the profile. The right way to use the module is as a diagnostic: profile the agent, identify the dominant slice, apply the matching lever, re-profile.
Module 9 (Capstone: Reliability Operations) closes the track by treating these levers as the variables that an SLO trades against. A 200 ms p95 latency target for a customer-facing agent makes certain levers mandatory and certain levers (model-call batching, long-window tool batching) off-limits. A 10-second target for a background email triage agent makes the opposite choices viable. The framework is the same; the constraints choose the design.
Try it: Stage 5 of the right pane runs 16 concurrent agents all calling lookup_order against a shared batcher, with sliders for the batch window (0–200 ms) and the size threshold. Drag the concurrent-agents slider from 4 up to 25 and watch the batched $/run drop while p99 latency stays nearly flat — that is amortization showing up on the metric cards. Now drop the agents back down to 2 and watch p99 climb: the batch window now starves, every call waits the full window for a partner that never arrives, and the lever inverts into a pure latency tax. The point is to feel that batching is throughput-conditional — the same configuration is a win at high load and a regression at low load, and the cross-over is the number to know.