Agent Teams — Multi-Agent Orchestration
When Teams Beat a Single Agent
What is a multi-agent team?
A multi-agent team is a system in which more than one LLM-powered agent — each with its own prompt, tools, and sometimes its own model — collaborates on a single task. The team only earns its existence when the coordination tax it pays (extra tokens, extra latency, extra debugging surface) is less than the specialization gain it produces (higher pass rate, better coverage, faster wall time). For most tasks that one agent handles well, adding a second one is a tax with no benefit. For a narrow class of tasks — multi-domain research, parallel code-and-verify, long open-ended builds — the team wins so decisively that further single-agent prompt engineering hits a clear ceiling well below what the team can reach. This module is about telling those classes apart, and about the specific topologies that make the win happen.
The seven prior modules of this track all assumed one well-instrumented agent: a harness (Module 1), traces (Module 2), guardrails (Module 3), cost discipline (Module 4), evals (Module 5), deploys (Module 6), incident response (Module 7). This module is the first one where the unit of analysis is not the agent but the team — and every discipline from the prior seven gets a multi-agent variant: a per-team trace, a per-handoff guardrail, a per-worker cost line.
Why a single agent is the right default
The honest opening: most agent products in production are single-agent. ChatGPT, Claude.ai, Cursor's chat panel, GitHub Copilot's autocomplete — all single-agent. A single agent has one context window to manage, one prompt to debug, one cost line to attribute, one trace to read. Each of those is a non-trivial engineering surface; multiplying by N is not free.
The bar to add a second agent should be deliberately high. A useful default rule:
- Push the single agent to its eval ceiling first. Better prompt, better tools, better retry policy, better caching — these compound. Most teams that "needed a multi-agent system" needed a better single agent.
- Establish what the pass-rate gap actually is. If the lift from going multi-agent is less than 5 percentage points on a real benchmark, the tax probably isn't worth it.
- Accept that the cost will roughly multiply by the number of distinct agents on the critical path. Anthropic's internal multi-agent research system reports running approximately 15× the tokens of a single chat for that reason — the win has to be worth the multiplier.
When all three gates pass, a team is the right answer. When any one of them fails, single is.
The four task classes
A useful classification — every task an agent product handles falls into one of four buckets. The team-vs-single answer depends almost entirely on which bucket:
- Lookup. "What is our refund policy?" The agent fetches one document, reads it, answers. A single agent gets ~95% of these right at near-zero cost. Adding a team adds zero pass-rate and an order of magnitude in tokens. Single wins, decisively.
- Single-domain reasoning. "Debug this Python stack trace." One domain, one set of tools, one chain of reasoning. A single agent does it in 4k tokens at ~78%. A team can lift this to ~82% by parallelising hypotheses, but at 8× the tokens — usually not worth it. Single still wins, marginally.
- Multi-domain research. "Compare PostgreSQL vs MySQL for write-heavy workloads — pull docs, write a benchmark, summarize." Three distinct sub-tasks, each benefiting from a specialized prompt. A single agent context-collapses trying to hold all three in one head; a team splits cleanly and wins by ~30 pp. Team wins decisively.
- Open-ended build. "Build a Slack notification bot from scratch." Long horizon, ambiguous spec, multiple tool surfaces. A single agent gets ~38% (often loops or stalls); a team can hit ~64% but at 15× tokens. Team wins, but the win is expensive — only worth it when 64% > 38% pays for the bill.
The pattern: the team wins on tasks the single agent can't already handle, and the win scales with how many distinct sub-domains the task touches. The failure mode is using a team on a task the single agent already handles — the team "works" but burns money for nothing.
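A quick way to make that pattern concrete is a toy expected-value check over the four buckets. The sketch below uses the module's approximate pass rates and token multipliers (the multi-domain base rate is assumed, as are value_per_success and cost_per_single_run, which are arbitrary placeholders); the verdict flips with those parameters, which is exactly the point.

```python
# Toy expected-value check over the four task classes. Pass rates and multipliers
# follow the module's illustrative figures (the multi-domain base rate is assumed);
# value_per_success and cost_per_single_run are arbitrary placeholders.
TASK_CLASSES = {
    # name: (single_pass, team_pass, team_token_multiplier)
    "lookup":                  (0.95, 0.96, 11),
    "single_domain_reasoning": (0.78, 0.82, 8),
    "multi_domain_research":   (0.45, 0.74, 12),
    "open_ended_build":        (0.38, 0.64, 15),
}

def team_is_worth_it(single_pass, team_pass, multiplier,
                     value_per_success=3.00, cost_per_single_run=0.02):
    """Does the pass-rate lift (in $) cover the extra token spend (in $)?"""
    lift_value = (team_pass - single_pass) * value_per_success
    extra_cost = (multiplier - 1) * cost_per_single_run
    return lift_value > extra_cost

for name, (s, t, m) in TASK_CLASSES.items():
    verdict = "team" if team_is_worth_it(s, t, m) else "single"
    print(f"{name:<24} +{(t - s) * 100:2.0f}pp at {m:>2}x tokens -> {verdict}")
```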
What "specialization" actually means
A team only beats a single agent when each agent is doing meaningfully different work. Three engineers running the same prompt with different temperatures is not a team — it's one agent rolled three times, and the gain is the same as raising temperature once. Real specialization looks like:
- Different prompts. A research agent's prompt is "find evidence and cite sources;" a coder agent's is "write code that passes the tests;" a reviewer agent's is "find what's wrong with this output." These prompts cannot be fused — each is optimised for a different kind of correct answer (cited prose vs working code vs flagged errors), and a single mega-prompt that tries to do all three almost always degrades on each.
- Different tools. A research agent has web search and a docs index; a coder has a filesystem and pytest; a reviewer has a diff checker. The tool surface enforces the specialization — the coder can't go searching, even if the prompt would let it.
- Different models. A planning agent on a frontier model; worker agents on a smaller, faster model. This is a cost-engineering specialization that real production systems use heavily — see Module 4 on cost engineering.
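A minimal sketch of what that specialization looks like as configuration. The model strings, tool names, and prompts below are illustrative, not any framework's API; the test is whether the three specs are genuinely different.

```python
# A minimal sketch of specialization as configuration; names, model strings, and
# tool lists are illustrative, not a specific framework's API.
from dataclasses import dataclass, field

@dataclass
class AgentSpec:
    name: str
    model: str           # which model this agent runs on
    system_prompt: str   # what "a correct answer" means for this agent
    tools: list[str] = field(default_factory=list)   # the only tools it may call

TEAM = [
    AgentSpec("researcher", "frontier-large",
              "Find evidence and cite sources. Never invent citations.",
              ["web_search", "docs_index"]),
    AgentSpec("coder", "fast-small",
              "Write code that passes the provided tests.",
              ["filesystem", "pytest"]),
    AgentSpec("reviewer", "fast-small",
              "Find what is wrong with this output. Report concrete defects.",
              ["diff_checker"]),
]
# If the three specs collapse to one prompt, one tool list, and one model,
# this is not a team; it is one agent run three times.
```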
A "team" of agents that all run the same prompt with the same tools is just one agent run N times — and N-times-one is the parallel agents + voting pattern from Step 3, which is its own thing with its own gating conditions. This module's team is the specialized kind, where each agent does work the others can't.
Where the published evidence is
Three reference points worth tracking, because the public literature on multi-agent systems is much thinner than on single agents:
- Anthropic — How we built our multi-agent research system reports the 15× token tax for deep research, and the engineering rationale for why specialized sub-agents won over a single agent at deep-research workloads.
- OpenAI — Swarm framework (experimental, open-source) formalises a minimal handoff protocol; not a production product. The production successor is the OpenAI Agents SDK, which keeps the handoff shape but adds typing, tracing, and guardrail hooks.
- LangGraph and Crew.ai ship production-oriented multi-agent frameworks; their docs are the most concrete public source on supervisor/worker and handoff conventions.
The picture across these: multi-agent is a real win for a narrow class of problems, the token tax is roughly an order of magnitude, and the engineering investment is non-trivial. None of them claim multi-agent is universal — and you should be skeptical of any source that does.
Try it: Stage 1 of the right pane is the single-vs-team comparison. Pick each of the four task chips (Lookup, Single-domain reasoning, Multi-domain research, Open-ended build) and toggle between Single agent and Specialist team. Watch the three metrics (success rate, total tokens, wall latency) and the verdict box at the bottom. Lookup shows the team paying ~11× tokens for +1pp success — the team is wasteful. Single-domain reasoning shows +4pp at 8× tokens — usually not worth it. Multi-domain research shows +29pp at 12× tokens — the team is decisively winning. Open-ended build shows +26pp at 15× tokens — the team wins but the bill is real. The same team topology is wasteful, marginal, decisive, or expensive depending entirely on what task you put it on.
Supervisor / Worker
What is a supervisor / worker topology?
Supervisor / worker is the multi-agent topology where one agent (the supervisor) plans the work, dispatches sub-tasks to specialist worker agents, and synthesizes their outputs into a final answer. The supervisor sees the full task and the global plan; each worker sees only its narrow sub-task and produces a narrow output. The defining structural decision is what shape the workers' outputs take when they come back: raw outputs (which the supervisor then has to read in full) or summaries (which the supervisor reads cheaply but which require workers to do their own self-summarization). That decision determines whether the supervisor's context window survives the synthesis step or collapses under the load.
This is the multi-agent execution of Foundations Module 3's orchestrator-workers pattern — same shape, but the workers are now full agents with their own loops rather than single LLM calls. The escalation matters because each worker can now use tools, make multiple model calls, and produce outputs much larger than a single prompt response — which is exactly why the question of what the supervisor actually sees becomes load-bearing.
The four phases of a supervisor / worker run
Every well-designed supervisor / worker system walks the same four phases. Naming them is useful because each phase has its own failure mode and its own tuning lever:
- Decompose. The supervisor reads the user prompt and decides what sub-tasks to spawn — and how many. Failure mode: over-decomposition (10 workers when 3 would do) — a tax with no benefit. Under-decomposition (one mega-worker that re-invents the supervisor) — defeats the point.
- Dispatch. The supervisor sends each worker its sub-task prompt + the constrained context that worker needs. Failure mode: dispatching the full user prompt to every worker, which destroys the specialization gain (workers all see everything and converge on similar answers).
- Work. Workers run in parallel. Each worker is a complete agent — its own model call(s), its own tool use, its own retries. Failure mode: workers leaking into each other's domains because their prompts weren't scoped tightly enough.
- Synthesize. The supervisor receives worker outputs (raw or summarized) and composes the final answer. Failure mode: context overflow — the workers returned 12k tokens of evidence and the supervisor's window is 8k.
Phase 4 is where the topology lives or dies, and it's the design knob this step is about.
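One way to see the four phases is as a single supervisor function. The sketch below uses stand-in helpers (llm, run_worker, parse_subtasks) rather than any real framework API; the phase structure and the hard sub-task cap are the point.

```python
# A sketch of the four phases as one supervisor function. llm(), run_worker(),
# and parse_subtasks() are stand-ins, not a framework API.
import concurrent.futures

def llm(prompt: str) -> str: ...                 # stand-in for a supervisor model call
def run_worker(subtask: str) -> str: ...         # stand-in for one full worker-agent run
def parse_subtasks(plan: str) -> list[str]: ...  # stand-in for plan parsing

def run_supervisor(task: str, max_subtasks: int = 3) -> str:
    # Phase 1 -- Decompose: plan a bounded number of specialist sub-tasks.
    plan = llm(f"Split this task into at most {max_subtasks} specialist sub-tasks:\n{task}")
    subtasks = parse_subtasks(plan)[:max_subtasks]   # hard cap on over-decomposition

    # Phase 2 -- Dispatch / Phase 3 -- Work: each worker sees only its own sub-task.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        results = list(pool.map(run_worker, subtasks))

    # Phase 4 -- Synthesize: compose the final answer from what the workers returned
    # (raw outputs or summaries; that contract is the next design decision).
    evidence = "\n\n".join(results)
    return llm(f"Task:\n{task}\n\nWorker outputs:\n{evidence}\n\nWrite the final answer.")
```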
The output-shape decision
When a worker finishes, what does it return to the supervisor? Two options:
- Raw outputs. The worker returns everything it produced — the full doc excerpts, the full code, the full draft paragraphs. The supervisor reads the raw output and synthesizes from it directly.
- Summaries. The worker returns a structured summary — "here's a 200-token version of what I found, here's a citation list, here's the one paragraph I'd include." The supervisor reads the summary, not the raw evidence.
The math forces the choice. If three workers return roughly 4.2k, 2.8k, and 5.1k tokens of raw output, the supervisor synthesizes against ~12.1k tokens of worker output plus its own 1.2k of system prompt and plan — ~13.3k total. On a model with an 8k working window, the synthesis fails (or, worse, silently truncates and the supervisor produces a confident answer based on the parts it could fit). On a 32k window the same load fits — but the supervisor is now paying 12k+ of input tokens at synthesis-time, every time, and the cost line is real.
If the same three workers each produce a 300-token summary, the supervisor synthesizes against ~1.9k tokens — well inside any working budget, and at a fraction of the synthesis cost.
The cost the summary contract pays: each worker has to do its own self-summarization, which means a second model call per worker (one to do the work, one to summarize it). That call is paid in worker tokens, where input is small (the worker's own output) and the model can be smaller. The trade is good: worker self-summary at smaller-model rates is much cheaper than supervisor synthesis on a frontier model with the full raw output.
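The synthesis-load arithmetic, made explicit. The per-worker counts are the illustrative figures from this step; the window sizes are round numbers, not any model's real limits.

```python
# Synthesis load under each output-shape contract, using this step's illustrative figures.
worker_raw      = [4_200, 2_800, 5_100]   # raw output tokens per worker
worker_summary  = [300, 300, 300]         # self-summarized output per worker
supervisor_base = 1_200                   # system prompt + plan the supervisor carries anyway

def synthesis_load(worker_tokens: list[int]) -> int:
    return supervisor_base + sum(worker_tokens)

print(synthesis_load(worker_raw))       # 13300 -> overflows an 8k window; fits, expensively, in 32k
print(synthesis_load(worker_summary))   # 2100  -> roughly the ~1.9k quoted above; fits any budget
```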
When raw outputs are the right call
The summary contract isn't universally correct. Three cases where raw outputs make sense:
- The worker's output is already small. A single tool-call result, a 50-line code snippet, a one-paragraph draft — there's nothing to summarize.
- The supervisor needs to verify the evidence. A research synthesis that has to cite primary sources can't accept a summary; it needs to read the actual quoted text to attribute correctly. The summary path here would lose the audit trail.
- The supervisor and worker are the same model. When you're running a smaller model as both supervisor and worker (a cost optimization), the summary call costs nearly as much as the raw read — the savings disappear.
The default for a frontier-model supervisor with multiple meaningfully-sized workers is summaries. Raw outputs are an opt-out for the cases above, not the starting point.
How the supervisor knows when to stop
A real failure mode in supervisor / worker systems: the supervisor keeps spawning more workers because each round's output looks "almost there." This is the evaluator-optimizer pattern from Foundations Module 3 gone rogue — a supervisor that can't commit. Two real-world stops the production frameworks bake in:
- A hard sub-task budget. The supervisor declares N sub-tasks at decomposition time and is structurally prevented from adding more. If the answer isn't there after N workers, the answer isn't coming from this run; the supervisor has to escalate to a human or return what it has.
- A token / cost budget per task. The system tracks total tokens across the supervisor + all workers; when the budget is exhausted, the supervisor must synthesize from what's present. This is the same shape as the retry budget from Module 1 applied to the team.
Without one of these, a single bad task can spawn dozens of worker calls and blow through any reasonable cost ceiling — exactly the failure mode that gives multi-agent systems a bad name.
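A minimal sketch of the two stops. The class and its defaults are illustrative, not a framework primitive; the surrounding loop is assumed to call spawn_allowed() before each dispatch and charge() after every model call.

```python
# A minimal sketch of the hard sub-task budget and the shared token budget.
class BudgetExceeded(Exception):
    pass

class TeamBudget:
    def __init__(self, max_subtasks: int = 5, max_total_tokens: int = 60_000):
        self.max_subtasks = max_subtasks        # hard sub-task budget, fixed at decomposition
        self.max_total_tokens = max_total_tokens
        self.subtasks_spawned = 0
        self.tokens_spent = 0

    def spawn_allowed(self) -> bool:
        if self.subtasks_spawned >= self.max_subtasks:
            return False                        # escalate or synthesize from what's present
        self.subtasks_spawned += 1
        return True

    def charge(self, tokens: int) -> None:
        self.tokens_spent += tokens             # supervisor + all workers, one shared meter
        if self.tokens_spent > self.max_total_tokens:
            raise BudgetExceeded("token budget exhausted: synthesize now")
```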
Frameworks that ship this topology
The major production-oriented multi-agent frameworks all expose a supervisor / worker primitive. The names and idioms differ; the shape is the same:
- LangGraph's "supervisor" pattern — the supervisor is a node in a graph; workers are nodes the supervisor routes to.
- Crew.ai's hierarchical process — same shape, named "manager / agent."
- OpenAI Swarm — "routines and handoffs;" supervisor/worker is a particular handoff topology, not a primitive.
Picking a framework is a separate question. Understanding the four-phase shape and the raw-vs-summary trade carries across all of them.
Try it: Stage 2 of the right pane shows the supervisor / worker run for the prompt "Compare PostgreSQL vs MySQL for write-heavy workloads." Click ▶ Run topology and watch the four phases (decompose → dispatch → work → synthesize) play out. Three workers (Researcher, Engineer, Writer) appear and each returns a token count. Now toggle between RAW outputs and SUMMARIES and run again. With raw outputs, the supervisor's context-window meter shows ~13.3k against an 8k budget and the verdict goes red — the synthesis fails or truncates. With summaries (~980 worker tokens total), the meter stays comfortably under budget and the verdict goes green. The same topology, the same workers — only the output-shape contract changes, and the supervisor either survives synthesis or doesn't.
Parallel Agents & Voting
What is parallel agents with voting?
Parallel agents with voting is the multi-agent pattern where N copies of an agent attempt the same task independently, and a verifier — usually code, sometimes another LLM — picks the best output. "Voting" here is shorthand for the broader generate-N-and-select family, not literal majority vote: in practice the selector is almost always a deterministic check (a test suite, a schema validator, a numerical comparison), and only occasionally an LLM judge or a true majority-of-N tally over discrete answers. Unlike supervisor / worker (specialization), every agent in this topology does the same work. The win comes from the math of independent failures: if a single agent has a 45% pass rate, five independent agents have a 1 − (1 − 0.45)⁵ ≈ 95% chance that at least one succeeds — provided their failures actually are independent and you have a verifier that can pick the success.
This is the multi-agent version of the parallelization pattern from Foundations Module 3, and the gating conditions are sharper than they look. The pattern works for tasks with verifiable success criteria (a test suite, a numerical answer, a structural validator) and independent failure modes. It fails — silently, expensively — for tasks where neither condition holds.
Why the math matters
The core probabilistic insight is one that competitive programmers, ML researchers, and chip-design engineers have all rediscovered: when a task has a high-cost, low-probability successful path, running many trials in parallel beats trying harder on one. Concretely:
- One agent with a 45% pass rate: P(at least one success) = 0.45 = 45%
- Two agents, independent: P(at least one) = 1 − 0.55² = 70%
- Five agents, independent: P(at least one) = 1 − 0.55⁵ ≈ 95%
- Ten agents, independent: P(at least one) = 1 − 0.55¹⁰ ≈ 99.7%
Each additional agent buys diminishing pass-rate lift. Going from 1 to 5 agents almost always pays for itself if the per-task value justifies a 5× token cost. Going from 5 to 10 is much harder to justify — you double the cost for ~5pp of additional pass-rate. Production systems that use this pattern almost always settle at N = 3 to 7.
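The same math as one helper, for sanity-checking your own pass rate and N:

```python
def p_at_least_one(p_single: float, n: int) -> float:
    """P(at least one of n independent attempts succeeds)."""
    return 1 - (1 - p_single) ** n

for n in (1, 2, 5, 10):
    print(n, round(p_at_least_one(0.45, n), 3))   # 0.45, 0.698, 0.95, 0.997
```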
The same math has been the operating logic of code-generation systems for years — AlphaCode (DeepMind, 2022) generated millions of candidate programs, filtered them against the small set of public example tests the problem statement provides, then clustered the survivors and submitted one representative per cluster. The published 54th-percentile competitive-programming result depended on parallel sampling + a layered selection process, not on a single brilliant attempt — and not on the hidden judge tests, which AlphaCode never saw at selection time. Modern coding agents (Cursor's composer mode, Claude Code, Devin) all expose some flavor of generate-N-and-verify, with the verifier being whatever the runtime can cheaply check.
When the pattern works
Two conditions, both required:
- A verifier exists. Something has to pick the "winning" output deterministically. The verifier can be a test suite (most common), a numerical comparison (does the answer equal 42?), a schema validator (does the JSON parse?), or a different LLM scoring on a rubric (LLM-as-judge — the lossiest verifier, but works for fuzzy criteria). Without a verifier, you have N candidate answers and no way to choose; you've paid N× cost for one random pick.
- Failures are independent. The N agents have to fail in different ways. If all five share the same prompt, same model, same seed family, same tool surface, they'll fail on exactly the same edge cases — the team is one expensive coin flip in disguise. Independence comes from varying something across agents: temperature, prompt phrasing, tool order, sometimes model entirely.
Both conditions failing is common and ruinous. A "multi-agent" design that runs five copies of GPT-5.4 with identical prompts on a chat task with no test suite is paying 5× to get exactly the same answer as one agent — the worst possible cost-quality trade.
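A sketch of the generate-N-and-select loop with both conditions in view. attempt_task (one independent agent run) and passes_tests (the deterministic verifier) are stand-ins, not real APIs.

```python
# Generate-N-and-select with a deterministic verifier and structurally varied attempts.
import random

def attempt_task(task: str, variant: dict) -> str: ...   # stand-in: one independent agent run
def passes_tests(candidate: str) -> bool: ...            # stand-in: the test-suite verifier

def parallel_and_vote(task: str, n: int = 5):
    # Independence has to come from varying something real per attempt,
    # not just re-rolling the same configuration n times.
    variants = [
        {"temperature": random.uniform(0.2, 1.0),
         "prompt_style": random.choice(["tests-first", "spec-first", "edge-cases-first"])}
        for _ in range(n)
    ]
    candidates = [attempt_task(task, v) for v in variants]   # could equally run in parallel
    winners = [c for c in candidates if passes_tests(c)]     # the verifier picks, not an opinion
    return winners[0] if winners else None   # no verified winner: escalate, don't guess
```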
When the pattern fails — the silent ways
Two failure modes that look like the pattern is working but aren't:
- Correlated failures dressed as independent. The agents share the same training data, often the same base model, and frequently hit the same blind spots. A famous version: five copies of the same coding agent all fail to handle the same off-by-one because they all internalized the same idiom from training. The appearance of independence (different seeds, different temperatures) doesn't guarantee independence of failure. Real independence usually requires varying something more structural — different model families, genuinely different prompts, different tool surfaces.
- LLM-as-judge that votes for the wrong answer. The verifier is itself a stochastic LLM, and it has its own correlated failure modes. If the judge prefers verbose, confident-sounding outputs, the pattern just selects the most verbose, confident-sounding wrong answer N times. Code execution as a verifier is much sharper because the test suite has no opinions; LLM-as-judge is workable for fuzzy criteria but introduces its own bias surface — see LangChain's LLM-as-judge writeup for the calibration loop required.
The honest read: the pattern is at its most reliable when the verifier is code and the agents are structurally varied. Anywhere else, do the math twice.
What the parallel pattern doesn't need
A useful negative: parallel-and-vote does not need supervisor / worker scaffolding. There's no plan, no decomposition, no synthesis — just N independent attempts and one verifier. This makes it operationally cheaper to ship than the supervisor / worker topology of Step 2, and it composes well: a supervisor / worker system can have one of its workers be a parallel-and-vote sub-team. The two patterns are complementary, not alternatives.
Cost framing
The parallel-and-vote tax is roughly N× the single-agent cost plus the verifier cost. For N=5 with a free verifier (test suite), the tax is 5×. For N=5 with an LLM judge that costs as much as one agent, the tax is 6×. The decision question is the same as Step 1's: does the pass-rate lift justify the multiplier?
A useful rule from production: parallel-and-vote pays off when the per-task value is high enough to absorb a 5–10× token bill and the alternative is a manual fallback that costs even more (a human reviewing failures, a customer hitting an error). For a cheap chat task with no fallback cost, the multiplier rarely justifies itself.
Try it: Stage 3 of the right pane runs five agents in parallel against a parse_csv(path) coding task. Each agent has a baseline ~45% pass rate. Two toggles change the dynamic. Verifier ON / OFF — with a test-suite verifier, the verifier picks any agent that passed; without it, the "5 outputs" are useless because you can't tell which is right. Failures CORRELATED / INDEPENDENT — independent failures give the (1 − (1 − 0.45)⁵) ≈ 95% lift; correlated failures collapse all 5 outputs to one coin flip and the lift is zero. Click ▶ Run 5 agents repeatedly under each combination. The math at the bottom shows the calculated probabilities; the agent grid shows the actual sampled outcomes for that run. The pattern only earns its 5× tax when both conditions hold; flipping either toggle off makes it a tax with no benefit.
Handoffs
What is an agent handoff?
A handoff is an explicit transfer of control and state from one agent to another, with a known protocol — what the receiver gets, what the sender guarantees. The handoff is the load-bearing primitive of any multi-agent system that isn't pure parallel-and-vote: every supervisor-to-worker dispatch, every triage-to-specialist routing, every escalation from a junior agent to a senior one is a handoff. The protocol shape — full transcript dump, named field list, validated schema — is the variable. Picking the wrong protocol is how a working multi-agent system silently degrades into one that "mostly works but loses the customer's name halfway through."
The handoff is the multi-agent counterpart of the tool schema discipline from Foundations Module 2: a tool schema is the contract between an agent and a tool; a handoff protocol is the contract between two agents. Both fail the same way — too loose, the receiver hallucinates the missing fields; too rigid, the sender can't express what the receiver needs to know.
The three protocol shapes
Real production multi-agent systems converge on three shapes, in increasing order of structure:
- Full transcript dump. The receiver gets the entire conversation history appended to its prompt. Zero protocol design, zero schema work, but every handoff carries the cumulative weight of every prior turn — payloads grow linearly with the conversation.
- Named fields. The sender extracts a small set of structured fields (customer_id, intent, charge_id, reason) and passes only those to the receiver. Cheap, sharp, and the receiver's prompt knows exactly what to expect.
- Validated schema + summary. Same fields as above, plus a short rationale string and a runtime validator (Pydantic, Zod, JSON Schema). The receiver can sanity-check that the rationale matches the fields and reject malformed handoffs before acting.
The real-world success rate climbs with the structure: roughly 65% on the transcript dump (refund agent gets distracted by chit-chat earlier in the call), 86% on named fields (clean transfer, occasional wrong-field bug), 92% on schema + summary (the rationale + validation catches the wrong-field bug before it ships). Numbers are illustrative — your own measurements will differ — but the shape is robust across teams that have measured it.
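A sketch of what the named-fields protocol actually passes across the boundary, for the refund example used in this step. The extraction function and the field values are illustrative stand-ins.

```python
# Sender-side extraction for the named-fields protocol: hand over only what the
# receiver needs. Hard-coded here to show the payload shape; values are illustrative.
def extract_handoff_fields(transcript: str) -> dict:
    # In practice this is a small model call or deterministic parsing over the
    # triage transcript.
    return {
        "customer_id": "C-8841",                       # illustrative value
        "intent": "refund_duplicate_charge",
        "charge_id": "ch_0212",                        # the duplicate charge, not the first one
        "reason": "charged twice for order O-321 on Jan 12",
    }

payload = extract_handoff_fields("...full triage conversation...")
# A payload this size crosses the boundary instead of the multi-thousand-token transcript.
```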
Why the transcript dump is tempting and dangerous
The transcript dump is the default for teams that ship a multi-agent system without thinking about the handoff at all. It's what every framework gives you for free if you don't opt into something tighter. It works fine on the first iteration — when conversations are short and agents are few — and starts to fail in three specific ways as traffic scales:
- Token cost explodes per hop. If two agents each carry the full transcript, the transcript is paid twice; with three agents in a chain, three times. A 5k-token transcript at three hops is 15k tokens of redundant payload. This is the multi-agent version of the no-cache cost waste from Module 4.
- The receiver gets distracted by old context. A refund agent that sees the customer's opening "hi, I'm Alex, I've been a customer for years" sometimes weights that signal heavily and approves a refund the policy doesn't allow. The relevant context (the disputed charge) is buried in the middle of the transcript.
- PII / data-handling violations multiply. Every agent in the chain now sees fields it doesn't need to do its job. This is the multi-agent version of the least-privilege principle from the security trifecta — only pass the fields the receiver needs.
The transcript dump is a convenience default; production systems should pick something tighter as soon as they're past the prototype phase.
Field-list and schema-summary as the production pattern
Once a team knows what each agent needs, the named-fields protocol is almost always cheaper, faster, and more reliable. The cost: writing the extraction logic — what fields the sender hands over, in what shape. The benefit: every handoff is one structured payload, the receiver's prompt is shorter, the cost line drops by an order of magnitude.
The schema + summary version adds two things on top of named fields:
- Runtime validation. Pydantic, Zod, or JSON Schema rejects a malformed payload at the boundary — the receiver never executes a tool call with a missing field.
- A short summary string. A 50–100 token rationale ("Customer was charged twice for order O-321 on Jan 12; refund the second charge") the receiver can sanity-check against the fields. If the summary says "refund the second charge" but the charge_id field is the first charge, the receiver can flag the mismatch instead of acting on bad input.
The summary string is the multi-agent version of the structured-output validator pattern from Foundations Module 2 — a self-consistency check on the agent-to-agent payload.
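A sketch of the receiving side of the schema + summary contract using Pydantic (Zod or JSON Schema play the same role). The mismatch check at the end is a deliberately crude illustration of the sanity check, not a library feature.

```python
# Receiver-side validation: reject malformed payloads before any tool call runs.
from pydantic import BaseModel, ValidationError

class RefundHandoff(BaseModel):
    customer_id: str
    intent: str
    charge_id: str
    reason: str
    summary: str   # the 50-100 token rationale

def receive_handoff(raw_payload: dict) -> RefundHandoff:
    try:
        handoff = RefundHandoff(**raw_payload)       # schema validation at the boundary
    except ValidationError as err:
        raise RuntimeError(f"handoff rejected before any tool call: {err}") from err
    if handoff.charge_id not in handoff.summary:     # rationale should name the charge it refunds
        raise RuntimeError("summary/field mismatch: flag for review instead of acting")
    return handoff
```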
What the major frameworks ship
The three production-oriented multi-agent frameworks each formalise the handoff slightly differently:
- OpenAI Swarm ships a minimal handoff() primitive — the agent declares which other agent it wants to transfer to, and Swarm handles the routing. Default payload is a context object you populate in code. (Swarm is now positioned as an experimental reference; production OpenAI deployments are pointed at the OpenAI Agents SDK, which formalises the same handoff shape with stronger typing.)
- LangGraph models handoffs as edges in a state graph, with the shared state object as the payload — the schema lives in the state class itself.
- Crew.ai uses tasks and crews; a handoff is the dependency edge between tasks, and the payload is the previous task's output.
Picking a framework is downstream of picking your protocol. All three support all three protocol shapes; the "framework choice" is mostly about how much ceremony you want to write per handoff.
The handoff trace is what you debug
When a multi-agent system fails, the failure is almost always in a handoff: the wrong agent received the wrong payload, or the right payload was malformed and the receiver guessed at the missing fields. The single most useful diagnostic you can build is a handoff trace — for each task, the sequence of agents touched and the exact payload at each hop.
This is the multi-agent extension of Module 2's span-per-tick trace discipline. A single-agent trace is one chain of spans; a multi-agent trace is a tree, with handoffs as the branching edges. A trace store that doesn't represent handoffs as first-class events makes multi-agent debugging an order of magnitude harder.
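A sketch of what a first-class handoff event might carry. The schema is illustrative, but each field answers a question you will ask while debugging a failed run.

```python
# A handoff as a first-class trace event, stored alongside the per-agent spans.
from dataclasses import dataclass

@dataclass
class HandoffEvent:
    task_id: str
    hop: int              # position in the chain (0 = user -> triage)
    sender: str           # e.g. "TriageBot"
    receiver: str         # e.g. "BillingBot"
    payload: dict         # exactly what crossed the boundary
    payload_tokens: int   # what this hop cost the receiver in input tokens
    validated: bool       # did the payload pass the schema check?

# With these events recorded, the per-task trace reads as a tree whose
# branching edges are the handoffs.
```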
Try it: Stage 4 of the right pane runs the same customer query ("I think I was charged twice for order O-321 — can I get one refunded?") through the same three-agent chain (TriageBot → BillingBot → RefundBot) under each of the three protocols. Toggle between Full transcript dump, Named fields, and Validated schema + summary and watch what crosses the agent boundary in the "What actually crosses" box. The transcript dump shows the full chat-style payload at ~5,200 tokens per hop with a 65% success rate; named fields drops to a 4-key payload at ~180 tokens per hop with 86% success; schema + summary adds a rationale string at ~340 tokens per hop with 92% success. Each protocol's failure-mode box names exactly how the failures happen — distractor information, sender extraction bugs, summary/field mismatches caught at validation. The takeaway: protocol design is what determines whether your multi-agent system is reliable or just complicated.
Coordination Costs
What is the coordination tax of a multi-agent system?
The coordination tax is the full set of costs a multi-agent system pays that a single-agent system does not — extra tokens for planning, dispatch, summarization, and synthesis; extra latency from sequential dependencies; extra engineering surface to debug, monitor, alert on, and roll back. It is the "dark cost" of multi-agent design: the topology can deliver a real pass-rate lift on a hard task, and still lose on overall economics if the lift doesn't justify the multiplier. This step is the gating discipline — a four-question filter you walk before committing engineering time to a multi-agent build, and the rough order-of-magnitude tax brackets you should expect at each common topology shape.
The honest framing: most teams that prototype a multi-agent system don't need one. The single-agent setup hadn't reached its eval ceiling, the task didn't actually decompose cleanly, the reported lift was 2pp instead of the 20pp the team imagined, or the unit economics couldn't absorb the bill. The discipline of asking the four questions explicitly — before shipping — saves teams from the most common failure mode of this space: a multi-agent system that "works" in a demo but is more expensive, slower, and harder to debug than the single-agent baseline it replaced.
The four-question gate
Walk these in order. A "no" at any question halts the escalation:
- Is the single agent at its eval ceiling? If you can still lift pass-rate with a better prompt, better tools, longer reasoning budget, or smarter retries, do that first. Multi-agent is the most expensive lever in the toolbox; spend the cheap ones first. (See Module 4 on cost engineering for the cheap levers.)
- Does the task decompose cleanly into specialized sub-tasks? If every "worker" would have the same prompt and tool surface, you don't have a team — you have N copies of one agent (which is the parallel + vote pattern, with its own gating). Real specialization needs different prompts, different tools, sometimes different models per worker.
- Is the expected pass-rate lift > 5pp on a real held-out benchmark? Not a vibe check, not a demo. A measured lift on a benchmark you trust. Sub-5pp lifts almost never justify the 10× token tax. (See Module 5 on offline benchmarks for what a real benchmark looks like.)
- Can the unit economics absorb 10–20× tokens per task? A winning topology that costs 15× tokens still has to fit your $-per-task ceiling. If the user-facing price doesn't support it, defer until model prices drop or pick a smaller team (3 workers instead of 5).
Four yeses earn a multi-agent build. Any "no" means stay single — and revisit the question when the prerequisite changes.
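The gate as a function; the thresholds mirror this step's rules of thumb and are defaults to argue with, not constants from any framework.

```python
def multi_agent_gate(at_eval_ceiling: bool,
                     decomposes_cleanly: bool,
                     measured_lift_pp: float,
                     affordable_token_multiplier: float) -> str:
    # Walk the four questions in order; any "no" halts the escalation.
    if not at_eval_ceiling:
        return "STAY SINGLE: spend the cheap levers first (prompt, tools, retries, caching)"
    if not decomposes_cleanly:
        return "STAY SINGLE: identical workers are N copies of one agent, not a team"
    if measured_lift_pp <= 5:
        return "NOT WORTH THE TAX: sub-5pp lift on a real held-out benchmark"
    if affordable_token_multiplier < 10:
        return "NOT WORTH THE TAX: unit economics can't absorb ~10-20x tokens per task"
    return "GO TEAM: the escalation is justified; keep measuring"
```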
Order-of-magnitude tax brackets
Useful intuition for the cost line, not benchmarks — measure your own:
- Single chat assistant: 1×. One model call per turn. The baseline.
- Tool-using single agent: 2–4×. Each tool call adds a round trip; long agentic tasks (Claude Code, Cursor composer, Devin) sit in this range per user-visible result.
- Supervisor + 2–3 workers: 4–8×. Planning + dispatch + parallel work + synthesis. The cheapest team topology that actually earns its existence.
- Multi-agent research / deep workflows: ~15×. Anthropic's reported ratio for deep-research multi-agent at production scale; the workhorse anchor when budgeting an ambitious multi-agent design.
- Pathological cases: 30–100×. Recursive supervisor / worker (workers spawn workers), parallel + vote with N=20 and an LLM judge, agents-debating-agents. Usually a sign the design has lost the plot.
These are token taxes. The engineering tax — debugging surface, observability burden, runbook complexity — scales superlinearly with the number of agents on the critical path. A 3-agent system is roughly 3× harder to debug than a 1-agent system; a 10-agent system is roughly 30× harder, not 10×, because every pair of agents introduces a possible handoff failure mode.
Where the coordination tax shows up in practice
Three concrete budget lines a multi-agent system pays that a single-agent baseline doesn't:
- Per-handoff payload tokens. Even with the lean named-fields protocol from Step 4, every handoff is paid input tokens by the receiver. A three-agent chain pays the handoff tax twice; a five-agent chain four times.
- Per-worker self-summarization. The supervisor / worker contract that makes synthesis cheap (workers return summaries, not raw outputs) requires each worker to do a self-summarization call. That call is a real model call paid in worker tokens; it's often a smaller model, but it's not free.
- Per-team evaluation cost. A single-agent eval can be one trace per task. A multi-agent eval is one trace per agent per task — your eval set effectively grows by a factor of N. The same is true of offline benchmark cost: a 1,000-task benchmark on a 5-agent team is a 5,000-trace evaluation.
The upshot: the coordination tax is not just on the team itself — it's also on the team's evaluation, observability, and operations. Module 5's eval discipline, Module 6's rollout discipline, Module 7's incident response discipline all get more expensive in their multi-agent variants.
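A back-of-envelope estimator for those budget lines. Every rate below is a placeholder to replace with your own measurements; the point is that the total should land roughly inside the tax brackets listed earlier in this step.

```python
# Rough coordination-tax estimate for a supervisor/worker team; all rates are placeholders.
def team_task_tokens(n_workers: int,
                     worker_tokens: int = 4_000,          # per worker, doing the actual work
                     dispatch_payload_tokens: int = 200,  # named-fields payload per dispatch
                     summary_tokens: int = 300,           # per-worker self-summarization output
                     supervisor_tokens: int = 2_000) -> int:  # decompose + synthesize
    return (supervisor_tokens
            + n_workers * (worker_tokens + summary_tokens + dispatch_payload_tokens))

single_agent_tokens = 4_000
for n in (3, 5):
    total = team_task_tokens(n)
    print(f"{n} workers: {total:,} tokens (~{total / single_agent_tokens:.1f}x single-agent)")
```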
When multi-agent is the right call anyway
Three real-world classes where the four-question gate consistently passes and the multi-agent topology earns its tax:
- Deep research over multiple sources. A task that requires fetching evidence from independent domains (web, internal docs, structured data, code), reasoning per-domain, and synthesizing. The single agent context-collapses; the team specializes and wins by 20pp+.
- Code generation with verifiable success. Cursor, Claude Code, Devin all run some flavor of parallel + vote for hard sub-tasks because the test suite is the verifier. The pattern is at its most reliable when the verifier is code, not another LLM.
- Customer-facing triage with specialist downstream agents. A triage agent that routes to billing, refunds, technical support, etc. The handoff is the load-bearing primitive (see Step 4), and the per-specialist accuracy lift over a generalist is real.
Outside these classes, the burden of proof is on the team proposing the multi-agent design, not on the team defending the single-agent baseline.
What this module ends on, and what comes next
This module has been the multi-agent counterpart to the prior seven modules of this track: when one well-instrumented agent isn't enough, what topologies extend the agent surface, what they cost, and how to know whether to escalate. The discipline carried through all five steps — team only when the gate passes — is what separates production multi-agent systems from the prototype graveyard.
Module 9 (Capstone — Reliability Operations) brings the full track to its operating-model conclusion: SLOs, error budgets, on-call rotations, and runbooks for an agent fleet — single-agent or multi-agent. It is the bridge between the engineering discipline of this track and the day-to-day operations of running an agent product in production. The Foundations track gave you an agent that works; this track has given you the engineering to ship it; the capstone gives you the operating model to run it.
Further Reading
- Anthropic — How we built our multi-agent research system — the deep-research case study and the reported 15× token ratio.
- OpenAI — Swarm framework — the minimal handoff primitive; reference for what an agent-to-agent transfer protocol looks like at code level.
- LangGraph — Multi-Agent Architectures — production-oriented multi-agent topologies with handoffs as graph edges.
- Crew.ai — Hierarchical Process — supervisor / worker topology in a production framework.
- AlphaCode (DeepMind, 2022) — the canonical case for parallel-and-verify in code generation.
- LangChain — Aligning LLM-as-a-Judge with Human Preferences — calibration loop for the verifier-is-an-LLM case.
- Anthropic — Building Effective Agents — the workflow / agent / multi-agent decision frame this module ladders into.
Try it: Stage 5 of the right pane is the four-question gate. Walk down the questions one at a time, clicking Yes or No on each. Any "no" halts at a verdict (STAY SINGLE / NOT WORTH THE TAX) with the specific reason. Four yeses earn the GO TEAM verdict — the only path that justifies committing engineering time to a multi-agent build. Below the gate, the reference token-tax bars (1× single chat → ~15× multi-agent research) give you the order-of-magnitude budget intuition; the Anthropic-cited 15× is the workhorse anchor when planning. Reset the gate and try different paths to see which combinations halt where; the structure makes it explicit which prerequisite is missing in any failed escalation.