
Capstone — Three Agent Designs, Same Task

The Task & Three Designs

Why three designs for one task?

The temptation when building an AI feature is to pick the design that looks most exciting on Twitter and then bend the task to fit. The discipline that pays back over a product's lifetime is the opposite: start with the task, look at its actual shape, and pick the simplest design that fits it. The right architecture for an agent is a function of what users will type, not what your model can do at the high end.

This capstone runs one task through three structurally different designs — RAG-only, deterministic workflow, and autonomous agent — and compares them on the same metrics: tokens, latency, success rate, failure modes, and trifecta exposure. The vocabulary from the first eight modules of this track is the toolkit; the decision rule at the end of the module is the deliverable.

The task we will use throughout:

"I want to return the boots I bought last month but the order number isn't in my email. Can you find it and process the refund?"

It is deliberately ordinary. A customer-support assistant for an e-commerce site will see thousands of queries shaped like this. Underneath the casual phrasing, there is enough structure to separate the three designs:

  • Ambiguity. "The boots." Which order? Which pair? The user expects you to figure it out.
  • Lookup. The order is in a database the model has never seen — retrieval over knowledge will not produce it.
  • Policy. Refund eligibility depends on order date, payment method, item condition. Some of this is in a doc, some is in code.
  • Action. "Process the refund" is a side effect. The reply alone is not the deliverable.
  • Tool failures. The order-lookup API can return 0 rows, 5 rows, or time out. The design has to handle all three.

Each of the three designs handles a different subset of this shape well. None of them handle all of it cleanly without trade-offs.

RAG-only

The simplest design. The user query is embedded, the vector store returns the top-k chunks of your knowledge base, and the model writes a reply using those chunks as context. No tool loop, no actions, no state between turns. This is the architecture from Module 4 (retrieval-rag).

What it can do: answer questions whose answer lives somewhere in your documents. Refund policy, sizing guides, shipping timelines, FAQs.

What it cannot do: take any action that touches a database, an API, or the outside world. RAG is read-only on documents — it cannot read your orders table, and it cannot call process_refund.

Rough envelope: one LLM call, ~200–500 output tokens, ~600ms–1.5s end-to-end, ~$0.001–$0.005 per query at current Claude Haiku / GPT-4o-mini prices.
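
A minimal sketch of this design, assuming hypothetical embed, vector_store, and llm interfaces passed in as arguments (your embedding model, vector index, and chat client will have their own APIs):

```python
# RAG-only: one retrieval, one LLM call, no tools, no state between turns.
# `embed`, `vector_store`, and `llm` are hypothetical stand-ins, injected as arguments.

def answer_with_rag(query: str, embed, vector_store, llm, k: int = 3) -> str:
    query_vector = embed(query)                           # one embedding call
    chunks = vector_store.search(query_vector, top_k=k)   # top-k knowledge-base chunks
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the customer's question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.complete(prompt)                            # single generation; the reply is the deliverable
```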

Deterministic workflow

The middle design. A pipeline you wrote in code that invokes the LLM at predetermined points with structured prompts — Module 3 (workflow-patterns). The control flow lives in your code, not in the model. For the customer-order task, the workflow looks something like: classify intent → route to a specialized handler → extract parameters from the query (LLM call) → call the order-lookup tool → check eligibility (deterministic Python) → call the refund tool → format the reply (LLM call).

What it can do: handle any task whose set of intents you can enumerate in advance, with each intent's procedure written out in code. Refund, return, exchange, address change, order status, FAQ — five or ten handlers cover most of customer support.

What it cannot do: handle queries that do not fit one of the intents you wrote. Truly novel tasks, multi-intent queries ("refund the boots AND tell me about the new running shoes"), and queries where the right procedure depends on facts the model finds along the way all stress the workflow.

Rough envelope: 3–8 LLM calls, ~1500–4000 tokens total, ~2–5s end-to-end, ~$0.005–$0.02 per query.
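
A sketch of the refund path through that pipeline. Every helper here (classify_intent, extract_refund_params, lookup_orders, is_refund_eligible, process_refund, format_reply) is hypothetical; the point is that control flow lives in code and the LLM is called only at fixed, structured points:

```python
# Deterministic workflow: code owns the control flow, the LLM fills in the blanks.
# All helper names below are illustrative, not a real library.

def handle_query(query: str) -> str:
    intent = classify_intent(query)                      # LLM call 1: e.g. "refund", "policy_question"
    if intent == "refund":
        params = extract_refund_params(query)            # LLM call 2: product, rough date
        orders = lookup_orders(params.product, params.date)    # tool call, no LLM
        if len(orders) != 1:
            return format_reply("need_order_number", query)    # canned fallback path
        order = orders[0]
        if not is_refund_eligible(order):                # deterministic Python policy check
            return format_reply("refund_denied", order)        # LLM call 3
        process_refund(order.id)                         # side-effecting tool call
        return format_reply("refund_confirmed", order)         # LLM call 3
    if intent == "policy_question":
        return answer_from_policy_docs(query)            # the RAG handler from the previous section
    return format_reply("unrecognized_intent", query)
```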

Autonomous agent

The most flexible design. A single LLM in a tool-use loop — Module 1 (agent-loop-state). The model decides what to call at each tick from a pool of tools (Module 2, tool-use): order lookup, refund API, policy retrieval, clarifying-question, finish. The loop continues until the model emits a stop token or hits a budget.

What it can do: handle queries you did not anticipate. Compose tools in orders you did not script. Recover from tool failures by trying something different. Ask the user for clarification when the request is ambiguous.

What it cannot do: be cheap, fast, or predictable. The compounding-errors framing from Module 7 (evals-diagnostics) is the dominant tax — each tick is another place the model can pick the wrong tool, hallucinate an argument, or loop. The trifecta exposure (Module 8, security-trifecta) is also the widest by default.

Rough envelope: 6–20 LLM calls, ~4000–15000 tokens total, ~8–20s end-to-end, ~$0.04–$0.15 per query.
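
A sketch of the loop, assuming a hypothetical llm.next_action interface and a tools dict mapping tool names to functions; real harnesses built on provider tool-use APIs differ in detail but share this shape:

```python
# Autonomous agent: the model owns the control flow; the harness only executes
# tools and enforces a tick budget. `llm.next_action` is a hypothetical interface.

MAX_TICKS = 12    # hard budget so a confused agent cannot loop (or spend) forever

def run_agent(query: str, tools: dict, llm) -> str:
    messages = [{"role": "user", "content": query}]
    for _ in range(MAX_TICKS):
        action = llm.next_action(messages, tools)        # model picks a tool or finishes
        if action.type == "final_answer":
            return action.text
        result = tools[action.tool_name](**action.arguments)   # execute the chosen tool
        messages.append(                                 # tool result goes back into context
            {"role": "tool", "name": action.tool_name, "content": str(result)}
        )
    return "I wasn't able to complete that request."     # budget exhausted: fail closed
```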

What "shape of the task" means in practice

Anthropic's Building Effective Agents draws the line between workflows and agents along one axis: who owns the control flow. In a workflow, you do. In an agent, the model does. That is also the axis the cost curve runs on — handing control to the model trades cheap predictability for expensive flexibility. The three designs in this capstone are three points on that curve.

The next four steps make the trade-off concrete. Step 2 runs three queries through all three designs and shows the traces. Step 3 enumerates the failure modes and the trifecta exposure. Step 4 plots cost vs reliability across a workload mix. Step 5 distills the operational decision rule.

Try it: Stage 1 of the right pane shows the three architecture cards side-by-side — RAG-only, workflow, autonomous agent — with the customer-order task at the top. Read each card and make a first guess at which design wins for this task before moving on. The next four stages will tell you whether your guess was right.

Trace Comparison

What does each design's trace look like?

The cheapest way to make an architecture decision objective is to run the candidate designs on the same set of queries and look at what each one actually does, step by step. This is error analysis (Module 7, evals-diagnostics) applied to architecture choice rather than to model output. The trace — the sequence of LLM calls, tool calls, and intermediate state — is the unit of comparison. Cost, latency, and correctness all fall out of it.

This step runs five query variants through all three designs. The variants are chosen to make each design's strengths and weaknesses visible:

  1. Easy FAQ — "What's your refund policy?"
  2. Standard refund — the canonical task from step 1.
  3. Ambiguous — a query missing the entity a workflow needs (no product, no date).
  4. Mixed intent — a query that bundles two requests in one sentence.
  5. Adversarial — a prompt-injection attempt at the refund tool.

The walkthrough below focuses on the first, second, and fourth — they expose the design tradeoffs most clearly — and the Stage 2 simulation lets you run all five.

The token / latency / cost envelopes below are illustrative; they represent the shape of numbers a 100-query eval set for this kind of task tends to produce on current frontier models, not the output of a specific benchmark.

Variant 1 — Easy FAQ

"What's your refund policy?"

RAG-only. One trace: embed query → retrieve top-3 chunks from the policy doc → generate a 4-sentence answer. One LLM call, ~250 tokens, ~800ms, ~$0.001. Correct.

Workflow. Three steps: classify intent (LLM call, returns policy_question) → route to the policy handler → retrieve the relevant policy section → format the reply (LLM call). Two LLM calls, ~800 tokens, ~2s, ~$0.004. Correct, but every step after the classifier is overhead for this query.

Agent. Four steps: an LLM "think" tick plans the lookup, the agent calls refund_policy, a second LLM "think" tick composes the answer, and the harness emits the reply. The harness adds a system prompt that enumerates all 8 tools the agent has access to — even unused tools cost tokens. ~1800 tokens, ~3s, ~$0.04. Correct, but 40× the cost of RAG and roughly 4× the latency.

The pattern: on easy queries, the simplest design wins on every axis. The workflow and the agent get the right answer but pay for capabilities they did not use.

Variant 2 — Standard refund

"I want to return the boots I bought last month but the order number isn't in my email. Can you find it and process the refund?"

RAG-only. The model retrieves chunks from the refund policy doc, then generates a reply that describes how refunds work in general — "to start a refund, log into your account and click…" — because it has no tool to actually look up the order or call the refund API. The reply is well-written and useless. The trace is identical to variant 1; the failure is structural, not behavioral. Marked incorrect.

Workflow. Six steps. Classify intent (LLM call) → extract product + date from the query (LLM call) → call lookup_orders (returns one match) → run deterministic eligibility checks → call process_refund → compose the reply (LLM call). Three LLM calls, ~850 tokens, ~2.5s, ~$0.008. Correct.

Agent. Eight steps in the trace: an LLM "think" tick plans the work, the agent calls search_orders (returns one match), a second think tick verifies eligibility, the agent calls refund_policy to confirm the 30-day window, a third think tick approves, the agent calls process_refund, and a final compose tick emits the confirmation. Roughly 3,000 tokens, ~6s, ~$0.06. Correct, but 7× the cost of the workflow for the same outcome.

Where the agent earns its premium on this class of query is the ambiguous variant — when search_orders returns three candidates instead of one. The workflow's parameter extractor silently picks one (or falls back to a canned "need an order number" reply); the agent's loop has an ask_user tool in its prompt and uses it to clarify, then finishes the refund. Run the "Ambiguous" variant in the Stage 2 simulation to see the same agent loop produce a clarifying question that the workflow design has no path to.

Variant 4 — Mixed intent

"About the boots I returned — also, do they come in black?"

RAG-only. Same trace as before; describes the return process and the available colors only if both happen to be in retrieved chunks. Marked partially correct at best; usually misses one half.

Workflow. The intent classifier picks exactly one intent. It will either route to the refund handler (and silently drop the color question) or route to the FAQ handler (and silently drop the refund). This is the core weakness of routing-style workflows from Module 3: they are bad at logical OR. Marked incorrect.

Agent. The agent's loop handles both requests in sequence with no special wiring. An LLM tick detects two intents, the agent calls search_orders then process_refund, then calls product_catalog to check the color question, and an LLM compose tick stitches both halves into one reply. Five steps, ~1,800 tokens, ~4s, ~$0.04. Correct.

The agent's structural advantage on multi-intent queries is real and recurring. Anthropic's own write-up of agent vs workflow trade-offs (Building Effective Agents) frames this as the case where "the task requires open-ended composition of tools" — and it is the case where the agent's per-query premium starts to look defensible.

Side-by-side metrics

| Query | Design | LLM calls | Tokens | Latency | Cost | Result |
|---|---|---|---|---|---|---|
| Easy FAQ | RAG-only | 1 | ~250 | ~0.8s | ~$0.001 | Correct |
| Easy FAQ | Workflow | 2 | ~800 | ~2s | ~$0.004 | Correct |
| Easy FAQ | Agent | 2 | ~1800 | ~3s | ~$0.04 | Correct |
| Standard refund | RAG-only | 1 | ~300 | ~1s | ~$0.001 | Incorrect (no action) |
| Standard refund | Workflow | 3 | ~850 | ~2.5s | ~$0.008 | Correct |
| Standard refund | Agent | 4 | ~3000 | ~6s | ~$0.06 | Correct (ambiguous variant: also clarifies) |
| Mixed intent | RAG-only | 1 | ~300 | ~1s | ~$0.001 | Partial |
| Mixed intent | Workflow | 3 | ~850 | ~2.5s | ~$0.008 | Incorrect (drops one half) |
| Mixed intent | Agent | 2 | ~1800 | ~4s | ~$0.04 | Correct (both halves) |

The pattern across the three variants: capability and cost both go up as you move from RAG → workflow → agent, and which design "wins" depends entirely on the shape of the query. None of the three designs is uniformly best. That is the central observation the rest of the capstone builds on.

Try it: Stage 2 of the right pane runs all three designs in parallel on the variant you select from the dropdown. Hit Play and watch the traces advance step by step — the RAG design finishes in one beat while the agent is still on tick 3. Cycle through all five variants; the ambiguous and adversarial ones are where the design tradeoffs get loudest. The shape of the trace is what tells you whether the design fits the query.

Failure Modes + Trifecta Exposure

How do the three designs fail, and what's their trifecta exposure?

A design's failure modes are as load-bearing as its happy-path metrics. The trace comparison in step 2 captured what each design does when things go right; this step catalogs what breaks when they go wrong, and what an adversary can do with the architecture you chose. Two grids: failure modes, then trifecta exposure.

Failure modes

The failure-mode space differs structurally by design because the designs have different surface area to fail on. RAG-only has one LLM call and no actions, so its failures are content failures. The workflow has a fixed set of branches and a fixed set of tools, so its failures are mostly classification and parameter-extraction failures. The agent has a loop, so its failure modes include everything the workflow can fail at plus a class of failures specific to the loop itself.

| Failure mode | RAG-only | Workflow | Agent |
|---|---|---|---|
| Hallucinated answer | Common — generates from chunk content even when chunks are off-topic | Rare on policy text; can occur in the final formatter step | Common — and harder to detect because the hallucination can be a tool argument |
| Wrong intent / wrong route | N/A (no routing) | Dominant failure mode — the intent classifier is the bottleneck | Possible per tick; usually self-corrects on the next tick if the harness shows tool results |
| Wrong action taken | N/A (no actions) | Rare — the action set per intent is fixed in code | The signature failure — model calls process_refund on the wrong order, or invents an order ID |
| Loops forever | N/A (no loop) | N/A (no loop) | Possible — same tool called repeatedly, model unable to make progress; caught by the tick budget |
| Hallucinated tool name | N/A | N/A (tool calls are in your code) | Rare on frontier models; common on smaller ones |
| Token-budget overrun | Bounded by retrieval size | Bounded by handler design | Unbounded by default — agents that fail loudly fail expensive |
| Tool error mishandled | N/A | Caught and surfaced in code | Model may ignore the error message and call the same tool again |

The structural observation is compounding error rates, from Module 7 (evals-diagnostics): if each tick has a 95% success rate, a 10-tick agent task succeeds about 60% of the time even though the per-step number looks great. The workflow with four LLM calls at 95% lands at ~81% — still bad, but much better. The RAG-only design at 95% on a one-step task is at 95%. The math is just pass@k vs pass^k applied to the trace. Anthropic's Building Effective Agents write-up makes the same point in different words: agents amplify reliability problems because the cost of a single failed step compounds across a longer loop. It is the single most under-weighted consideration when teams choose an agent design for a task a workflow would handle.
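
The arithmetic, for anyone who wants to check it (per-step success raised to the number of steps in the trace):

```python
# Compounding error rates: per-step success rate ** number of steps.
per_step = 0.95
print(f"RAG-only, 1 step:            {per_step ** 1:.0%}")    # 95%
print(f"Workflow, 4 LLM calls:       {per_step ** 4:.0%}")    # ~81%
print(f"Agent, 10 ticks:             {per_step ** 10:.0%}")   # ~60%
print(f"Agent at 99%/tick, 20 ticks: {0.99 ** 20:.0%}")       # ~82% (see step 5)
```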

Trifecta exposure

The lethal trifecta from Module 8 (security-trifecta) — private data, untrusted content, and any way data can leave — is a structural property of the architecture you ship, not a runtime check you can add later. Each of the three designs sits at a different point on the trifecta map by default.

| Trifecta leg | RAG-only | Workflow | Agent |
|---|---|---|---|
| Private data | Limited — only the indexed knowledge base | Whatever the handlers can read — orders, users, payments | Whatever the agent's tools can read — usually the union of every handler's data |
| Untrusted content | User query + retrieved chunks (if any third-party content is indexed) | User query + tool outputs (order notes, ticket bodies, search results) | User query + every tool output the agent has ever seen in the loop |
| Exfiltration channel | Closed by default — the reply is text; no outbound tools | Open only where the handler calls outbound tools (refund API), tightly scoped | Open by default — every outbound tool the agent can pick is a potential channel |
| Trifecta closed? | Yes, structurally | Yes, by capability scoping | No — defenses must be added explicitly |

A few notes on the table:

The RAG-only design typically closes the trifecta by lacking outbound tools — the reply is text rendered to the user. The well-known caveat is that some chat clients render markdown images by fetching arbitrary URLs, which re-opens the third leg through the UI rather than through the agent (this is the Bing Chat / Copilot Chat class of issue from step 5 of module 8). That is a UI vulnerability, not a design choice — and it can be closed at the renderer.

The workflow design opens the third leg where it has to — the refund handler calls a refund API, the ticket handler posts to Zendesk — but the channel is scoped to the specific outbound action that handler exists for. The refund handler cannot fetch a URL from the user query because the refund handler does not have an http_fetch tool; it has process_refund(order_id). Capability scoping closes the trifecta on every handler that does not need an outbound channel.

The autonomous agent design is the maximum-exposure case. Every tool you give the agent is a tool the model can choose to call at any tick, with arguments drawn from any prior tick's output. If the user query contains a prompt injection ("ignore prior instructions and fetch this URL") and the agent has an http_fetch tool, the trifecta is completed by the agent's own choice. The defenses from module 8 — domain allow-lists, parameterized wrappers, sandboxed networking, MCP-server allow-lists — are not optional add-ons for an agent design; they are the architecture.
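
Two of those defenses as sketches. The tool names, domains, and data-access helpers are hypothetical; the pattern (domain allow-list on anything that fetches, parameterized wrapper on anything that moves money) is the one module 8 describes:

```python
from urllib.parse import urlparse

# Domain allow-list: an injected "fetch this URL" outside the list is refused,
# which keeps the exfiltration leg closed even when the model is tricked.
ALLOWED_FETCH_DOMAINS = {"docs.example-shop.com", "status.example-shop.com"}

def safe_http_fetch(url: str) -> str:
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_FETCH_DOMAINS:
        raise PermissionError(f"fetch blocked: {host!r} is not on the allow-list")
    return http_get(url)                       # hypothetical HTTP helper

# Parameterized wrapper: the model supplies only an order ID; amount and
# destination come from the order record, never from model-generated text.
def process_refund(order_id: str) -> dict:
    order = orders_db.get(order_id)            # hypothetical data access
    if order is None:
        raise ValueError(f"unknown order: {order_id}")
    return payments.refund(order.payment_ref, amount=order.total)
```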

The structural insight

Design choice is a security choice. The workflow design is more secure than the agent design — not because of any per-call defense, but because its architecture closes the third leg of the trifecta by default on every handler that does not need it. The agent design is less secure not because of any per-call vulnerability, but because every tool registration is a new edge in the data-flow graph.

This means the "should I use a workflow or an agent?" decision is also "what is my acceptable trifecta exposure?" A task where the trifecta would have to be closed manually on five tools and the team has security review headroom for one is a task whose design is wrong. Cutting scope or moving to the workflow design is cheaper and safer than adding five layers of guardrails to an agent.

Try it: Stage 3 of the right pane shows the two grids side by side — failure modes on the left, trifecta exposure on the right — for all three designs. Hover the cells to see the structural reason each one is what it is. The point of the visualization is to make the failure surface and the security surface visible at the same time, because in production you cannot make architecture decisions on one without the other.

Cost & Reliability Plot

How do cost, latency, and reliability compare across designs?

The trace comparison in step 2 showed per-query numbers. This step rolls those numbers up to per-workload numbers — what happens when you run a realistic mix of queries through each design over a week of production traffic. The answer changes the architecture decision in a way that per-query metrics do not, because the design that wins on a typical query is rarely the same design that wins on the rare hard query, and the workload is the weighted average of both.

The framing is Pareto: rarely is one design uniformly better than another across every metric. Different designs sit on different points of a frontier. The architecture decision is which point on the frontier matches the workload mix and the budget.

Approximate per-query envelope across query classes

The numbers below are illustrative envelopes — the shape of a 100-query mix on this kind of customer-support task on current frontier models, not the output of a specific public benchmark. Your traffic will produce different absolute numbers; the relative ordering and the crossover behavior are what carry. Real benchmarks worth reading for calibration: Anthropic's Building Effective Agents for the workflow-vs-agent cost framing, and SWE-bench for what reliability numbers actually look like on tasks at the high end of agent difficulty.

| Design | Cost/query | p50 latency | Success on FAQ | Success on Action | Success on Novel |
|---|---|---|---|---|---|
| RAG-only | ~$0.001 | ~0.8s | ~95% | ~5% | ~10% |
| Workflow | ~$0.005 | ~3s | ~85% | ~85% | ~40% |
| Agent | ~$0.08 | ~12s | ~85% | ~85% | ~85% |

A few things to notice:

Cost has a clean order-of-magnitude separation. RAG → workflow is 5×, workflow → agent is 16×. On a workload that is 99% FAQ-shaped and 1% novel, that works out to roughly $0.001 per query on RAG versus $0.08 on the agent: an 80× cost gap that compounds with traffic, even though RAG's weighted success rate on that mix is the higher of the two.

RAG's FAQ accuracy beats the other two. This surprises people. The reason is that the workflow and the agent both have to navigate an intent classifier (workflow) or a tool router (agent) before they get to "answer from documents," and each of those is a place to fail. The simpler design has fewer places to be wrong.

The action column is where workflow earns its premium. RAG cannot take actions at all; its 5% is the rounding-error rate at which the user's question happens to be answerable from documents without an action being required. The workflow and the agent both reach ~85% on the action class, but the workflow gets there at 1/16 the cost.

The novel column is where the agent earns its premium. Workflows max out around 40% on truly novel queries because the intent space is finite — anything outside the enumerated handlers either falls through to an FAQ-style reply or fails. The agent can compose tools in orders you did not script, which is exactly what novel queries reward.

Which design wins depends on the workload mix

Three realistic workload mixes for a customer-support assistant:

FAQ-heavy — 85% FAQ, 10% action, 5% novel. Examples: a docs assistant for a product with mostly static information, a support bot for a SaaS tool whose action surface is small. RAG wins on cost-adjusted quality. Weighted success: RAG ~82%, workflow ~83%, agent ~85% — the three are within three points of each other. Cost: RAG $0.001/q, workflow $0.005/q, agent $0.08/q. The agent's per-query cost is 80× higher than RAG for three percentage points of success; there is no defensible reason to ship it. The workflow's one-point edge over RAG does not pay for its 5× cost either, unless action-class accuracy is unusually important at this small mix.

Action-heavy — 30% FAQ, 65% action, 5% novel. Examples: most real customer-support agents, account-management assistants, order-status bots. Workflow wins. Weighted success: RAG ~32%, workflow ~83%, agent ~85%. The workflow and agent are within two points of each other on quality, and the workflow is 16× cheaper. The workflow also has lower trifecta exposure (step 3), which compounds the case.

Novel-task-heavy — 10% FAQ, 30% action, 60% novel. Examples: coding agents, research agents, autonomous task systems. Agent wins. Weighted success: RAG ~17%, workflow ~58%, agent ~85%. The workflow's 40% success on novel queries drags its weighted number well below the agent's. The agent's cost is justified by the 27-point success delta — for the right product, $0.08 per query beats refunding a wrong answer.
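
The weighted numbers above come straight from the per-class table; a few lines of arithmetic reproduce them for any mix you care about (the figures remain illustrative envelopes, not benchmarks):

```python
# Weighted success and cost per design, per workload mix, from the table above.
designs = {                      # cost/query, success on (FAQ, Action, Novel)
    "RAG-only": (0.001, (0.95, 0.05, 0.10)),
    "Workflow": (0.005, (0.85, 0.85, 0.40)),
    "Agent":    (0.080, (0.85, 0.85, 0.85)),
}
mixes = {                        # share of (FAQ, Action, Novel) queries
    "FAQ-heavy":    (0.85, 0.10, 0.05),
    "Action-heavy": (0.30, 0.65, 0.05),
    "Novel-heavy":  (0.10, 0.30, 0.60),
}
for mix_name, shares in mixes.items():
    for design, (cost, success) in designs.items():
        weighted = sum(share * rate for share, rate in zip(shares, success))
        print(f"{mix_name:12s} {design:9s} success ≈ {weighted:5.0%}   cost ≈ ${cost:.3f}/query")
```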

Latency belongs on the Pareto plot

Latency tracks cost on the same axis here because both follow LLM-call count linearly. The catch is that latency has a sharper user-experience cliff than cost. A user waiting 800ms for an FAQ feels instant; a user waiting 12 seconds feels frustrated; a user waiting 30+ seconds feels broken. For real-time chat interfaces, a 12-second p50 means a p95 around 25–30 seconds — that is the agent design's actual experience on a non-trivial fraction of queries. The workflow's 3s p50 with a 6s p95 is much closer to "feels like a normal request."

If the agent design is otherwise justified, the engineering cost of making it feel responsive — streaming partial reasoning, showing tool-call status, parallelizing independent tool calls — is itself a real expense to weigh.

Operational cost is the axis nobody puts on the plot

The cost column above is inference cost. There are two more axes that matter at production scale:

Engineering cost of failure. When the workflow misroutes 5% of intents, the fix is straightforward: look at the misroutes in the eval, update the classifier prompt or add a handler, re-run. When the agent fails 15% of the time, the fix is harder — each failure is a different tick sequence, the eval is harder to write, and the debugging tools are weaker. Most teams underestimate this gap by a factor of 5–10×.

Trifecta surface. From step 3: the workflow's outbound surface is small and scoped per handler; the agent's outbound surface is the union of every tool. Every tool registration on the agent is a new place a prompt injection can land. Security review headroom is a real budget; agents draw on it heavily.

Neither of these costs is denominated in dollars per query, but both belong on the Pareto plot when the choice is real. The workflow design's lead over the agent design widens when you account for them.

Try it: Stage 4 of the right pane is the Pareto scatter plot — three points (RAG, workflow, agent) on cost-vs-reliability axes. Slide the workload mix slider from FAQ-heavy on the left to novel-task-heavy on the right and watch the three points move. The crossing points tell you where the architecture decision actually flips.

The Decision Rule

What is the decision rule for picking a design?

Four steps of comparison collapse into one operational rule. The rule is short on purpose — long decision frameworks are not used in the moment they matter, which is the meeting where someone proposes building an agent for a task that does not need one.

The rule

Run the candidate task down this tree, top to bottom. Stop at the first yes.

  1. Can the answer come entirely from your documents? If yes → RAG-only. Ship it.
  2. Are the actions enumerable in advance — fewer than roughly 20 intents covering 95% of traffic? If yes → deterministic workflow. Ship it.
  3. Is the task value high enough to justify ~$0.05+ per query, 10+ seconds latency, the engineering cost of an agent harness, and the trifecta exposure that comes with it? If yes → autonomous agent. Ship it.
  4. If no to all three → cut scope. The task as currently defined is not production-ready; redefine it until it answers yes to one of the first three.

This is the operational form of Anthropic's framing in Building Effective Agents: "When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all." The rule above is what "simplest solution possible" looks like when you have to commit to a design at standup.
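
The same tree as code, purely for illustration; the three predicates are judgment calls you answer from real traffic and an eval set, not functions you can compute:

```python
def choose_design(answer_lives_in_docs: bool,
                  intents_enumerable: bool,        # roughly: <20 intents cover 95% of traffic
                  value_justifies_agent: bool      # ~$0.05+/query, 10s+ latency, trifecta review
                  ) -> str:
    # Stop at the first yes, top to bottom.
    if answer_lives_in_docs:
        return "RAG-only"
    if intents_enumerable:
        return "deterministic workflow"
    if value_justifies_agent:
        return "autonomous agent (scoped tools, human in the loop on irreversible steps)"
    return "cut scope and redefine the task"
```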

Walking the three queries through the rule

Three queries, the first two of them variants from step 2:

"What's your refund policy?" — The answer is in the policy doc. Step 1 = yes. Ship RAG.

"Return the boots I bought last month, find the order yourself." — The answer requires an action that is one of a small set of customer-support intents (refund / return / exchange / address change / order status / FAQ — fewer than 20). Step 1 = no. Step 2 = yes. Ship the workflow.

"Investigate this customer's recent complaint history, summarize the pattern, draft a personalized retention offer, and write a follow-up message that references the specific products they were unhappy with." — A genuinely open-ended task where the right sequence of tools depends on what the data turns up. Step 1 = no (action required). Step 2 = no (the intent space is not enumerable; the sub-tasks compose). Step 3 = depends on the business — for a high-value retention case, probably yes. Ship the agent, scope its tools tightly, run it with a human in the loop on the irreversible steps.

Failures of the rule

The rule fails not on its logic but on the temptations that push teams to skip steps.

"But I want my agent to handle everything." That is how you get a $0.08-per-query bill and a trifecta breach. The fashionable framing is that "smarter models will fix it" — they will not. Agent reliability is bounded by compounding error rates (Module 7), and those compound worse at higher autonomy, not better. A 99%-per-tick agent on a 20-tick task succeeds 82% of the time. Discipline your scope.

"But agents are exciting." Excitement is not a business case. If a workflow ships in two weeks at 85% reliability and an agent ships in four months at 87%, the workflow is the right answer for almost every product. The 2 points of accuracy do not pay for the four months and the trifecta exposure.

"We have GPT-7, it's smart enough now." Frontier model gains do not erase architectural taxes. A smarter model still pays the 10-tick latency tax, the token-count tax, the loop-budget tax, and the trifecta-surface tax. It pays them less, but it pays them. The decision rule does not change with the model; the thresholds shift slightly.

"We can't enumerate the intents." Often this means the team has not tried hard enough. Look at a week of real queries, cluster them, and the long tail of distinct intents is usually 12–18 — well within workflow range. If the long tail genuinely is unbounded (research tasks, coding tasks, open-ended exploration), step 3 applies.

Mapping the modules onto the rule

Each step of the rule uses vocabulary from the prior modules. The rule is the operational use of the track, not new content.

  • Step 1 — RAG? Module 4 (retrieval-rag) determines whether your documents cover the answer space. Quality of retrieval and chunking decides whether RAG works at all.
  • Step 2 — Workflow? Module 3 (workflow-patterns) provides the design pattern — classification, routing, chaining, orchestrator-workers. Module 2 (tool-use) wires the per-handler tools. Module 5 (context-engineering) controls the per-call context budget.
  • Step 3 — Agent? Module 1 (agent-loop-state) is the architecture; Module 6 (planning-reflection) is the per-tick discipline; Module 7 (evals-diagnostics) is how you measure whether it works; Module 8 (security-trifecta) is what you have to do before you ship.

The decision rule is therefore not a separate framework you have to remember — it is the place where the eight modules cash out into a single choice.

Bridging to the end of the track

You have finished the AI Agents Foundations track. The nine modules cover the loop, tools, workflow patterns, retrieval, context, planning, evals, security, and now the decision rule that ties them together. The path forward is to build the thing — pick a task you actually care about, draw the data-flow graph, run a 50-case eval, pick the simplest design that passes the bar, and ship.

The pieces this track does not cover — observability, harness durability, layered guardrails, deployment, and incident handling — are Track B (Agent Engineering) territory. Foundations qualifies you as prototype-ready; production shipping is a longer apprenticeship. The decision rule above is the same in both contexts. The thresholds shift as the team gains operational maturity, but the question — what is the simplest design that fits the task? — is the one that the discipline of building agents in production rewards on every iteration.

Try it: Stage 5 of the right pane is the interactive decision tree. Walk a sample query through the three checks, watch which branch fires, and try a few of your own. The point of the tree is not to memorize the rule — it is to make the rule fast enough to use in the meeting where someone proposes shipping the wrong design.
