
Agent Planning & Reflection Patterns

The Single-Shot Failure

What is the single-shot failure?

A language model that answers without a plan is committing to its first interpretation of the task — before checking any facts, before verifying scope, before observing any tool output. That is what single-shot execution means: one pass, no course-correction. For simple lookups the approach works fine; for anything that requires multiple dependent actions the failure rate climbs sharply, because a wrong assumption at step one poisons every step after it. Anthropic's Building Effective Agents draws the same distinction between workflows (deterministic pipelines) and agents (loops that adapt): the adaptability only helps if the loop has something to adapt on.

Why does planning change the outcome?

A plan forces the model to decompose a task before touching any tools. That decomposition is not magic — it is just more tokens before the first action. But those tokens let the model surface assumptions it would otherwise act on silently.

Consider the booking task in the simulation: Book a 60-minute meeting Tuesday afternoon with Alex (Tokyo) and Priya (Berlin), respecting both calendars and 9-6 working hours in each timezone. Send the invite.

Single-shot output commits immediately:

"I'll schedule a 60-minute meeting Tuesday at 2:00 PM. Alex's calendar looks clear (assuming afternoon is free). Priya is in a different timezone but 2 PM should work. Sending invite now to both attendees."

Three red failures follow: wrong timezone, calendar unchecked, premature send. The model answered with confidence because it never stopped to discover what it did not know.

The planned path runs differently:

  1. Convert the 9–6 working windows to UTC for both timezones (Tokyo is UTC+9, so 00:00–09:00 UTC; Berlin is UTC+1 in winter, so 08:00–17:00 UTC)
  2. Read Alex's and Priya's calendars — check Tuesday in each local timezone
  3. Find the overlap between the two working windows → Tue 08:00–09:00 UTC
  4. Send the invite for Tue 08:00–09:00 UTC (9:00 AM Berlin / 5:00 PM Tokyo)

Same model, same task, roughly three times the tokens. The plan does not add intelligence — it adds structure, and structure surfaces the facts the single-shot approach assumed away.
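
What the plan computes in steps 1 and 3 is mechanical. A minimal sketch using Python's standard zoneinfo (the specific Tuesday is illustrative, and a real agent would reach this through a tool call rather than inline code):

from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

UTC = ZoneInfo("UTC")

def working_window_utc(day, tz_name, start_hour=9, end_hour=18):
    # Build the local 9-6 window for the given day, then convert both ends to UTC.
    tz = ZoneInfo(tz_name)
    start = datetime(day.year, day.month, day.day, start_hour, tzinfo=tz)
    end = datetime(day.year, day.month, day.day, end_hour, tzinfo=tz)
    return start.astimezone(UTC), end.astimezone(UTC)

tuesday = date(2026, 1, 13)  # an arbitrary winter Tuesday (Berlin on CET, UTC+1)
tokyo_start, tokyo_end = working_window_utc(tuesday, "Asia/Tokyo")       # 00:00-09:00 UTC
berlin_start, berlin_end = working_window_utc(tuesday, "Europe/Berlin")  # 08:00-17:00 UTC

overlap_start = max(tokyo_start, berlin_start)
overlap_end = min(tokyo_end, berlin_end)
if overlap_end - overlap_start >= timedelta(minutes=60):
    print(f"Overlap: {overlap_start:%H:%M}-{overlap_end:%H:%M} UTC")  # 08:00-09:00 UTC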

Why the plan is the time-to-think

Plan-then-execute is a name for the two-phase pattern: spend tokens on a plan, then execute each plan step as a separate tool call. The planning phase is not a separate model or a separate prompt; it is the same generation mechanism, just earlier in the trace. Chain-of-thought (CoT) reasoning works for exactly the same reason — it places more tokens before the answer, giving the model more surface area to catch its own errors before committing. (For the generation-level view of how those tokens are produced, see Generation.)
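
A minimal sketch of the pattern in Python, where call_model and run_tool are illustrative stand-ins for an LLM client and a tool dispatcher, not a specific SDK:

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def run_tool(instruction: str, context: list[str] | None = None) -> str:
    raise NotImplementedError("wire up your tool dispatcher here")

def plan_then_execute(task: str) -> str:
    # Phase 1: plan. More tokens before the first action.
    plan = call_model(f"Task: {task}\nList the steps needed, one per line. Do not act yet.")
    steps = [s for s in plan.splitlines() if s.strip()]

    # Phase 2: execute each step as a separate tool call, accumulating results
    # so later steps can see what earlier steps actually returned.
    results: list[str] = []
    for step in steps:
        results.append(run_tool(step, context=results))

    return call_model(f"Task: {task}\nStep results: {results}\nWrite the final answer.")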

The second scenario in the simulation makes the same point for a code task: Refactor 12 React files to use the new useAuth hook, replacing all legacy getSession() calls. Ensure no files are missed and tests still pass.

Without a plan: three files updated, nine skipped, no test run. With a plan: grep all call sites first, categorise by import pattern, apply to all twelve, run the suite. Completeness comes from the plan step that inventoried the scope — not from the model being smarter.
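
That inventory step is ordinary code. A hypothetical version of it (the src directory, the *.tsx glob, and the getSession name come from the scenario, not a real repo):

import re
from pathlib import Path

# Plan step 1: inventory every legacy call site before editing anything.
call_sites: dict[Path, list[int]] = {}
for path in Path("src").rglob("*.tsx"):  # widen the glob for .ts/.jsx as needed
    hits = [i + 1 for i, line in enumerate(path.read_text().splitlines())
            if re.search(r"\bgetSession\s*\(", line)]
    if hits:
        call_sites[path] = hits

print(f"{len(call_sites)} files to refactor")  # the plan expects 12
for path, lines in sorted(call_sites.items()):
    print(path, "->", lines)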

What planning does not fix

Planning helps with agentic failures — wrong scope, unchecked assumptions, premature action. It does not help with knowledge failures: if the model does not know that Tokyo is UTC+9, a plan will not rescue it. It also does not help if the plan itself is wrong from the start. Later steps in this module cover the complementary mechanisms: pausing after each tool call to observe the result (Step 3), reflecting on failure before retrying (Step 4), and knowing when to stop (Step 5). The next step — reasoning budget — addresses the economics of planning: when the extra tokens are worth spending.

Try it: Hit Play in Stage 1 and watch the planned trace build step by step. Switch the scenario toggle between "Team meeting" and "Auth refactor" to see how both tasks follow the same pattern: the single-shot column fails with red markers, the planned column succeeds by materialising the assumptions before acting on them.

Step 2 frames the planning cost as a dial — not every task warrants the same depth of reasoning.

When to Spend More Tokens

What is a reasoning budget?

A reasoning budget is the cap you put on how many tokens an agent spends thinking before answering. The dial runs from zero (direct answer) to tens of thousands (deliberate multi-attempt reasoning).

When does spending more tokens help?

More reasoning is not free, and it is not always worth it. The reasoning budget is the number of tokens you allocate to deliberation before the model commits to an answer — and like any budget, the right amount depends on what you are buying. For a lookup task the payoff is nearly zero; for open-ended research the curve keeps climbing even at 25,000 tokens of reasoning. The question every agent designer must answer is not "should I use reasoning?" but "how much reasoning does this task class justify?"

The four task classes

The simulation in Stage 2 plots accuracy against a four-stop budget scale (0 → 1k → 5k → 25k tokens) across four task classes. The pattern is consistent: lookup tasks peak early, complex tasks peak late, and the curves have very different slopes.

Lookup — "What's the current S&P 500 close?" — scores 95% with zero reasoning tokens. Raising the budget all the way to 25k tokens buys a single additional point of accuracy. This is not a reasoning problem; it is a facts problem. The model either has the fact or it does not, and CoT cannot conjure information that was not in the training data.

Arithmetic (multi-step) — "Compound interest on $5k @ 6% for 23 years" — starts at 60% and climbs to ~78% at 25k. Most of the gain comes in the first 1k–5k range (brief planning catches carry errors); gains flatten after that. 5k is the practical ceiling — the additional 20k tokens return less than 5 accuracy points.
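
For reference, the target of that task is a single expression; the carry errors happen in the intermediate steps the model runs in its head:

principal, rate, years = 5000, 0.06, 23
print(round(principal * (1 + rate) ** years, 2))  # 19098.75, assuming annual compounding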

Code debug — "Find the race condition in this 200-line server" — starts at 25% and climbs steeply to ~78% at 5k. The gain here is earned by reasoning through branching control flow, which is exactly the kind of structured deliberation CoT helps with. The curve flattens above 5k, so the sweet spot is ~5k: enough budget to reason carefully, not enough to spiral into over-analysis.

Agentic research — "Compile competitive analysis from 3 sources, deduplicate, summarize" — starts at 18% and reaches 88% at 25k while still climbing. This class keeps returning on every extra token because the problem itself is open-ended: there is always another source to consult, another conflict to resolve, another angle to include.

All four curves in the simulation are illustrative, LAV-curated shapes, not benchmarks: real accuracy curves depend on the model, the prompt, and the specific task. The teaching is the shape, not the numbers.

How modern reasoning models change the framing

Systems like OpenAI's o1 and o3 and Anthropic's extended thinking internalize the CoT. The user-facing artifact is the answer; the reasoning chain is generated but may be hidden or summarised. This does not change the underlying economics — it just automates the budget decision. The model learns (from reinforcement training) how much deliberation each task class tends to need and spends accordingly. For the agent designer, the practical upshot is that you may not directly dial a token count; instead you set a max-thinking-budget or choose a reasoning preset, and the model allocates within that.
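
Concretely, with Anthropic's Python SDK the budget is a request parameter (shape as of this writing; check the current docs, and note that max_tokens must exceed the thinking budget):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5",  # any model that supports extended thinking
    max_tokens=8000,
    thinking={"type": "enabled", "budget_tokens": 5000},  # the reasoning-budget dial
    messages=[{"role": "user", "content": "Find the race condition in this server: ..."}],
)
for block in response.content:  # thinking blocks (possibly summarised), then text
    if block.type == "text":
        print(block.text)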

CoT, extended thinking, and the planning phase from Step 1 are all mechanically the same thing: tokens generated before the final answer that give the model more surface area for self-correction. The generation mechanism is identical to any other token prediction — it is just earlier in the trace. (Generation covers the per-token sampling loop that produces all of them.)

Reasoning is a cost-vs-accuracy decision

Treat the reasoning budget as a dial, not a switch. The right position depends on three factors:

  • Task class — is accuracy still climbing at higher budgets, or has it already flattened?
  • Latency — reasoning tokens add wall-clock time; a user-facing request often cannot afford 25k tokens of internal deliberation.
  • Cost — reasoning tokens are billed just like output tokens on most APIs; the per-call cost of agentic-research scale reasoning is meaningfully higher than a brief plan.

A common practical rule: start at 1k, check the task class, raise to 5k if the class benefits and the use case tolerates the latency, and only go to 25k+ for async, non-latency-sensitive, genuinely open-ended research tasks.
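
That rule is easy to encode as a lookup plus a latency guard. The classes and numbers below are the illustrative ones from the simulation, not universal constants:

# Illustrative defaults per task class (tokens of deliberation).
BUDGETS = {
    "lookup": 0,                 # facts problem; reasoning does not help
    "arithmetic": 5_000,         # gains flatten past 5k
    "code_debug": 5_000,         # sweet spot before over-analysis
    "agentic_research": 25_000,  # open-ended; still climbing
}

def reasoning_budget(task_class: str, latency_sensitive: bool = True) -> int:
    budget = BUDGETS.get(task_class, 1_000)  # unknown class: start at 1k
    # A user-facing request often cannot afford 25k tokens of deliberation.
    return min(budget, 5_000) if latency_sensitive else budget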

Try it: In Stage 2, select each task class and step through the budget stops left-to-right. Watch the accuracy dot move along the curve and read the insight panel that explains where the gains come from and where they stop. The faded lines for other task classes stay visible — they make the shape comparison concrete.

Step 3 turns from the planning budget to what happens between actions: the pause-to-observe beat that lets an agent change its mind based on what a tool actually returns.

When to Pause and Observe

What is ReAct?

ReAct (Reason + Act) is the loop where an agent interleaves Thought, Action, and Observation on every turn. The forced Observation step lets the agent change its mind based on tool output instead of committing to a plan made in ignorance.

Why does pausing to observe matter?

An agent that issues actions without reading their results is flying blind. Every tool call produces output — a search result, a file read, an API response — and that output may confirm, refute, or redirect the plan the agent was following. The Observation beat in ReAct (Yao et al. 2023) is the forced pause between a tool call and the next thought. That pause is what gives the agent a chance to change its mind.

ReAct stands for Reason + Act. The loop has three alternating steps: Thought (the model reasons about the current state), Action (the model calls a tool), and Observation (the tool's output is read back into context). Then the cycle repeats. The insight in Yao et al. is that the Observation step is not just bookkeeping — it is the mechanism by which the agent stays grounded in the world rather than in its own assumptions.
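
Stripped to a skeleton, the loop is a few lines. This sketch reuses the call_model and run_tool stand-ins from the plan-then-execute sketch, with a deliberately naive action format:

def react_loop(task: str, max_turns: int = 10) -> str:
    context = f"Task: {task}"
    for _ in range(max_turns):
        # Thought: reason over everything observed so far.
        thought = call_model(context + "\nThought:")
        context += f"\nThought: {thought}"

        # Action: either a tool call or the final answer.
        action = call_model(context + "\nAction:")
        if action.startswith("FINAL:"):
            return action.removeprefix("FINAL:").strip()

        # Observation: the tool's output is appended BEFORE the next thought,
        # so the next thought has to attend to it.
        observation = run_tool(action)
        context += f"\nAction: {action}\nObservation: {observation}"
    return "Turn budget exhausted without a final answer."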

The Anthropic HQ task

The task in the Stage 3 simulation is: "What's the population of the city where Anthropic is headquartered?"

On the surface it looks like a two-step query. But watch what happens when the Observation step is stripped away — the "Without observation feedback" mode.

Without observations, the agent collapses to:

  1. Thought — Anthropic is a big AI company — probably HQ'd in NYC like most tech firms. I'll search NYC population.
  2. Action — web_search("NYC population")
  3. Final Answer — About 8.4 million. ✗ Wrong city — agent never looked up HQ.

The agent assumed Anthropic is in New York City and confidently returned the NYC population. The answer is wrong — off by 10× — because the city assumption was never verified. This is the failure mode that observation prevents: the agent committed to a wrong premise at step 1 and had no mechanism to discover it before committing to a final answer.

The full ReAct trace confirms the HQ before acting on it:

  1. Thought — I need to find Anthropic's HQ first, then look up its population.
  2. Action — web_search("Anthropic headquarters location")
  3. Observation — Result: Anthropic is headquartered in San Francisco, CA.
  4. Thought — HQ confirmed. Now I need population, not address — my next query needs the 'population' keyword.
  5. Action — web_search("San Francisco population 2024")
  6. Observation — Result: ~808,000 (2024 estimate).
  7. Final Answer — Anthropic is headquartered in San Francisco; population ~808,000.

Four extra steps, zero unverified assumptions. The Observation at step 3 is the moment the agent earns the right to proceed — it now knows the HQ rather than assuming it. The Thought at step 4 then refines the next query based on what was just learned.

The think tool — mid-loop reasoning

Anthropic introduced a designated mid-loop reasoning primitive called the think tool. It is distinct from extended thinking (which fires at start-of-turn): the think tool fires between tool calls, inside the loop, as a deliberate scratchpad for the model to reason about what a tool result means before deciding on the next action.

In the simulation, toggling "+ think tool" inserts a violet Think card between steps 3 and 4 — after the first Observation lands and before the next Action fires:

The result confirms HQ. Now I need population, not address. Next query needs 'population' keyword, not 'location'.

That card is not visible to the user; it is a hidden reasoning step that refines the next query. The think tool does not add new information — it adds explicit deliberation over information already in the context, preventing the kind of sloppy query that a rushed action step would emit.
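
Mechanically, there is almost nothing to it: a tool whose handler does no work, so that calling it forces a deliberation turn into the trace. A sketch along the lines of Anthropic's published definition (wording approximate):

think_tool = {
    "name": "think",
    "description": (
        "Use this tool to think about what a tool result means before acting. "
        "It will not obtain new information or change anything; "
        "it just records the thought."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"thought": {"type": "string"}},
        "required": ["thought"],
    },
}

def handle_think(thought: str) -> str:
    # Deliberately a no-op: the value is the reasoning that lands in the trace.
    return "ok"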

What the blind path teaches

The counterfactual is the clearest lesson. Every wrong answer a blind-action agent gives has the same root cause: an action was committed before a prior action's result was read. The agent was not unintelligent — it was uninformed at the moment it acted.

The Observation step is, mechanically, just the tool's output appended to the context. But appending that output before the next Thought, and requiring the Thought to attend to it, is what separates an agent that adapts from an agent that merely executes a plan it wrote in advance.

Try it: Hit Play in Stage 3 and watch the full trace build card by card. Then toggle "Without observation feedback" and Replay. The trace collapses from seven steps to three, and the Final Answer card shows the ✗ wrong-city error. Turn on "+ think tool" in the full trace to see the violet Think card appear after the first Observation.

Step 4 takes the next failure mode: what happens when the agent does pause, does observe, but still produces a wrong answer — and how written reflection turns that failure into a lesson for the retry.

When to Retry

Why does retrying without reflecting fail?

When an agent fails a task and tries again with the same strategy, the second attempt has no more information than the first. The model still does not know why it failed; it just knows that it did. Three identical retries produce three identical failures. Reflexion, introduced in Shinn et al. (2023), replaces identical retries with a deliberate learning loop: after each failure, the agent generates a written reflection — a diagnosis of what went wrong — and feeds that reflection into the next attempt. The reflection is what carries forward, not just the score.

The parse_invoice task

The task in the Stage 4 simulation is: Write parse_invoice(text) that extracts vendor name, total amount, currency, and due date. Pass the 4 unit tests.

Four unit tests act as the verifier — a programmatic check that gates progress to the next attempt. The verifier does not care why the code failed; it just reports which tests passed and which did not. That binary signal is what the reflection mechanism amplifies into something learnable.

Attempt 1 starts with a simple date and amount regex:

date = re.search(r'(\d{4}-\d{2}-\d{2})', text)
amount = re.search(r'\$(\d+\.\d{2})', text)

Tests T1 and T2 pass; T3 and T4 fail (2/4). The reflection diagnoses the gap:

Test 3 failed: date regex assumed YYYY-MM-DD; the test uses 'March 14, 2026'. Test 4 failed: I parsed '$' assuming USD, but the test uses '€'. Need to handle multiple date formats and currency symbols.

Attempt 2 incorporates both lessons:

date = dateutil.parser.parse(text, fuzzy=True)  # fuzzy: tolerate surrounding invoice text
currency = re.search(r'([€$£])([\d.]+)', text)

T1–T3 now pass; T4 still fails (3/4). The reflection identifies the remaining gap:

Test 4 still fails: I detected the currency but didn't normalize to a comparable amount. The test expects raw amount + a separate currency field, not USD-converted. I was over-engineering.

Attempt 3 applies the correction:

m = re.search(r'([€$£])([\d.,]+)', text)
return {"amount": float(m.group(2).replace(",", "")), "currency": m.group(1)}

All four tests pass. The ✓ Done badge appears.

Toggle "No reflection — just retry" and replay. All three attempts are identical: same code, same 2/4 score, no progress. Without a reflection, the retry loop has no mechanism to distinguish attempt 2 from attempt 1.

What a verifier is

A verifier is a programmatic check that gates progress — a test suite, a schema validator, a type-checker, or an external API call that returns success or failure. In the simulation the verifier is four unit tests shown as colored circles (green = pass, red = fail). Verifiers matter for two reasons beyond the immediate task.

First, they give the agent a clear stopping criterion: if the verifier passes, the task is done. Without that signal, the agent must rely on its own judgment about whether its output is good enough — a judgment that is unreliable for exactly the tasks where agents struggle.

Second, the field has converged on the view — associated with Will Brown's Verifiers framework — that programmatic checks double as training signals and eval artifacts. A verifier that was written to gate a single agent run can be reused as a reward signal in reinforcement fine-tuning, and as an eval harness when the model or prompt changes. Writing a verifier once pays three dividends: it gates the current run, trains the next model version, and measures the system over time.
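
In code, all three dividends come from the same function. A schematic, not the Verifiers framework's actual API (UNIT_TESTS is an assumed list of predicate functions):

def verifier(candidate: str) -> float:
    results = [test(candidate) for test in UNIT_TESTS]  # e.g. the four invoice tests
    return sum(results) / len(results)  # 1.0 = all tests pass

# 1. Gate:   stop the agent loop when verifier(code) == 1.0
# 2. Reward: feed the same scalar to a reinforcement fine-tuning run
# 3. Eval:   average it over a task suite whenever the model or prompt changes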

Why verbal reflection works

Shinn et al. frame Reflexion as "verbal reinforcement learning": the model updates its own behavior not by adjusting weights (which requires a training loop) but by writing a natural-language diagnosis that becomes context in the next attempt. The reflection is essentially a compressed lesson: here is what I tried, here is where it broke, here is what that implies about the next strategy.

The mechanism requires that the reflection be specific — not "I should try harder" but "the date regex assumed YYYY-MM-DD, and the failing test uses a spelled-out month." Specificity is what makes the lesson actionable in the next attempt. A vague reflection is structurally identical to no reflection: it cannot guide a different strategy.

Try it: Play Stage 4 and watch each attempt reveal its verifier circles and then its reflection card. Note how each reflection in the violet side card points to the specific failing test, not to a general quality goal. Then toggle "No reflection — just retry" and replay — all three attempts are identical. The counterfactual makes the mechanism concrete: three retries, same score, no progress.

Step 5 addresses the last piece: once the verifier might never pass (or the budget runs out), the agent needs a rule for when to stop rather than loop indefinitely.

When to Stop

How does an agent know when to stop?

An agent without a stopping rule has no way to exit a loop other than running out of compute. Without a rule, it will retry past the point of diminishing returns, spend tokens on analysis it cannot act on, and occasionally loop indefinitely — never committing to a final answer because no exit condition ever fires. The fix is explicit: write down the stopping condition before the agent starts.

Three rules cover most production cases:

  • Verifier passes — a programmatic check signals success.
  • Budget exhausted — a pre-set token or step cap is reached; return the best available answer.
  • Ask the user — the agent has hit a decision that requires information or authority it does not have; escalate rather than guess.

Each rule has a different cost profile and a different failure mode, which is exactly what the Stage 5 simulation shows.

The bug-hunt task

The task in Stage 5 is: "Find the bug causing intermittent 502s on the /checkout endpoint."

The simulation runs three columns simultaneously — one per stopping rule — so you can see how the same task terminates differently under each.

Verifier-only stops in four ticks (~3,200 tokens). The agent finds the slow DB query, adds connection-pool config, redeploys, and monitors. At tick 4 the monitoring check returns zero new 502s. The verifier passes: ✓ bug fixed. This is the cleanest outcome: a programmatic signal ends the loop on objective grounds.

Budget-only (cap = 6) stops at six ticks (~5,800 tokens). The agent scans more broadly — checking the payments service, upstream timeouts, adding monitoring — but the budget cap fires before a conclusive signal arrives. ⚠ Budget reached — partial answer. The agent has done real work but cannot declare success because the verifier never ran (or was never defined). The partial answer may still be useful, but the caller must decide what to do with it.

Ask-user (escalate) stops at two ticks (~1,800 tokens). After two steps the agent identifies two plausible root causes (DB lock vs. upstream timeout) and determines that differentiating them requires production DB access it does not have. ↑ Escalated — needs human input. This is the correct behavior: an agent that guesses between two plausible diagnoses and picks wrong has done more harm than one that surfaced the ambiguity immediately.

The runaway counterfactual

Toggle "No stopping rule" in the simulation. The tick counter advances past 30, past 50, climbing indefinitely while the agent continues to analyze log entries it has already analyzed. Without an exit condition the loop has no objective ground to stand on — it cannot distinguish "I have done enough" from "I could do more." This is not a pathological edge case; it is the default behavior of any retry loop that lacks a termination criterion.

The runaway mode illustrates something specific: the agent is not obviously doing the wrong thing at any individual step. Each tick looks locally reasonable. The failure is structural — the absence of a rule, not the presence of bad reasoning.

Choosing the right rule

The three rules are not mutually exclusive. Production agents often combine them: run until the verifier passes, cap at a budget to prevent runaway, and escalate if a specific condition fires (e.g. "requires permission I do not have"). The order matters: check the verifier first (success), then the escalation condition (known blocker), then the budget (time's up).
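
A sketch of that ordering inside a single loop, where take_step, verifier_passes, needs_human, and best_so_far are assumed helpers:

def run_agent(task: str, max_ticks: int = 10) -> tuple[str, object]:
    result = None
    for tick in range(max_ticks):
        result = take_step(task, result)
        if verifier_passes(result):  # 1. success: clean exit
            return ("done", result)
        if needs_human(result):      # 2. known blocker: escalate rather than guess
            return ("escalated", result)
    # 3. budget: the loop bound itself is the stopping rule of last resort.
    return ("budget_exhausted", best_so_far(result))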

A few practical considerations:

  • Define the verifier before the loop starts. If you cannot define a programmatic success criterion, you are not ready to run the agent autonomously.
  • Make the budget explicit in the prompt. Telling the agent "you have at most 10 tool calls" lets it allocate effort, not just cut off mid-thought.
  • Escalation is a feature, not a failure. An agent that escalates cleanly after two steps is more useful than one that runs for 30 ticks and returns a low-confidence guess.

For tasks with multiple sub-goals, each sub-goal can carry its own stopping rule — a pattern closely related to running sub-tasks in isolated subagent threads so their loops do not bleed into the parent's context. (See Context Engineering — Subagents step for the full treatment of context isolation via subagents.)

Once you have stopping rules in place, the natural next question is: how often does the agent actually finish correctly? That is the entry point for Module 7 (Evals & Diagnostics), which builds the measurement layer on top of the loop mechanics covered in this module.

Try it: Hit Play in Stage 5 and watch all three columns advance simultaneously. Note the tick counts and token spends when each column terminates. Then toggle "No stopping rule" — the counter climbs past 30 ticks with no exit. The contrast between the three clean terminations and the runaway is the key lesson: stopping rules are the structural requirement that makes an agent controllable.
