
Context Engineering for AI Agents

Context as Scarce Resource

What is context engineering?

Context engineering is the discipline of deciding what goes into the context window — and what comes back out — across many turns of an agent's loop. Andrej Karpathy described it as "the delicate art and science of filling the context window with just the right information for the next step." Prompt engineering writes a single good prompt; context engineering manages a budget over time, choosing what to keep, what to summarize, and what to drop as the conversation grows. Anthropic frames it the same way in Effective context engineering for AI agents (Sep 2025): the context window is the agent's working memory, and how you fill it is now its own discipline.

Why context fills faster than you think

The window looks like an infinite scratchpad, but three pressures all draw on the same fixed budget:

  • Conversation history — every user turn and every model reply, accumulated verbatim. A 20-turn support chat lands somewhere around 4–6k tokens before any tool runs.
  • Tool outputs — these are usually the largest contributor. A single search-tool call that returns 15 chunks can be ~2,500 tokens, and an agent often runs several per turn.
  • System overhead — system prompt, few-shot examples, and tool definitions are paid every turn, even when they don't change. A 50-tool kit alone is typically ~10k tokens (full treatment in Step 4).

Add those up across a real session and a 16k window fills shockingly fast.
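
A back-of-the-envelope version of that arithmetic, using the illustrative figures above (every number is an order-of-magnitude estimate, not a measurement):

```python
# Rough budget arithmetic with the illustrative figures from this step;
# every number is an order-of-magnitude estimate, not a measurement.

WINDOW = 16_000                  # total context window

system_overhead = 10_000         # system prompt + few-shot examples + a 50-tool kit
history_20_turns = 5_000         # ~4-6k tokens of accumulated chat, verbatim
tool_outputs = 3 * 2_500         # three search calls at ~2,500 tokens each

used = system_overhead + history_20_turns + tool_outputs
print(f"{used:,} / {WINDOW:,} tokens ({used / WINDOW:.0%} of the window)")
# -> 22,500 / 16,000 tokens (141% of the window): over budget before the next reply
```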

Quality is a curve, not a cliff

Performance doesn't fall off a cliff at some specific fill percentage; it degrades gradually as the window fills with low-signal material. Anthropic explicitly notes this degradation in the post linked above without committing to a universal threshold — model, task, and content all shift the curve. The ~70% inflection drawn on the meter in the simulation is illustrative, LAV-curated demonstration data, not a hard number to plan against. The takeaway is the shape: the very tokens you keep adding are what makes the model dumber.

This is why context engineering matters. The conversation that filled the window is the same conversation the model now needs to attend to. Every byte you decline to keep is attention you give back to the bytes that mattered.

Multi-turn agents reuse the same prefix across turns, so the model's KV cache makes long contexts cheaper than they look — but cheaper to compute is not the same as cheaper to think about. Compute cost falls; attention dilution doesn't.

Try it: Hit Play in Stage 1 and watch the fill bar climb turn by turn. The two quality bars run side by side: Naive slumps as the window fills; With compaction barely moves. The growing gap between them is exactly the problem Step 3's first fix flattens.

Step 2 names the four specific ways context goes wrong once the window fills.

The 4 Failure Modes

Why does context break down?

A full window is rarely the whole story — what fills it matters more than how full it is. Drew Breunig's How Long Contexts Fail (June 2025) names four flavors of breakdown that show up over and over in production agents: poisoning, distraction, confusion, and conflict. The four are categories, not the only ways context can break — but they're the most useful shared vocabulary for triaging context bugs, and each maps cleanly to a fix family in Step 3.

Poisoning

A wrong fact — usually a hallucination from an earlier turn or a malformed tool output — gets persisted into the conversation state and corrupts every downstream turn. Once the model has "agreed" that the user's account is a Pro plan when it isn't, every subsequent answer leans on that false premise. Example: an early turn confidently records a non-existent function signature, and twelve turns later the agent is still calling it. The fix family is structured note-taking and retrieval discipline (Step 3) — pin the canonical facts to a place that's harder to corrupt than free-form chat history.

Distraction

The right information is technically in the window, but it's buried in so much irrelevant material that the model misses it. This is the same shape as the lost-in-the-middle effect from Liu et al. (2023): models attend disproportionately to the start and end of long contexts, and the middle drifts out of focus. Example: the user's actual question is sentence three of turn 18, but turns 4–17 are stale tool dumps, so the model answers the wrong question. The fix family is compaction (drop the noise) and just-in-time retrieval (don't put the noise there in the first place).

Confusion

Overlapping or ambiguous instructions — a system prompt, several few-shots, and a user note that don't quite agree — leave the model triangulating between authorities. Example: the system prompt says "always cite sources", a few-shot shows uncited replies, and the user just said "be brief". Which wins? The model's guess varies turn to turn. The fix family is context quarantine (let one specialist subagent see one consistent slice) and tighter prompt hygiene.

Conflict

Turn N contradicts turn 1 because nothing summarized or resolved the earlier commitment. The user said "use Postgres" on turn 2 and "actually, use SQLite" on turn 14, and both statements still sit in the window with equal weight. The model has no signal that the second one supersedes the first. The fix family is compaction with explicit supersession — summarize old turns into a single canonical state, so the latest decision is the only one in the window.
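
A minimal sketch of that supersession idea, assuming a hypothetical extract_decisions helper that pulls (key, value) commitments out of a turn:

```python
# Illustrative supersession pass. extract_decisions is a hypothetical helper
# that returns (key, value) commitments from one turn, e.g. ("database", "Postgres").

def canonical_state(history, extract_decisions):
    decisions = {}
    for turn in history:                     # oldest to newest
        for key, value in extract_decisions(turn):
            decisions[key] = value           # a newer decision overwrites the older one
    return decisions

# Turn 2 yields ("database", "Postgres"); turn 14 yields ("database", "SQLite").
# canonical_state(...) keeps only {"database": "SQLite"}: the latest decision wins.
```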

Try it: Pick any of the four cards in Stage 2 to see the failure trace. Toggle the mitigation switch in each card to watch the same trace under the corresponding fix — Step 3 covers each fix in detail.

Step 3 walks the four fixes the field has converged on.

The 4 Fixes

How do we keep context healthy?

Anthropic's Effective context engineering for AI agents (Sep 2025) and Drew Breunig's follow-up How to Fix Your Context (June 2025) converge on roughly the same four-fix toolkit: compaction, structured note-taking, context quarantine, and just-in-time retrieval. None is universally best. Each pays a different tax — fidelity, round-trips, complexity — and each addresses a different subset of Step 2's failure modes. The job in production is picking the right one (often more than one) for the workload.

  • No fix (baseline). Targets: nothing; every failure mode arrives in time. Cost: tokens grow unbounded; quality degrades.
  • Compaction. Targets: distraction, conflict. Cost: fidelity loss in summaries; one extra LLM call per compaction.
  • Structured notes. Targets: poisoning, conflict. Cost: requires a schema and write discipline; another surface to maintain.
  • Context quarantine. Targets: confusion, distraction. Cost: more total tokens (subagents are expensive; see Step 5).
  • Just-in-time retrieval. Targets: distraction. Cost: extra round-trips; latency per turn.

Compaction periodically summarizes old turns into smaller blocks — the canonical reference is the auto-compaction loop in Anthropic's Claude Code coding agent, where superseded steps collapse into a one-paragraph state summary so the active window stays focused on the current task. The cost is fidelity: a summary of fifteen turns is necessarily lossy.
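
A minimal compaction sketch, with summarize_turns standing in for that one extra LLM call; the thresholds are arbitrary examples, not Claude Code's actual defaults:

```python
# Illustrative compaction loop; not Claude Code's implementation.
COMPACT_AT = 12_000   # compact once live history passes this many tokens (arbitrary)
KEEP_RECENT = 6       # always keep the newest turns verbatim (arbitrary)

def maybe_compact(history, summarize_turns, count_tokens):
    if sum(count_tokens(t) for t in history) < COMPACT_AT or len(history) <= KEEP_RECENT:
        return history                       # plenty of headroom, do nothing

    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = summarize_turns(old)           # the extra LLM call; lossy on purpose
    return [summary] + recent                # superseded steps collapse into one block
```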

Structured note-taking moves durable state out of free-form chat history and into an external scratchpad — a progress.md or NOTES.md file the agent reads and writes by tool call. The advantage is that pinned facts are harder to poison than rolling chat context, and the agent can re-read them on demand instead of trusting that they survived in the window.
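
A sketch of the two tool handlers such an agent might expose; the file name and function names here are illustrative, not any framework's API:

```python
# Illustrative note-taking tools backed by a NOTES.md scratchpad on disk.
from pathlib import Path

NOTES = Path("NOTES.md")

def read_notes() -> str:
    """Tool: return the pinned facts so the agent can re-read them on demand."""
    return NOTES.read_text() if NOTES.exists() else ""

def write_note(section: str, fact: str) -> str:
    """Tool: append a durable fact instead of trusting it to survive in chat history."""
    with NOTES.open("a") as f:
        f.write(f"\n## {section}\n- {fact}\n")
    return f"Recorded under '{section}'."
```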

Context quarantine runs subtasks in isolated subagent threads, then crosses only condensed summaries back to the lead. Step 5 is the full treatment — including the honest tradeoff that quarantine often costs more total tokens, not fewer.

Just-in-time retrieval is RAG applied to the agent's own working memory: instead of preloading every potentially-relevant chunk into the system prompt, fetch only what the current step needs. Module 4 (Retrieval & RAG) is the mechanism; the context-engineering reframe is when. Pull at the moment of use, not at session start.
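
A per-step sketch of pulling at the moment of use; retriever stands in for any Module 4 search function, and the deliberately small k is the point:

```python
# Illustrative just-in-time retrieval: nothing is preloaded at session start.
# `retriever` is any search function (vector store, BM25, ...) returning text chunks.

def build_step_context(current_step: str, retriever, k: int = 3) -> str:
    chunks = retriever(current_step, top_k=k)   # fetch at the moment of use
    return "\n\n".join(chunks)                  # only these chunks enter the window

# The agent calls build_step_context(step, retriever) right before acting on a step,
# instead of stuffing every maybe-relevant document into the system prompt up front.
```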

Try it: Cycle through the five strategies in Stage 3 and watch what survives in the window. Compaction trades fidelity for tokens; quarantine trades token count for attention quality; just-in-time trades round-trips for the smallest possible window. The same 20-turn conversation lands in a different shape under each.

Step 4 turns the same lens on a context bucket that often gets forgotten — tool definitions.

Tool-Loading Strategy

Why budget for tool definitions?

Tool definitions are part of the prompt. Every tool's name, description, and JSON schema get serialized into the system message every turn, whether the agent calls it or not. A typical tool definition runs ~100–300 tokens with a useful description and a non-trivial schema, so a 50-tool kit lands somewhere around 5–15k tokens of pure overhead before the user has typed a single character. Anthropic's Introducing advanced tool use on the Claude Developer Platform (Nov 2025) calls this out as one of the biggest silent context-eaters in production agents — and proposes the same fix the rest of this module has been circling: don't load what you don't need yet.

Eager vs deferred

The default in most frameworks is eager loading — every registered tool's full schema is in the system prompt from turn 1. Simple to wire up, and it works fine when the kit is small. It stops working when the kit grows: 50 tools is enough overhead to materially crowd out conversation history and few-shot examples.

Deferred loading holds the full catalog out-of-context and exposes a single retrieval tool — Anthropic's Tool Search Tool, for example — that fetches the top-k relevant tool schemas per turn. Anthropic reports comparable task accuracy at a fraction of the token cost in their internal testing; OpenAI and other providers ship similar deferred-loading patterns. The honest hedge: "comparable" is workload-dependent, and tasks that need many tools per turn benefit less.
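
A sketch of the deferred pattern under stated assumptions (this is not Anthropic's Tool Search Tool API; the catalog shape, scoring function, and top-k are illustrative):

```python
# Illustrative deferred tool loading: the full catalog stays out of the prompt,
# and a single search step picks the top-k schemas to expose this turn.
# `score` is any relevance function (embeddings, BM25, plain keyword overlap).

def tools_for_turn(user_message: str, catalog: list[dict], score, k: int = 3) -> list[dict]:
    ranked = sorted(catalog,
                    key=lambda schema: score(user_message, schema["description"]),
                    reverse=True)
    return ranked[:k]   # only these k schemas are serialized into this turn's prompt

# Eager:    all ~50 schemas ride along every turn (~10k tokens of overhead).
# Deferred: the prompt carries one search tool plus the ~3 schemas it surfaced.
```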

The accuracy is the surprise

The intuition that "surely the model picks better when it can see all 50 tools" is the trap. With 50 tool defs in the prompt, the model is sampling its choice over a noisy field — the irrelevant 47 don't disqualify themselves, they just dilute attention across the catalog. Pre-filtering to the top-3 before the model sees them shifts the selection problem out of generation and into retrieval, where it's cheaper and easier to debug.

A second factor pulls in the same direction: when the system prompt is stable across turns, prefix caching makes the second turn onward cheap to compute. That's a cost win, not an attention win — the model still has to attend to those tokens. Deferred loading addresses both at once: smaller prompt and less to attend to.

Try it: In Stage 4, flip Eager → Deferred and watch the token caption drop from ~9,800 to ~1,550. Click any of the example queries: in Eager mode all 50 tool cells stay lit; in Deferred mode the Tool Search runs first, lights only the 3 relevant tools, then picks one. Same answer, ~6× cheaper preamble.

Step 5 takes the third fix from Step 3 — context quarantine — and walks the cost-vs-benefit in depth.

Subagents as Context Isolation

Why isolate context with subagents?

A single window holding everything — research notes, intermediate tool dumps, half-finished plans, the original user question — is the worst place to think. Context quarantine spawns subagents that each see only the slice they need, do focused deep work in their own window, and return a condensed summary to the lead. The lead's window stays small and on-topic; the specialists' windows stay narrow and clean. The tradeoff has two sides, and this step is honest about both.

The same primitive, two lessons

Module 3 (Workflow Patterns) introduced subagents as task decomposition — the orchestrator-workers topology, used to parallelize independent subtasks. This module reaches for the same primitive for a different reason: context isolation. The code shape is identical. The reason to choose it differs:

  • Decomposition lens — "this task has independent subtasks; run them in parallel."
  • Quarantine lens — "this subtask wants a clean window with no contamination from the rest of the conversation."

Many real systems use both lenses at once. The distinction is worth naming because the two lessons stack rather than compete.

What crosses the boundary

The discipline is in the return value. A specialist may consume tens of thousands of tokens — running searches, opening files, taking notes — but only a condensed summary crosses back to the lead. Anthropic's How we built our multi-agent research system (May 2025) reports return summaries in the ~1–2k token range in their internal multi-agent research deployment. The rule: lossy on purpose. The lead doesn't need the specialist's tool history; it needs the answer plus enough provenance to act on it.
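
A sketch of that boundary; run_specialist and summarize stand in for model calls, and the cap is an arbitrary example in the range cited above:

```python
# Illustrative quarantine boundary: the specialist spends its own window freely,
# but only a condensed summary crosses back to the lead.

SUMMARY_CAP_TOKENS = 1_500   # arbitrary cap, in the ~1-2k range mentioned above

def delegate(subtask: str, run_specialist, summarize, count_tokens) -> str:
    raw = run_specialist(subtask)     # may consume tens of thousands of tokens
    summary = summarize(raw, max_tokens=SUMMARY_CAP_TOKENS)
    assert count_tokens(summary) <= SUMMARY_CAP_TOKENS
    return summary                    # the only thing that enters the lead's window
```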

The honest cost

Subagents are not a token-savings pattern. Anthropic reports their multi-agent research system uses roughly 15× the tokens of a single chat in their internal usage — the lead pays for itself plus the sum of every specialist. The pattern wins on focused attention, not on raw token count. Use it when each subtask deserves a clean window of its own — research, planning, deep verification, anything where the noise of the parent conversation would actively hurt the subtask. Skip it for short cheap lookups; the overhead is real and the attention win is small.

This cost shape is workload-dependent. A subagent that runs three tool calls and returns a one-line answer is barely more expensive than an inline call. A subagent that runs forty tool calls before summarizing is genuinely 15× — and worth it precisely because the lead never had to hold those forty results.

Try it: Toggle between the single-context and lead+3-subagents architectures in Stage 5. Compare total tokens against lead-only context tokens — the gap is the cost of quarantine. Click any specialist panel to inspect the raw output and the summary it returned.

When 15× is worth it, when it isn't, and how teams of agents coordinate at scale is the subject of Track B's Multi-Agent in Production module.
