The Agent Loop & State — How an LLM Agent Actually Works
Why Agents?
What is an AI agent?
An AI agent is a program in which a large language model runs tools in a loop to achieve a goal. On each pass through the loop the model reads its current state, picks an action, the runtime executes the action, and the result is written back into the state. Repeat until the goal is reached or a stopping rule fires. That is the entire mechanism — there is no extra "agent algorithm" hiding underneath. Everything else in this track (state, tools, planning, retrieval, evals, security) is a layer on top of this loop.
The three irreducible parts
After two years of refusing to use the word, Simon Willison settled on the definition above in 2025: "an LLM agent runs tools in a loop to achieve a goal." Strip away the marketing and three parts remain:
- A model — the decision-maker. Given the current state, it emits the next action. Each call is a normal next-token forward pass; recall from LLM Internals: Generation that the model is just sampling one token at a time, no magic.
- Tools — the actions on the world. Read a file, query an order database, call an external API, send an email. Tools are how the model reaches outside its own context window.
- A loop — the control flow. Take an action, observe the result, decide what to do next. Without the loop, the model gets exactly one shot — call one tool, return one answer, done.
Drop any one of the three and you no longer have an agent. A model alone is a chatbot. A model plus tools but no loop is a single function call. A loop with no model is a script.
What happens in one tick?
Every iteration of the loop — every tick — has the same four-phase shape. The simulation on the right uses these names exactly:
- Gather context — load the current state into the model's prompt: the user's request, available tools, and any prior observations.
- Decide — one model call. The model emits a structured response: either a tool call (lookup_order(id="ORD-4521")) or a final reply to the user.
- Act — the runtime executes whatever the model picked. If it was a tool call, the tool runs and returns a result. If it was a reply, the reply is sent.
- Observe — the result is written back into state, ready for the next tick to read it.
The loop terminates when the model decides to reply instead of calling another tool, or when a stopping rule (max ticks, max tokens, verifier passing) fires. We'll come back to stopping rules in Module 6.
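That is small enough to sketch in full. The skeleton below is a minimal, illustrative loop — callModel, runTool, and the LoopState shape are hypothetical stand-ins for whatever your provider SDK and tool registry actually give you, not any framework's real API:

// Minimal agent loop — an illustrative sketch, not a production harness.
// callModel and runTool are hypothetical; swap in your provider SDK and tool registry.
type ToolCall = { name: string; args: Record<string, unknown> };
type Decision =
  | { kind: "tool_call"; call: ToolCall }
  | { kind: "reply"; text: string };

type LoopState = {
  messages: { role: string; content: string }[];
  tool_results: { name: string; output: unknown }[];
};

declare function callModel(state: LoopState): Promise<Decision>; // one forward pass: gather + decide
declare function runTool(call: ToolCall): Promise<unknown>;      // act: executes against the world

async function runAgent(state: LoopState, maxTicks = 10): Promise<string> {
  for (let tick = 0; tick < maxTicks; tick++) {
    const decision = await callModel(state);              // Gather + Decide
    if (decision.kind === "reply") return decision.text;  // terminate: model chose to answer
    const output = await runTool(decision.call);          // Act
    // Observe: write the result back into state so the next tick can read it
    state.tool_results.push({ name: decision.call.name, output });
    state.messages.push({ role: "tool", content: JSON.stringify(output) });
  }
  throw new Error("stopping rule fired: max ticks reached");
}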
Why does the loop matter?
A single LLM call is a closed-book exam — the model answers from whatever fits in its context window, then stops. That is fine for "summarize this paragraph," but it falls apart the moment a real task needs a fresh fact, a side effect, or a multi-step plan. Has the order shipped? What's the current price? Has the test suite turned green yet? The model cannot know without running something.
The loop is what makes the difference. In every iteration the agent gathers context, the model decides on a tool call, the runtime acts, and the result is observed back into state for the next iteration. Anthropic's Building Effective Agents (December 2024) draws a sharp line between workflows (LLMs and tools strung together along predefined code paths) and agents (LLMs that "dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks") — that dynamic control is exactly what the loop buys you.
Andrej Karpathy's LLM-OS framing treats the same picture differently: the model is the CPU, the context window is RAM, tools are peripherals, and the harness around them is the operating system. The loop is the scheduler — it keeps handing the CPU work until the task terminates. Step 2 of this module strips the LLM-OS down peripheral by peripheral so you can see what each one buys you.
What you'll see in the simulation
The right pane runs a tiny but real agent: a customer asks "What's the status of my order #ORD-4521?" and one tool, lookup_order(id), is available. Watch the four phases light up in order, the trace log accumulate one line per phase, and the loop terminate after two ticks with one tool call.
Try it: Press Start to watch the loop run end-to-end, or press Step to advance one phase at a time. Notice that tick 1 is a tool call (the model has no answer yet), and tick 2 is the reply (the tool result is now in state, so the model has enough information to terminate). The same four-phase shape — gather, decide, act, observe — repeats every tick of every agent in this track.
Strip Away the Peripherals
What is the LLM-OS framing?
The LLM-OS framing treats a language model the way you'd describe an operating system: the LLM is the kernel, the context window is RAM (working memory the kernel can read this instant), tools are peripherals (the model's hands and eyes — APIs, code interpreters, browsers), and a vector store is the file system (long-term knowledge the kernel can swap into RAM on demand). It is an analogy, not an equation, but it gives you a shopping list for what an "agent" actually contains. Andrej Karpathy proposed it in his 2023 talk Intro to Large Language Models; since then it has become the standard way to draw an agent on a whiteboard.
What can an LLM kernel do on its own?
Less than you'd guess. An LLM with no peripherals is a closed-book exam-taker — it can paraphrase, classify, define, translate, and reason over whatever you fit into the context window, and that is it. Three categories of task break instantly:
- Anything live. "Did my order ship?" needs a database call. The kernel can guess but cannot know.
- Anything across turns. "Use the units I told you earlier" needs durable memory. The kernel forgets the moment the previous prompt rolls out of context.
- Anything from your private corpus. "What does our refund policy say?" needs retrieval. The kernel has read the open web, not your company's policy.md.
In all three cases the failure mode isn't "the model says it doesn't know" — it's "the model fabricates a plausible-sounding answer." That fabrication is the strongest argument for adding peripherals: each one replaces a guess with a fact.
What does each peripheral unlock?
| Peripheral | Capability it adds | Failure if missing |
|---|---|---|
| Tools | Action — call APIs, run code, query databases, send email | Can't reach live systems; guesses |
| Memory | Continuity — durable user prefs, prior decisions, scratchpads | Forgets across turns |
| Vector store | Knowledge — search private docs, ground answers in citations | Hallucinates policy / facts |
The rest of this track is essentially a tour of these peripherals. Tool Use (Module 2) covers the schema and runtime for the action peripheral. Retrieval & RAG (Module 4) covers the vector store. Context Engineering (Module 5) covers what you actually load into the kernel's RAM each tick. The point of this step is to internalize that none of those modules is optional decoration — each one closes a specific failure mode you can see right now in the simulation.
Why is the analogy worth stretching?
Two reasons the OS analogy keeps earning its keep on whiteboards.
First, it forces the question "what's in RAM right now?" at every tick. The context window has a hard cap (typically tens to hundreds of thousands of tokens depending on the model), and everything the kernel can reason about this instant has to fit. A vector store with a billion documents doesn't help if you don't page the right paragraph into RAM before the model decides. That paging logic — what to retrieve, when to summarize, when to drop — is the entire job of context engineering.
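A crude sketch of that paging decision, assuming a hypothetical estimateTokens helper and a fixed budget — real context engineering (Module 5) is far more deliberate about what gets summarized or dropped:

// Sketch: fit as much recent material as possible into a fixed token budget.
// estimateTokens is a hypothetical helper; real systems use the provider's tokenizer.
declare function estimateTokens(text: string): number;

function pageIntoContext(chunks: string[], budget: number): string[] {
  const inRam: string[] = [];
  let used = 0;
  // Walk newest-to-oldest; stop when the budget is spent.
  for (let i = chunks.length - 1; i >= 0; i--) {
    const cost = estimateTokens(chunks[i]);
    if (used + cost > budget) break; // older material simply isn't visible this tick
    inRam.unshift(chunks[i]);
    used += cost;
  }
  return inRam;
}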
Second, it draws the trust boundary in the right place. Tool calls are syscalls: they leave the kernel, hit untrusted territory, and return values the kernel must validate. Treating tool output as just-more-text is the original sin behind prompt injection. We will come back to this in Module 8 (Security), but the framing is set here: the kernel has to assume peripherals can lie.
Why this matters for the loop
Once you see an agent as kernel + peripherals, the loop stops looking mysterious. The model, given the current state, decides which peripheral to invoke next; the runtime invokes it; the result lands back in RAM; the model decides again. The next step ("The State Object") looks at exactly what sits in RAM between ticks — because that bag of variables is what makes the difference between a kernel that remembers and one that forgets.
Try it: All three peripheral toggles start off. Press Start — only the bottom-right "pure-text Q&A" panel succeeds; the other three fail in different ways. Now flip each peripheral on, one at a time, and press Start again. Watch which panel turns green as you add each peripheral. The point isn't that more peripherals are always better — it's that each capability you want from an agent corresponds to a specific peripheral you have to give it.
The State Object
What is agent state?
Agent state is the loop's variable — the typed object the agent reads at the top of every tick and writes back to at the bottom. It usually holds the conversation history, any intermediate tool results, an in-flight todo list, and a next action the model has decided on. State is what makes tick N+1 behave differently from tick N: same model, same tools, but different state — and so a different decision. It is narrower than "memory" (which spans sessions, users, and persistent stores); state is the per-run scratch space the loop mutates in place.
What does the state object hold?
There is no canonical schema, but there is a canonical shape. Most agent frameworks converge on something close to this:
type AgentState = {
messages: Message[]; // user turns, model turns, tool turns
tool_results: ToolResult[]; // raw outputs from the most recent tool calls
todos: string[]; // tasks the agent is tracking this run
next_action: string; // what the runner should execute next
};
LangGraph's state concept makes this explicit: every graph defines a typed State (a TypedDict or Pydantic model) and each node's job is to return a partial update to merge in. OpenAI's Agents SDK calls the equivalent surface a Session — a typed bag the runner threads through every step. Anthropic's effective context engineering writeup describes the same object less formally as "the working set the model sees this turn." The names differ; the role is the same.
Three things you might expect to see here but won't: the system prompt, the tool surface, and the current user input. Those are inputs to the loop, owned by the harness or the prompt-assembly layer above it — they aren't mutated tick by tick. State is specifically the part the loop changes between ticks, which is why it gets its own name.
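A sketch of that merge pattern against the AgentState type above — the shape of the idea, not LangGraph's or any SDK's actual API:

// Sketch: phases return partial updates; the harness folds them into the full state.
type StateUpdate = Partial<AgentState>;

function applyUpdate(state: AgentState, update: StateUpdate): AgentState {
  return {
    ...state,
    ...update,
    // List-shaped fields are appended to rather than overwritten.
    messages: [...state.messages, ...(update.messages ?? [])],
    tool_results: [...state.tool_results, ...(update.tool_results ?? [])],
  };
}

// A Decide phase, for example, returns only the field it changed:
// applyUpdate(state, { next_action: 'lookup_order(id="ORD-4521")' })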
How do the four phases interact with state?
The four-phase tick from the previous step is mostly a story about who reads and who writes:
- Gather reads the whole state — messages, todos, next_action — to assemble the model's prompt.
- Decide reads the conversational context (messages, todos) and writes a new next_action.
- Act reads next_action, executes the tool, and writes the tool's output into tool_results.
- Observe reads tool_results and folds them back into messages and todos so the next tick's Gather sees them.
Drawing those read/write edges is the cheapest way to spot bugs. A phase that writes a field nothing later reads is dead code. A phase that reads a field nothing earlier writes is a hallucination waiting to happen.
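One cheap way to draw those edges in code is to give each phase a signature whose input is a Pick over the state fields it reads and whose output is the slice it writes — a sketch over the AgentState type above, with declarations standing in for real implementations:

// Sketch: each phase's signature documents what it reads (Pick) and what it writes.
type GatherInput  = Pick<AgentState, "messages" | "todos" | "next_action">;
type DecideInput  = Pick<AgentState, "messages" | "todos">;
type ActInput     = Pick<AgentState, "next_action">;
type ObserveInput = Pick<AgentState, "tool_results">;

declare function gather(state: GatherInput): string;                                   // builds the prompt
declare function decide(state: DecideInput): Pick<AgentState, "next_action">;          // writes next_action
declare function act(state: ActInput): Pick<AgentState, "tool_results">;               // writes tool_results
declare function observe(state: ObserveInput): Pick<AgentState, "messages" | "todos">; // folds results back in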
What happens when state is malformed?
Each field exists because a specific phase needs it. Pull a field out and the agent breaks at the first phase that tries to read it:
- Drop next_action and Act has nothing to execute — the agent stalls right after Decide.
- Drop todos and Decide loses its sense of what's still outstanding, so it terminates early or repeats work.
- Drop messages and every Gather call sends the model an empty prompt — it cannot ground its next decision.
This is why state shape and state management get top billing in Anthropic's effective context engineering writeup and LangGraph's low-level concepts docs. The bug is rarely "the model picked the wrong action" — it's usually "the field the model needed wasn't in state when the phase ran." The next step, Inside a Tick, zooms in on what flows through one tick — and shows that most of it vanishes by the time the tick ends.
Try it: Press Start and watch the state object mutate phase-by-phase across three ticks — blue and amber highlight which fields each phase reads vs writes. Then press Reset, click one of the field chips at the top to remove it (e.g. todos), and press Start again. The trace log will tell you exactly which tick the agent breaks on and why. Try removing next_action versus messages — they fail at different phases.
Inside a Tick
What flows through a tick that doesn't survive it?
Most of what happens inside an agent tick is ephemeral. Gather assembles a prompt blob — system message, tool list, recent history, current user turn — and sends it to the model. Decide receives a raw model output — XML-tagged or JSON, depending on the provider — and parses it. Act dispatches a tool-call payload over the network and gets a raw HTTP response back. All four of those artifacts vanish by the time the tick ends. The only thing that lands in the state object is the parsed observation — a small, cleaned record of what the tool returned. State is the survivor; everything else is scratch space.
What does Gather actually build?
Gather's job is to construct the literal prompt the model will see this turn — fresh, every tick, from the current state. There is no conversation hidden inside the model; if a fact isn't in the prompt, the model can't see it. A typical Gather output looks like this:
SYSTEM: You are a customer support agent.
TOOLS: [lookup_order, lookup_customer]
HISTORY:
User: Hi
Agent: Hello! How can I help?
CURRENT:
User: What's the status of order #ORD-4521?
This blob is sent to the provider's API and is gone the moment the model returns. The next tick's Gather will rebuild a similar — but not identical — prompt from the updated state. The reason this matters: every byte you choose to include is competing for the model's finite context window, which is the topic of Module 5 (Context Engineering).
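Here is a sketch of that assembly step — a function that rebuilds the blob above from state each tick. The flat-string layout is illustrative; real providers accept structured message arrays rather than one big string:

// Sketch: Gather rebuilds the prompt from state, fresh, every tick.
type ChatTurn = { role: "user" | "agent"; content: string };

function gatherPrompt(systemPrompt: string, tools: string[], turns: ChatTurn[]): string {
  const history = turns.slice(0, -1);
  const current = turns[turns.length - 1];
  return [
    `SYSTEM: ${systemPrompt}`,
    `TOOLS: [${tools.join(", ")}]`,
    "HISTORY:",
    ...history.map((t) => `${t.role === "user" ? "User" : "Agent"}: ${t.content}`),
    "CURRENT:",
    `User: ${current.content}`,
  ].join("\n");
}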
What does the model actually return?
Decide's input is the prompt; its output is whatever the model emits, in whatever wire format the provider chose. Anthropic's tool-use API returns tool calls as structured tool_use content blocks (JSON) attached to the assistant message — that's what your SDK hands you. Underneath, Claude is trained to emit XML-tagged regions that the API structures for you; we'll show that underlying form below, since the unwrapped XML is the shape a parser actually has to handle when something goes wrong:
<thinking>
The user wants the status of order ORD-4521.
I should call lookup_order with that ID.
</thinking>
<tool_use>
<name>lookup_order</name>
<input>{"id": "ORD-4521"}</input>
</tool_use>
OpenAI's function-calling API emits a roughly equivalent JSON-shaped tool_calls array attached to the assistant message. Either way, this raw blob is also ephemeral — the harness immediately runs it through a parser to extract a structured (tool_name, arguments) pair, and the original text is discarded. The model's "intent" only exists, in any usable form, inside that parsed pair.
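A sketch of that extraction step, assuming an Anthropic-style array of content blocks — the types below are trimmed to the fields this example needs (real responses also carry ids, usage, stop reasons, and more):

// Sketch: extract the (tool_name, arguments) pair from a tool_use content block.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string; input: Record<string, unknown> };

type ParsedToolCall = { toolName: string; args: Record<string, unknown> };

function extractToolCall(content: ContentBlock[]): ParsedToolCall | null {
  const block = content.find((b) => b.type === "tool_use");
  if (!block || block.type !== "tool_use") return null; // no tool call: the model chose to reply
  return { toolName: block.name, args: block.input };
}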
What does Act actually dispatch?
Act takes the parsed tool call and runs it against the world. For a lookup_order tool backed by an internal HTTP service, Act's actual operation is a network call:
→ POST /api/orders/ORD-4521
← 200 OK
{
"id": "ORD-4521",
"status": "shipped",
"carrier": "UPS",
"tracking": "1Z9X4-A7F2",
"_raw_db_row": "0x4af2…"
}
Both the dispatched payload and the raw HTTP response are ephemeral. The harness parses the response — usually pulling out a small, clean subset — and writes that into state as the observation. The _raw_db_row field, the headers, the timing — all of it gets dropped on the floor. Only the observation survives, because only the observation will be in state when the next Gather rebuilds the prompt.
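Sketched in code, Act is a fetch plus a deliberate narrowing — the endpoint, field names, and Observation shape below are the hypothetical ones from the trace above:

// Sketch: Act dispatches the parsed tool call, then keeps only the fields the agent needs.
type Observation = { id: string; status: string; carrier?: string; tracking?: string };

async function actLookupOrder(orderId: string): Promise<Observation> {
  const res = await fetch(`https://internal.example/api/orders/${orderId}`);
  if (!res.ok) throw new Error(`lookup_order failed: ${res.status}`);
  const raw = await res.json(); // headers, timing, _raw_db_row — all of this gets dropped
  // Only this distilled observation survives into state.
  return { id: raw.id, status: raw.status, carrier: raw.carrier, tracking: raw.tracking };
}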
Why is the parser the agent's most fragile component?
Here's the load-bearing observation: the parser sits between two ephemeral artifacts and a persistent one. If the model's raw output is malformed in any way — a missing name field, broken JSON, an unclosed XML tag, an apology paragraph wrapped around the tool call — the parser fails to extract a tool call. Act has nothing to dispatch. Observe has nothing to write. The tick produces no state mutation, regardless of how good the model's reasoning was. The reasoning lived only in the raw output blob, and that blob is now gone.
This is why structured outputs are the next thing every agent stack reaches for. The simulation's parser-broken toggle shows the stall directly; the durable fix is to push the contract down to the model itself — either with a provider's native structured-output mode, or with a typed-schema layer like the Instructor library on top of Pydantic models. The model is constrained to emit a payload that will parse, so the bridge between Decide and Act stops being a guess. The next module — Tool Use & Function Calling — picks up exactly this seam; for now, the takeaway is just that the parser is the agent's load-bearing seam, and the seam is fragile by default.
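A TypeScript analogue of that typed-schema idea is to validate the parsed arguments against the tool's declared schema before Act dispatches anything — sketched here with the zod library; the ORD-#### format check is just this module's example convention, not a real constraint:

import { z } from "zod";

// Sketch: validate the model's arguments against the tool's schema before dispatching.
// A failure here is caught at the seam — not as a 500 from the order service.
const lookupOrderArgs = z.object({
  id: z.string().regex(/^ORD-\d+$/),
});

function validateToolCall(toolName: string, args: unknown) {
  if (toolName !== "lookup_order") return { ok: false as const, error: `unknown tool: ${toolName}` };
  const parsed = lookupOrderArgs.safeParse(args);
  return parsed.success
    ? { ok: true as const, args: parsed.data }
    : { ok: false as const, error: parsed.error.message };
}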
Try it: Leave the toggle on ✓ Parser strict and press ▶ Play tick — you'll watch the four artifacts fade in, each tagged ephemeral · vanishes except the final parsed observation tagged persists in state. Then click the toggle to ✗ Parser broken and play again. The raw model output still arrives at Decide, but Act and Observe both fail — no tool dispatched, nothing written to state. The footer flips from "only the parsed observation" to "nothing — loop stalled." Same model, same reasoning, same prompt; one missing field in the wire format wipes the entire tick.
Why Workflows First?
What is the difference between a workflow and an agent?
This is the most consequential decision in agent design, and Anthropic's Building effective agents draws the line as cleanly as anyone has:
Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
Agents are systems where LLMs dynamically direct their own processes and tool usage.
A workflow is a DAG you wrote in advance — classify → route → lookup → format. The model fills in slots; the control flow is yours. An agent is the loop from the previous steps: the model picks each tool turn-by-turn, and the harness keeps calling the loop until the model says "done." The workflow is a recipe; the agent is a cook.
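The difference shows up directly in who owns the control flow. A sketch with hypothetical classify, lookupOrder, format, and runAgent helpers:

// Sketch: the workflow's path is written in advance; the agent's path is chosen tick by tick.
declare function classify(text: string): "order_status" | "other";
declare function lookupOrder(id: string): Promise<string>;
declare function format(result: string): string;
declare function runAgent(goal: string): Promise<string>; // the loop from earlier in this module

// Workflow: you wrote the DAG; the model only fills in slots.
async function workflow(userText: string, orderId: string): Promise<string> {
  if (classify(userText) !== "order_status") return "Sorry, I can only check order status.";
  return format(await lookupOrder(orderId));
}

// Agent: the model picks each step; the harness just keeps looping until it says "done".
async function agent(userText: string): Promise<string> {
  return runAgent(userText);
}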
Why are workflows the default?
Anthropic's piece is unambiguous: start with workflows, escalate to agents only when you need to. Three reasons, all visible the moment you put the two side by side.
- Cheaper. A workflow makes the minimum number of model calls the task needs — usually one or two. An agent loops until the model decides it's finished, and each tick rebuilds the prompt from state. Anthropic notes in the same writeup that agents "typically use more tokens than chat" in their internal usage; in practice the multiplier on simple tasks is often several×, sometimes an order of magnitude.
- Debuggable. When a workflow breaks, you read the code path that ran. When an agent breaks, you replay the loop trace and reason about why the model picked the tool it did on tick 3.
- Predictable. Same input, same path, same output. Workflows fail in named places — the classifier returned an unknown label, the lookup hit a 404 — and those places are the right places to put metrics, retries, and human review.
The trap is reaching for an agent loop because the word "agent" is in the air. Most production systems labelled "agentic" are workflows with a model call inside one of the boxes.
When should you escalate to an agent?
There are exactly three signals that justify the escalation, and they are the cases where a workflow physically cannot work:
- The number of steps is unknown ahead of time. You can't draw a DAG if you don't know how many nodes it has.
- The path differs per input in ways code can't enumerate. If branching requires understanding the input rather than matching it, a router won't get you there.
- The scope is open-ended. Research, debugging, and planning don't have a fixed shape — the model has to decide, in flight, which sub-question to chase next.
Concretely: an FAQ bot is a workflow. A ticket triage system is usually a workflow with one model-shaped box (classify intent) and predictable downstream branches. A "find the regression that broke CI" assistant is an agent — the path through the codebase isn't knowable in advance, and that's the whole job.
How does OpenAI frame the same decision?
OpenAI's writing on agentic systems frames the same choice as a spectrum rather than a binary, and OpenAI's Practices for Governing Agentic AI Systems whitepaper sketches roughly four dimensions of "agenticness":
- Goal complexity — single well-defined output vs an open-ended objective.
- Environment complexity — fixed schema vs an evolving world the system has to interpret.
- Adaptability — the same plan every time vs a plan that has to change mid-run.
- Independent execution — supervised step-by-step vs runs unattended for many turns.
A task that's high on all four belongs in an agent loop. A task that's low on all four belongs in a workflow. Most real systems sit somewhere in between — which usually means a workflow with one carefully-bounded agent step inside it, not a single end-to-end agent.
The next step zooms back in to the harness itself — the six pieces every agent loop needs once you've decided you actually need one.
Try it: The right pane shows the same task running through both lanes. Start on Answer FAQ from a known KB and press ▶ Play — both lanes finish, but watch the token counters: the agent burns several × more tokens to produce the same answer the workflow already had. Now switch to Debug a failing multi-file build and play again. The workflow lane runs out of patterns to match and lands on a red ✗ no path matches node; the agent finishes in four ticks. Triage sits in between — the workflow handles the common case, but you can see where an open-ended intent would slip past it.
The Anatomy of a Harness
What is an agent harness?
An agent harness is the runtime code that wraps a model into a working agent. It is everything that isn't the model: the loop driver, the state object, the tool registry, the policy that decides what's allowed, and the machinery that lets the run survive a crash. The model decides; the harness keeps everything else true. Take the model out and you have an empty harness; take the harness out and you have a chatbot.
The mental model that's stuck for a lot of practitioners comes from Phil Schmid's The importance of Agent Harness in 2026:
The Model is the CPU — raw processing power. The Context Window is the RAM — limited, volatile working memory. The Agent Harness is the Operating System — curates context, handles the boot sequence, provides standard drivers (tool handling). The Agent is the Application — the specific user logic running on top of the OS.
In that picture, the harness is the layer most people skip when they're prototyping and the layer that determines whether the thing survives contact with production.
What does the harness actually own?
In practice, you'll find these six responsibilities in every working agent harness, even when some are implicit. Drop any one and the run breaks in a specific, observable way — that's what the right pane lets you poke at.
| Component | What it owns | Failure mode if missing |
|---|---|---|
| Model | Which model to call, with what params (temperature, max tokens, tool choice). | No decision engine — nothing chooses the next action. |
| State | The state object, where it lives, how it's serialized between turns. | No memory between ticks — the model re-asks the same question forever. |
| Tools | The tool registry: schemas, dispatch, return-shape contracts, permissions. | The model can decide to call lookup_order but nothing executes the call. |
| Policy | What the agent is allowed to do, when to ask for approval, hard caps on cost / calls / time. | Unbounded loops, runaway cost, or unsafe actions reaching production. |
| Runner | The loop itself — error handling, retries, timeouts, the stop condition. | Single-shot only. The model gets one call and the run ends. |
| Checkpoint | How progress survives crashes, restarts, and long-running work. | Any crash erases everything — N ticks of work, gone. |
The split is roughly: Model + Tools are what the agent is made of, State + Checkpoint are what it remembers, Runner + Policy are what controls when it stops.
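Written down as a single type, the six slots look roughly like this — an illustrative sketch, not any framework's real interface:

// Sketch: the six harness responsibilities as one configuration type. Names are illustrative.
interface Harness<State> {
  model: { name: string; temperature: number; maxTokens: number };              // what decides
  tools: Record<string, (args: Record<string, unknown>) => Promise<unknown>>;   // what acts
  state: State;                                                                 // what it remembers this run
  checkpoint: { save(s: State): Promise<void>; load(): Promise<State | null> }; // what survives a crash
  policy: { maxTicks: number; maxCostUsd: number; requiresApproval(tool: string): boolean }; // what's allowed
  runner: (h: Harness<State>) => Promise<string>;                               // the loop itself
}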
Are these always six separate pieces?
Rarely. In every framework that ships an "agent runtime" today, all six responsibilities are present, but the boundaries shift:
- Claude Agent SDK — runner, tool registry, and a default state shape are bundled into the SDK; policy lives in permission_mode + allowed_tools; checkpointing is the JSONL session file on your filesystem.
- OpenAI Agents SDK — the Agent class is the runner; tools are decorated Python functions; state is the conversation thread; policy lives in handoffs and guardrails.
- LangGraph — the graph itself is the runner and policy combined; state is a typed dict you define explicitly; checkpointing is a first-class concept (in-memory, SQLite, Postgres backends).
When you write an agent loop from scratch, the trap is forgetting the boring two — policy and checkpoint. Demos run fine without them. Production runs do not.
Try it: All six toggles are on by default, so press ▶ Play first to see the happy path — four ticks, then ✓ Done. Now toggle Checkpoint off and play again: three ticks of work, then 💥 crash, run starts over from scratch. Toggle Policy off: the agent loops past tick 47 with no call cap. Toggle Tools off: the model wants to call lookup_order but nothing dispatches it. Each off-toggle is a different production-scale failure mode in miniature.
Where this leaves Module 1
Five steps ago this module started with a single claim: an agent is a model running tools in a loop. The four steps after that filled in what each word means — the peripherals the model needs around it, the state object it carries between turns, the four-phase tick that runs every iteration, and the workflow-vs-agent decision that should happen before any of this. This step gave the loop a body — the harness — and a list of the six things that body has to do.
Everything from here builds on this picture. Module 2 — Tool Use & Function Calling zooms into the Tools slot: schemas, the agent–computer interface, structured outputs, MCP, and the failure modes of bad tool design. The rest of the track refines the other slots one by one.
Module 1 takeaway: an agent is a loop plus mutable state, autonomy is just how much of the loop you let the model control, and the harness is the six concrete pieces that hold the whole thing up. Everything that follows in this track is a deeper look at one of those pieces.