Layered Guardrails for AI Agents
Defense-in-Depth for Agents
What is defense-in-depth for agents?
Defense-in-depth for an agent is a stack of independent filters — an input filter that runs before the model sees the request, a policy gate that runs around every tool call, and an output filter that runs before any response leaves for the user. Each layer is its own model or rule with its own failure modes, and the system is engineered so any one of them can be wrong without the whole stack failing closed. Module 1 made the harness survive crashes. Module 2 made the surviving run legible. This module makes the failure invisible to the user — and the load-bearing word is layered. No single filter is the story.
The pattern is borrowed from older corners of security engineering. The OWASP LLM Top 10 frames LLM safety as a set of controls — input handling, output handling, supply chain, sensitive information disclosure, prompt injection — that you stack together, on the theory that any one control will eventually be bypassed and a second control needs to be there when it is. The NIST AI Risk Management Framework makes the same case more abstractly: AI systems need overlapping, independent controls because the failure modes are correlated within a single model and uncorrelated across different ones.
Why one LLM judge is a single point of failure
The tempting shortcut, when you first ship an agent, is to bolt a single "is this output safe?" prompt onto the end of the loop and call it a guardrail. The model that just produced the output also judges it. One model, one prompt template, one set of weights. If the request that broke the agent on the way in also broke the agent's judgment on the way out, the judge will rubber-stamp the bad output. The failure modes are correlated because they live in the same model.
A working stack avoids that correlation by making the layers independent. The input filter is typically a small classifier (or a regex floor plus a small model), trained or prompted on a narrow task — detect PII, detect injection-shaped patterns, detect off-topic chitchat. The policy gate is a declarative rules engine that knows nothing about the model — it inspects tool-call arguments against an allow-list. The output filter is yet another set of narrow detectors, each tuned to one failure shape. When the same jailbreak attempt that confused the main model arrives at a small PII classifier with a totally different prompt and training distribution, it is much less likely to slip through both. Correlation is the enemy; independence is the design goal.
The pithy version is the one Simon Willison's prompt-injection writing keeps coming back to: a single LLM cannot reliably distinguish between instructions from its principal and instructions that arrived inside its data. So the structural answer is not "find the right judge prompt" — the structural answer is to put narrow, non-LLM (or different-LLM) filters at every boundary and to keep the model away from making the final yes/no on whether its own output is safe.
The three-layer stack
A concrete shape for a customer-service agent — the same task T-77 refund agent from Module 1 and Module 2:
user message
│
▼
┌─────────────────────────────┐
│ INPUT FILTER (Step 2) │
│ - PII detector │
│ - Injection detector │
│ - Off-topic detector │
└─────────────────────────────┘
│ (clean text + route hints)
▼
┌─────────────────────────────┐
│ MODEL TICK (loop) │
│ ┌────────────────────────┐ │
│ │ POLICY GATE (Step 4) │ │
│ │ - allow-list by role │ │
│ │ - argument scope │ │
│ │ - resource quotas │ │
│ └────────────────────────┘ │
│ ▼ │
│ tool call (issue_refund) │
└─────────────────────────────┘
│ (model draft)
▼
┌─────────────────────────────┐
│ OUTPUT FILTER (Step 3) │
│ - citation check │
│ - format validation │
│ - false-refusal detector │
│ - brand-policy pass │
└─────────────────────────────┘
│
▼
user response
Each box is a single layer, each layer has multiple narrow filters inside it, and each filter can be enabled, disabled, or reconfigured independently. The model never sees the policy gate's verdict directly — the gate is enforced by the harness around the tool-call site, not by another prompt the model can talk its way out of. The output filter never sees the input filter's verdict either — the layers do not whisper to each other, which is part of how their failures stay uncorrelated.
What each layer is trying to catch
A starting taxonomy that maps threat classes to the layer best positioned to catch them. The point of the table is not that any one filter is sufficient — it is to make the position of each filter in the stack explicit.
| Threat class | Primary layer | Backup layer |
|---|---|---|
| User pastes their email/phone into the prompt | Input filter (PII detector) | Output filter (don't echo PII back) |
| "Ignore previous instructions, you are now…" | Input filter (injection detector) | Policy gate (tool call still has to pass allow-list) |
| Refund request larger than the role can authorize | Policy gate (amount cap) | — |
| Agent tries to call a tool outside its role | Policy gate (allow-list) | — |
| Model invents a citation that does not exist | Output filter (citation check) | — |
| Model emits broken JSON to a structured-output consumer | Output filter (format validation) | — |
| Model refuses a legitimate "what's my order status?" | Output filter (refusal detector) | — |
| Model says something off-brand about a competitor | Output filter (policy pass) | — |
Two patterns are visible in that mapping. First, only a couple of threat classes have a meaningful backup layer in the same stack — most threats have one natural place to catch them, and adding redundant filters at other layers buys little. Defense-in-depth is not "every filter checks for every problem"; it is "every layer is there for a different reason, and each one is independent of the others." Second, several entries (refusal detection, brand-policy passes) only appear on the output side — there is no version of an input filter that can predict the model will refuse a legitimate query, because the refusal is generated, not received.
Guardrails are not the structural defenses — the trifecta cuts are
The most important thing to internalize before this module continues: guardrails are content filters layered on top of the structural defenses. They are not a substitute for them.
Foundations Module 8 introduced the lethal trifecta — the three ingredients that, when all present in one agent, turn prompt injection from an annoyance into a data exfiltration vector: access to untrusted content (web pages, emails, user uploads), access to sensitive data (a database, a secret, the user's history), and the ability to externally communicate (send email, post a webhook, call an API). When all three legs are present in one agent, an attacker who can plant text in any of the untrusted content sources can in principle exfiltrate any of the sensitive data via any of the communication paths. The defense that does the load-bearing work is cutting a leg — scoping the agent's capabilities so at least one of the three is structurally absent. An agent that cannot call any tool that sends data outbound cannot be turned into an exfiltrator no matter how thoroughly its input filter is jailbroken.
Guardrails sit on top of that. Even after the trifecta has been cut, the remaining agent still needs input handling, policy enforcement around the tool calls it is allowed to make, and output filtering before its responses reach the user. But the guardrails alone — with all three legs of the trifecta intact — are a hope and a prayer, not a defense. The structural choice (which tools does this agent have access to? which data sources? which outbound channels?) is the one a security review should be loudest about. The filter stack catches what slips through after that choice is made well.
Where this module fits in the three layers
A brief recap to close the loop with the previous two modules. Three layers of a production agent:
- Durability (Module 1) — idempotency, checkpoints, retry policy. The harness survives crashes and partial writes without doing the wrong thing twice.
- Visibility (Module 2) — span-per-tick traces, structured logs with PII redaction, replay, the five golden metrics, alerting. You can see inside any run and tell when the agent regresses.
- Safety (this module) — input filters, policy enforcement, output filters, fail-safe defaults. Failures that happen do not reach the user with their teeth showing.
The three compose, and they compose in the order written. Without durability, observability is debugging on a process that keeps disappearing. Without observability, safety is a black box that silently degrades into "no filters at all" and you cannot tell. Without safety, the durable, well-instrumented agent is a reliable production system for shipping the wrong thing to your users. Each layer is necessary; none is sufficient on its own.
Steps 2 through 4 unpack the three boxes of the stack — input filter, output filter, policy gate. Step 5 closes the module with the question every safety system eventually has to answer in production: when a filter itself goes down, does the request get blocked or let through, and how do you make that decision per route instead of per codebase.
Try it: Stage 1 of the right pane is the three-layer stack with four scenarios — a benign baseline that passes through cleanly, a prompt injection caught at the input filter, an out-of-scope tool call caught at the policy gate, and a brand-unsafe phrase caught at the output filter. Run each one and watch which layer catches it; then disable layers and watch the threat reach the user verbatim. The point is not that any one filter is doing the job; it is that the catch-rate per layer tells a different story for each threat class, and the stack as a whole is what holds.
Input Filters
What goes through an input filter?
An input filter is the first independent layer of the stack — the one that runs before the model sees a single token of the incoming request. The minimum useful version has three narrow detectors, each its own model or rule with its own failure modes: a PII detector that blocks or redacts personal data the user pasted in, an injection detector that catches prompt-injection-shaped instructions hidden in the request, and an off-topic detector that catches requests outside the agent's actual scope. Each one is a different kind of classifier; none of them is the model that will later answer the request. That independence is the whole point.
If the input layer is built as one big prompt that asks a single LLM "is this safe and on-topic?", you have rebuilt the single point of failure that Step 1 spent its budget warning against. A jailbreak instruction that confuses the main model into ignoring its system prompt is the same kind of instruction that can confuse a generic "is this safe?" judge into saying yes. The whole stack is independent layers; this layer is itself independent narrow filters. Same principle, one level down.
PII detection: a regex floor plus an optional model pass
PII is the easiest of the three to start with because the regex floor catches a lot for almost no cost. Emails, phone numbers in the common formats your user base uses, payment card numbers (run the Luhn check before flagging to cut false positives), national-ID patterns per country. The same redaction kit that Module 2 Step 2 covered for the logging boundary applies here at the input boundary — and applies first, because once unredacted PII is inside the model's context window it has already been touched by the model and the prompt is at least theoretically vulnerable to exfiltration via output. Redacting at the input boundary, not just at the logging boundary, keeps the unredacted form out of the conversation in the first place.
The regex floor misses the long tail. "the user said his email is jane at example dot com" sails past every email regex, and a one-line natural-language description of a credit card ("the one ending in 4242, expiring next March") is invisible to Luhn. The pragmatic escalation for teams that handle a lot of free-text PII is a small model-based redactor as a second pass — fast enough to run on every input, narrow enough that its failure modes do not correlate with the main agent model's, slow enough that you do not run it unless the regex floor or some other heuristic flagged the request as PII-likely. The regex catches the obvious for ~free; the model redactor catches the long tail at a real but bounded cost.
A useful Module 2 callback: every redaction event gets logged as an attribute on the input span, with the redaction type (email, phone, payment_hint) but not the redacted content. The same five-golden-metrics dashboard from Module 2 Step 4 then surfaces "PII redaction rate" as a leading indicator — a sudden jump usually means a single misbehaving upstream client is now logging customer chat into your agent prompts.
Injection detection: classify shapes, route to a cautious model
Prompt injection is the threat class Simon Willison's prompt-injection writing has been mapping since 2022, and the executive summary is uncomfortable: a single LLM cannot reliably distinguish between instructions from its operator and instructions that arrived inside its data. So the input filter is not trying to make the model immune — it is trying to reduce the leverage an injection attempt has by the time the model sees it.
A workable injection detector is a small classifier (sometimes literally a few-shot prompt to a tiny model, sometimes a fine-tuned classifier head) that scores requests on a few axes: "does this look like a meta-instruction to the model?", "does this look like an attempt to override the system prompt?", "does it match known jailbreak templates?", "is there text formatted to look like a system message?". The output is not a clean yes/no — it is a score and a category, and what you do with the score depends on the route.
Three actions worth keeping in the toolbox:
- Refuse. For high-confidence detections on a high-stakes route, the right move is a polite refusal that never reaches the main model at all. The model cannot be confused by an instruction it never saw.
- Sandbox. For medium-confidence detections, the harness can strip or quote the suspicious section before passing it to the model (
<<UNTRUSTED>>wrappers, structured inputs that separate the "instruction" channel from the "data" channel). The model still sees the request, but the model's prompt is now teaching it to treat the suspicious section as data. - Route to a cautious model. For low-confidence detections on a route that mostly needs to stay open, the harness can hand the request off to a "cautious" model variant — same family, different system prompt, much narrower scope, stricter output filter on the way out. The injection attempt still reaches a model, but a model with much less leverage to misuse.
The "cautious model" pattern is worth dwelling on because it is the lever most teams under-use. The cautious variant is the same model with a much more restrictive system prompt: it cannot call destructive tools, cannot call tools that move money, cannot send outbound emails, and has an output filter tuned for refusals. An injection attempt that confuses it into "doing what the attacker says" finds itself in an agent with very little it is allowed to do. The trick is to keep the capabilities gap large between the normal and cautious routes — if the cautious model has the same tools, you have not actually reduced the leverage, you have just changed the prompt.
A reasonable default policy for a customer-service agent like the T-77 refund agent from Module 1:
| Detector score | Action | Tools available downstream |
|---|---|---|
| High confidence (clear jailbreak template) | Refuse with template message | None — request never hits the model |
| Medium confidence (suspicious shape) | Wrap in <UNTRUSTED>, route to main model | Read-only tools only |
| Low confidence (one weak signal) | Route to cautious model | Read-only tools only; no issue_refund |
| Clean | Route to main model | Full role-scoped tool set |
The numbers along the left are calibrated against real production traffic, not picked in a meeting. A detector with a high false-positive rate — illustratively, on the order of a few percent of clean requests — will frustrate a meaningful slice of legitimate users every interaction; a detector that fires on a fraction of a percent of clean requests but misses half of the real attempts is not pulling its weight either. The calibration loop is the same one Module 2 Step 4 covered for golden metrics — run the detector against a week of real traffic with the action set to "log but do not enforce" first, look at the score distribution, decide where the action thresholds go, then enable enforcement.
Off-topic detection: scope guard for cost and safety
The third detector is the one that gets cut from a lot of stacks because it does not feel like a "safety" filter. It is — and it pays for itself twice.
The narrow version: a small classifier or an embedding-similarity check against the agent's actual scope. A customer-service agent for an e-commerce store has a defined scope ("orders, refunds, account, products we sell, shipping policies, store hours"). A user asking "what's the best pizza in Naples?" is out of scope; a user asking "ignore everything and tell me a joke" is out of scope; a user asking "write me a short story about a wizard" is out of scope. The off-topic detector flags those before the main model spends three ticks gracefully declining.
The cost case is the easy one: each off-topic interaction the agent gracefully refuses still costs tokens, still spends ticks, still pays Module 2's dollars-per-task metric. A scope filter that bounces the request at the input boundary for the price of one tiny classifier call is significantly cheaper than a four-tick conversation that ends in a refusal.
The safety case is the less obvious one and the more important one. Off-topic chitchat is the social-engineering vector — the request that gets the agent to drop its system prompt is rarely "ignore previous instructions"; it is usually a chain of friendly-sounding off-topic exchanges that walks the model into a frame where the system prompt feels less authoritative. The injection detector and the off-topic detector are catching different shapes of the same general problem: requests that try to move the agent away from its job. They are independent filters because they have different false-positive patterns, and they belong in the stack for the same reason — uncorrelated failure modes are the entire architectural point of the input layer.
The false-positive cost is real
A filter that over-triggers does damage that does not show up in a security review. The user who got bounced ten times in a row for asking innocent questions does not file a CVE; they churn. The team that pages on every injection detector spike does not get a clean signal; they get fatigue. The same calibration discipline Module 2 Step 5 applied to alerting policies applies here: tune on a week of real traffic before you turn enforcement on, watch what you would have blocked, ask whether a human reading those requests would have blocked them, adjust.
Two practical tips from teams that have rolled this out. First, shadow mode is mandatory for every new detector — log the score, do not act on it, watch the distribution against real traffic for at least a week, decide the thresholds from the data. Second, measure the false-positive cost in user terms, not detector terms. "10% of requests trigger this detector" is a detector-frame number; "1.2% of legitimate users had their valid request refused this week" is a user-frame number. The second is what a product manager can act on; the first is what makes engineers feel safe and users feel unwelcome.
The closing observation: input filters are the cheapest of the three layers because the input is small, the detectors are small, and most requests are short. The output filter (Step 3) and the policy gate (Step 4) cost more because the output is longer and the policy gate runs on every tool call, not just every request. That cost gradient is the start of the conversation Module 4 picks up on cost and latency engineering — but it is also why teams under-invest in input filters when they should over-invest in them. The cheapest layer to harden is the first one.
Try it: Stage 2 of the right pane runs four scenarios — a PII paste (an email and a phone number embedded in an otherwise normal request), a prompt-injection attempt, an off-topic query, and a legitimate refund request — through three toggleable detectors. Disable a detector and watch which scenarios slip through. The point is to feel the independence between detectors: each one is catching a different shape, and turning one off does not just lose one filter — it loses the layer's coverage of one threat class entirely.
Output Filters
What does an output filter catch?
An output filter is the last independent layer between the model's draft response and whatever the user actually sees. The four classes worth its budget: citation hallucination (the model claims to quote a source that does not exist or does not say what is claimed), format violation (the downstream expects structured output and the model emitted broken JSON or the wrong schema), false refusal (the model said "I can't help with that" to a legitimate request the product is meant to handle), and brand-policy violation (the model said something off-tone, off-brand, or comparing competitors). Each is a different kind of failure with a different detector. None of them can be replaced by a perfect input filter, and the reason matters.
The reason an input filter alone cannot do the job is mechanical. The model is a generative system. Its output is a function of the input and the weights, and the weights are the part you cannot inspect from the request. A perfectly benign input — "summarize our refund policy and cite the relevant section of the customer terms" — still has weights between it and the response, and the weights are where hallucination lives. The output filter is where you check that the response is the response you can defend, regardless of what the input looked like. Module 2's traces and logs let you see what the model produced; the output filter is what decides whether that production reaches the user.
Citation checking: the cheapest hallucination defense
When the agent claims to quote a source — "according to RFC 2119", "our terms section 4.2 says", "the order page shows the status as…" — there is almost always a retrievable thing it is citing, and almost always a way to check that the citation actually exists. The check is mechanical: pull the cited passage out of the response, look it up in the retrieved documents or the underlying data source, and verify the words actually appear there. If the citation is "section 4.2 says we refund within 14 days" and the actual section 4.2 of the customer terms says we refund within 30 days, the output filter flags the response before the customer reads "14 days" and acts on it.
Anthropic's discussion of grounded generation makes the broader argument that retrieval-augmented agents need a verification step on the citations they produce, because the model's tendency to confabulate plausible-looking citations is independent of whether retrieval succeeded — the model is happy to invent a section number that sounds right whether or not it had the actual section in its context. The verification step is cheap (a string-match or fuzzy-match against the retrieved context) and the failure case is loud (a customer-facing wrong number).
A concrete shape for the T-77 refund agent:
draft response:
"Per section 4.2 of our terms, refunds are processed within 14 business days."
retrieved context:
doc: customer-terms.md, section 4.2:
"Refunds for eligible orders are processed within 30 calendar days
of the return arriving at our warehouse."
citation check:
- cited section exists in retrieved context: YES
- claim ("14 business days") substring of context: NO
- claim ("30 calendar days") in context: YES
- verdict: FAIL — claim does not match cited source
action: regenerate with a corrective hint
"Your previous response cited 14 business days, but the relevant section
says 30 calendar days. Use the actual text."
The mechanical part — substring or fuzzy match — is fast enough to run on every response. The model that disagrees with itself on the second attempt is rare; the model that gets the number right when the filter pushes back is common.
Format validation: schema is the contract
Most production agents have at least one route where the downstream consumer is a program, not a human. The order-lookup pipeline expects a structured response with order_id, status, last_updated. The downstream router expects a JSON object with an intent field. The handoff to a different agent expects a defined schema. When the model produces output that does not conform — a missing field, a stray comment, a JSON object with a trailing comma — the consumer either crashes or silently drops the request, and either failure mode is the kind of thing that creates a Module 2 alert for the wrong reason ("agent latency p95 spiked" turns out to be "agent is retrying because downstream parser rejected the last three responses").
The defense is to schema-validate the output before it leaves the agent. OpenAI's structured outputs documentation and the JSON Schema spec are the two things to point a new engineer at — the first because most production model APIs now expose a "constrain to this schema" mode that pushes validation into the decoder itself, the second because the schemas you define against your own internal consumers should match the language the rest of your stack already speaks.
When schema constraints are enforced at the decoder, the output filter's job shrinks to a sanity check — the decoder usually returns valid JSON because it had to. When the model is producing JSON without decoder constraints (older or non-supporting APIs, or the schema is too complex to push down), the output filter is doing real work: parse the response, validate against the schema, on failure either return a corrective hint to the model ("your previous response was missing the last_updated field") or fall through to a safe default. The retry-on-format-failure pattern is one of the rare cases where you want to retry the model deterministically — same prompt, same context, plus the explicit error — because the failure mode is purely mechanical and a second draft usually fixes it.
A small gotcha worth surfacing: format validation should never silently fix the response. If the model emitted broken JSON, the right move is to regenerate or fail; the wrong move is to "repair" the JSON with string manipulation and ship it, because you do not actually know what the model was trying to say and you have just put the wrong structured data into a downstream system that trusted it.
False-refusal detection: the brand harm nobody talks about
The threat class that is hardest to take seriously until you see it in production: the model refuses a legitimate request. "I can't help with refund questions, please contact customer support" — emitted by the customer-support agent. "I can't access your order data" — emitted by the agent whose entire job is to access order data. Every false refusal is a customer who thinks the product does not work, and the cost of that perception over six months is bigger than most teams' total budget for the safety stack.
A false-refusal detector is a small classifier — sometimes literally a few-shot prompt to a small model — that takes the agent's draft response and the original user request and asks "is this a refusal, and was the request legitimate?". On a true positive (the agent refused a legitimate question), the action is usually to regenerate with a corrective hint ("the user is asking about their own order, which is in scope for this agent — answer the question"). On a true negative (the agent refused a request it should refuse — abusive language, out-of-scope question, attempted injection), the response goes through.
The honest framing: the helpful AI is the product. A model that refuses too much is failing at being the product, and the failure should be treated with the same seriousness as a model that refuses too little. Most safety teams over-index on the second failure and under-index on the first because the second one shows up in security reviews and the first one shows up in retention metrics, but the dollar impact of false-refusal at scale is usually larger.
A useful Module 2 callback: false-refusal rate is one of the metrics worth wiring into the golden five. The default five from Module 2 Step 4 — success rate, p95 ticks, p95 latency, dollars-per-task, policy-filter block rate — already includes the correct-refusal end of the distribution (block rate). The false-refusal rate is the other end, and a team that has both numbers on the dashboard is the team that catches a model deploy that made the agent too cautious overnight.
Brand-policy pass: the last filter before the user
The fourth filter is the catch-all for content that would embarrass the product without crossing into the territory of the first three. Mentioning a competitor by name in a recommendation. Off-color phrasing in what should be a polished response. Off-tone replies — an apologetic tone where the brand voice is direct, or a casual tone where the brand voice is formal. Sometimes the OWASP LLM Top 10 class of "sensitive information disclosure" lands here too — the agent says something true that the company has chosen not to say publicly.
The shape is again a small classifier, often a per-brand fine-tune or a few-shot prompt with brand-specific examples. The action on a flag is usually "regenerate with a corrective hint" rather than "refuse" — the goal is the right response, not no response. The exception is the small set of hard policy lines (no competitor recommendations, no medical advice from a non-medical product), where refusal is the correct action and the filter is enforcing a product policy that should be loud about being a policy.
The latency cost is real and it composes
Every filter you add costs wall-clock time. A single output filter adds something like 10–50 ms in the realistic case where the filter is a small classifier or a fuzzy-match check (the exact range depends on the filter, the model, the network — these are illustrative numbers, not benchmarks). Four output filters running serially is 40–200 ms of added latency on every response, on top of the model's own latency. For a response that the model produced in 1.5 s, four filters at 50 ms each is a noticeable but tolerable 13% latency tax. For a response the model produced in 200 ms (a fast structured-output call), the same four filters at 50 ms each is a 200% latency tax, and the user experience changes character.
There are three pragmatic mitigations that show up in production stacks, all of which Module 4 will dig into properly:
- Parallel filters where the failure modes are independent. Citation check, format validation, refusal detection, brand pass — none of them depend on each other's verdicts, so you can fan them out in parallel and the wall-clock cost is the max of the four, not the sum. This is the cheapest win.
- Sample-based filtering on low-stakes routes. The brand-policy pass on every response is overkill for an internal route used by twelve engineers; sample a small fraction (in practice, often around 10%, tuned per route) and let the rest through. Stakes determine sample rate, not the other way around.
- Skip filters by route. Not every output filter belongs on every route. The structured-output route does not need a brand-policy pass; the open-ended-chat route does not need format validation. The route table from Module 1 Step 5 is where these decisions live.
The number to internalize is the total tax, not the per-filter cost. Three filters serially at 30 ms each is 90 ms; the same three in parallel is 30 ms; the same three sampled at 50% is an expected 45 ms. Module 4 takes this seriously enough to make it the entire module — but the output-filter stack is where the cost first starts to bite.
Try it: Stage 3 of the right pane runs four scenarios — a hallucinated citation, malformed JSON, a false refusal of a legitimate refund question, and a brand-unsafe phrasing — through four filters with toggleable enablement. Each filter shows a latency badge so you can read the cumulative cost of the layer. Disable the citation check and watch the hallucinated quote sail through; toggle individual filters off to see the bottom-right total shrink. The number that matters is that total, not any single filter — and in production the same picture admits a parallel-execution layer that would replace the sum with the max.
Policy Enforcement
What is agent policy enforcement?
Policy enforcement is the middle layer of the three-filter stack — a declarative gate that runs between the model's decision to call a tool and the tool actually firing. The model emits a tool call (issue_refund(order_id="A-1234", amount=275.00)); the harness pauses the call, looks up the rules for the caller's role and the tool's allow-list, evaluates them against the call's arguments, and either lets the call proceed or rejects it before any side effect happens. The model never sees the verdict directly. The gate is enforced by the harness around the tool-call site, not by another prompt the model can argue with.
The reason the gate is declarative and not a piece of imperative code buried inside each tool's implementation is the same reason large companies separate their security policies from their application code. The same rules — "customer_support cannot issue refunds over $300", "no role can write to the audit log directly", "any tool with delete_ in its name requires an admin principal" — apply across many tools, and the security team that owns the rules should be able to read and audit them without reading source. The pattern is the one Open Policy Agent (OPA) and Cedar were designed for: a policy engine that takes a structured request, evaluates it against a written policy, and returns allow/deny with a reason. Both are battle-tested at large companies for non-agent workloads; both work for the agent case with minor adaptation.
The three rule classes worth implementing first
Most production policy gates earn their keep with three categories of rules. Adding more is easy; getting the first three right is the load-bearing work.
Allow-list by role. The role this agent is running as (customer_support, lead_agent, admin, external_partner) determines the set of tools it is allowed to call. A customer_support agent can call read_order, lookup_customer, issue_refund, send_email. It cannot call delete_account, update_audit_log, or change_user_role — those tools are not in its allow-list, and the gate rejects the call before the tool ever runs. This is the simplest of the three rules and the one most teams get right by default. The thing to watch for is the silent expansion of the allow-list as new tools land — every new tool should explicitly list which roles can call it, and the default for an unlisted role should be deny.
Argument scope. The tool is in the allow-list, but the arguments are out of bounds. The clearest version is the amount cap: customer_support can call issue_refund, but only for amounts up to $300; lead_agent up to $1,000; admin unlimited. (The dollar figures are illustrative — actual caps depend on the product, the regulatory environment, and the company's risk tolerance.) The slightly more subtle version is the identity scope: the order_id the agent is trying to refund must belong to the customer the agent is currently helping. Calling issue_refund(order_id="A-9999") when the conversation context says the customer is helping with order A-1234 should be rejected, because the cross-customer pattern is one of the most common shapes of accidental data exfiltration.
Resource quotas. Per-conversation budgets that prevent runaway loops from doing runaway damage. A per-conversation tool-call budget ("no more than 12 tool calls per task") prevents an agent that has wandered into a degenerate loop from issuing 1,000 reads. A per-user spend budget ("no more than $500 of total refunds per user per day") prevents a single misbehaving conversation from authorizing more than the user's lifetime value. Module 2's traces are what makes these countable — the gate sums up tool-call counts and refund totals from the current conversation's spans before deciding.
A concrete policy for the T-77 refund agent, written in the style of a policy file:
allow:
- role: customer_support
tools: [read_order, lookup_customer, issue_refund, send_email]
issue_refund:
max_amount_usd: 300
order_must_belong_to: conversation.customer_id
- role: lead_agent
tools: [read_order, lookup_customer, issue_refund, send_email,
update_order_status]
issue_refund:
max_amount_usd: 1000
order_must_belong_to: conversation.customer_id
- role: admin
tools: ["*"]
quotas:
per_conversation:
tool_calls_max: 12
refund_total_usd_max: 1000
per_user_per_day:
refund_total_usd_max: 500
The shape matters more than the syntax. The point is that a security reviewer can read this file, ask "what can customer_support do?", and get an answer from the policy itself — not from spelunking through Python or Go code looking for if role == 'customer_support' branches.
Where the role actually comes from
The unspoken assumption in everything above is that the agent has a principal — a clearly-identified role on whose behalf the agent is acting. The role has to come from somewhere, and the choice has security consequences.
The two patterns in practice. Fixed at deploy time: one harness instance is deployed for each role, the role is part of the deployment config, every tool call from that instance inherits the role from the instance. Simple, hard to misuse, and the operational cost is one deployment per role. Inherited from the user's identity: the user authenticates, the agent's principal is set per-request from the authenticated identity, the policy gate looks up the principal's role on every call. More flexible (one agent serves many roles), but the join from "authenticated user" to "role" has to be airtight, because a bug there is a privilege escalation. The OWASP guidance on agent capability scoping treats this as a first-class question: agents should run with the least privilege that lets them do the job, and the principal-to-role mapping is where "least privilege" is operationalized.
The honest take: most teams ship the first pattern (fixed role per deployment) until they have a real reason to need the second, because the second is harder to reason about and easier to get wrong. The first pattern composes well with the route table from Module 1 Step 5 — different routes can run against different harness instances with different roles, and the routing decision (which Module 1 already established) is the same place the principal decision lives.
How the gate composes with Modules 1 and 2
Two specific Module 1 primitives become useful here in ways that are easy to miss.
The idempotency key from Module 1 Step 2 belongs on the tool call as a structured argument that the policy gate sees. The gate can then enforce "no two refunds with the same idempotency key in the same hour" as a quota rule, which catches a particular failure mode the harness alone cannot — a model that has been jailbroken into issuing two refunds with slightly different ids for the same logical action. The idempotency key from the durability layer becomes a join key for the safety layer.
The checkpoint write from Module 1 Step 3 is what makes the per-conversation quotas reliable. The "tool calls used so far in this conversation" counter has to survive a harness restart, which means it has to be in the durable state, which means it is exactly the checkpoint row Module 1 already taught you to write. The quotas are not a separate state machine; they are columns on the conversation's checkpoint row.
Module 2's contribution is observability into the gate itself. Every gate verdict — allow, deny with reason, quota-exceeded — gets its own span attribute on the tool span. The dashboards from Module 2 Step 4 included "policy-filter block rate" as one of the five golden metrics for exactly this reason: a sudden spike in policy-gate blocks usually means either an attack (someone is trying to do something the agent is not allowed to do) or a regression (a recent change made the agent attempt actions outside its scope). Either way, the spike is worth a page, and the per-rule breakdown of which policy fired is the diagnosis path.
A useful side-effect: every blocked call is still a span, which means a blocked refund attempt is visible in trace replay (Module 2 Step 3) the same way a successful refund is. The replay UI for an incident now includes "what the agent tried to do and could not", which is often more useful than "what the agent did" — the attempted action is the bug report.
Dry-run mode: shadow every new rule before enforcing
The pattern that distinguishes mature policy gates from new ones is the explicit support for shadow rules — a new policy is added to the file with a flag that says "evaluate but do not enforce", the gate logs every shadow verdict as a span attribute, the team watches a week of real traffic, and only then is the flag flipped to enforce.
The argument is the same one Module 2 Step 5 made for tuning alerting thresholds: an enforcement that fires more often than expected on real traffic is the failure mode that frustrates legitimate users without anyone noticing in time to prevent the damage. A shadow rule with a logged verdict is a question — "if I had enforced this, would I have blocked these 47 customers? Let me look at the trace replays for each of those 47 and decide whether any of them is a false positive worth fixing" — and the question can be answered before any customer is impacted.
The mechanical shape is a mode field on each rule: mode: enforce blocks the call, mode: shadow logs the verdict and lets the call through, mode: log_only is the same as shadow but explicitly never planned for enforcement (useful for telemetry-only rules). Most production policy stacks ship every new rule in shadow mode for a fixed period — a week is common — before flipping to enforce, and the flip is its own change with its own review.
Where the rule belongs — in the policy file, not in the tool
The last temptation worth naming and resisting: the impulse to put rules into the tool's own code because "the tool knows best what's safe". A issue_refund tool that internally rejects amounts over $300 is one place to put the cap. The policy file is the other. The tool-local version is tempting because it feels close to the action it is protecting — and it is wrong, because the same rule needs to apply across many tools (the amount cap is conceptually about the role's monetary authority, not about the issue_refund tool specifically), and the security review of "what is customer_support allowed to do?" should be answerable from one file, not from twelve tools' internal logic. The OPA and Cedar patterns exist precisely to enforce that separation: the tool runs the action, the policy decides whether the action is allowed, and the two have a clean interface between them. Mixing them works on day one and is a maintenance hazard by month six.
Try it: Stage 4 of the right pane is a tool-call gate simulator — a slider for the refund amount, a dropdown for the caller's role, and three toggleable rule classes (allow-list, amount cap, scope check). Drag the amount past $300 with customer_support selected and watch the cap rule fire; switch to admin and watch the same call pass. Turn the scope check off and notice that the rule disappears from evaluation entirely — on real traffic where the order_id could belong to a different customer's account, that is exactly the gap a malicious or buggy caller walks through. The point is that the rules are additive — each one catches a different class of misuse, and turning any one off leaves a gap the others do not fill.
Fail-Safe vs Fail-Open
When should a filter fail-safe vs fail-open?
Filters fail. Not metaphorically — literally. The injection-detection model has a deploy that breaks its API, the PII redactor times out on a request that hit a pathological input, the citation checker's downstream document store is offline for an hour. Steps 2 through 4 talked about each filter as if it were always available; this step is about the choice every safety system eventually has to make in production — when a filter itself fails, does the request get blocked or let through? That choice is the difference between fail-safe (block on filter error, default to no answer) and fail-open (allow on filter error, default to the model's answer with no filter applied). It is a one-line decision per filter, per route, and it has surprisingly large consequences.
The wrong way to think about this is "we should be safe by default, so everything is fail-safe". The right way is to remember that unavailability is itself a harm — a customer-service agent that refuses to answer for four hours because its citation checker is down is also failing the user, just in a way the security team is less likely to count. The correct answer is per-route, sometimes per-filter, and it lives in the policy table — not in the code that implements the filter.
Fail-safe: the rule of thumb for irreversible actions
The fail-safe default is the right answer when the cost of letting a single bad request through is much larger than the cost of being unavailable. The diagnostic question is usually: can the customer undo this with one click? If yes, fail-open is at least a candidate. If no, fail-safe is the answer.
Irreversible actions are the canonical fail-safe case. Money movement (issuing a refund, charging a card, sending a payout), regulated content (a medical-product answer, a legal-domain claim, anything covered by HIPAA, PCI-DSS, GDPR data-handling requirements, or an industry regulator), destructive operations (deleting an account, archiving a customer's data), and outbound communication where the recipient cannot be unsubscribed retroactively (a one-shot email to a third party, a webhook to an integration). The cost of "the refund agent was down for four hours" is much smaller than the cost of "the refund agent issued a $50,000 refund because the policy gate could not verify the amount cap and let the call through". One is a status-page incident; the other is a regulator letter and a finance reconciliation that takes a quarter.
The honest framing is the one the NIST AI Risk Management Framework keeps coming back to: AI systems should have explicit failure-mode policies that match the stakes of the action, and the default for high-stakes actions should be that the system stops working safely rather than working unsafely. The same argument applies to every filter in the stack — if the filter cannot run, the actions it was protecting should not run either.
Fail-open: the rule of thumb for low-stakes assistance
The fail-open default is the right answer when the cost of being unavailable is larger than the cost of letting an occasional bad response through. The clearest examples are low-stakes assistive routes — a FAQ chatbot that helps users find the right documentation page, a recommendation widget on a marketing page, a "did you mean…" helper in a search box. The cost of "the FAQ chatbot was unhelpful for the last forty minutes" is real but bounded; the cost of "the FAQ chatbot was completely unavailable for the last forty minutes" is bigger, because the unavailability itself looks worse than a few imperfect answers would have.
The second class of fail-open candidates is the filter that adds polish, not safety. The false-refusal detector from Step 3 is a fail-open candidate on most routes — if the detector is down, the worst that happens is some legitimate questions get refused, which is the same outcome you would get on a slightly grumpy day with the detector running. The brand-policy pass is sometimes a fail-open candidate on internal routes — if the brand-policy classifier is down, an internal user might see slightly off-brand phrasing for an hour, which is not the kind of thing a security review would call a P0.
The danger is letting "fail-open is convenient" leak from the routes where it makes sense to the routes where it does not. The discipline that prevents the leak is making the choice explicit per route, in a policy table that someone has to review, rather than as a default fallthrough in the filter's own code.
Per-route, in the policy table, not in the codebase
The architectural point that ties Steps 2 through 4 together: the fail-safe / fail-open choice is route-level configuration, not filter-level code. Same harness, same set of filters, different routes — and the routes have different failure-mode policies because they have different stakes. The route table is owned by whoever owns the security policy; the engineers own the harness and the filters; the two have a clean interface between them.
A workable shape for the T-77 refund agent that has been the running example since Module 1:
| Route | PII filter | Injection filter | Policy gate | Citation check | Brand pass |
|---|---|---|---|---|---|
| refund-processing (money movement) | fail-safe | fail-safe | fail-safe | fail-safe | fail-open |
| order-status-lookup (read-only) | fail-safe | fail-safe | fail-safe | fail-open | fail-open |
| faq-chatbot (low stakes) | fail-safe | fail-open | n/a | fail-open | fail-open |
| internal-debug-helper (engineers only) | fail-open | fail-open | fail-safe | fail-open | fail-open |
A few things to notice. The PII filter is fail-safe on every customer-facing route — even on the FAQ chatbot — because the cost of leaking a customer's email in a log is not bounded by the route's stakes; it is bounded by the data-protection regulation. The policy gate is fail-safe on every route that calls a destructive tool — the refund-processing route, the read-only route (because reads against the wrong customer's order are also a data-protection event), the internal-debug route. The citation check and the brand pass are fail-open on the low-stakes routes because their downside is "imperfect response" not "regulated incident". The composition is per-cell, not per-row — within a single route, some filters fail-safe and others fail-open, and that mix is what makes the table the right place for the configuration.
Composability: the route is the sum of its filter decisions
The route's overall behavior emerges from the per-filter decisions. A "Route A" that is fail-safe on every filter is the strict refund-processing route — the request gets dropped (with a clean error message to the user) if any single filter is unavailable, and the dashboards from Module 2 immediately show the route's success rate dropping while the filter's own availability metric shows the cause. A "Route B" that is fail-open on most filters is the relaxed FAQ-chatbot route — the request keeps flowing through filter outages, the answer might be slightly worse during the outage window, and the dashboards show the filter availability dropping without the route's success rate moving.
The thing to avoid is the silent third option: a filter that is supposed to be fail-safe but, because of an unhandled exception in the filter's wrapper code, actually fails open. This is the most common production bug in safety systems, and Module 2's observability is what makes it findable. Every filter invocation gets its own span; every span has a verdict and a mode attribute (verdict ∈ allow/deny/error; mode ∈ enforce/shadow); and a row in the dashboard shows the count of verdict=error AND fail_behavior=open per route. If that count is rising on a route configured as fail-safe, you have the silent-fail bug, and the alert on it should be loud.
Why durability and visibility are prerequisites
Two callbacks to make the layering explicit, because they are the load-bearing reason this module is Module 3 and not Module 1.
Durability is what makes the route survive a filter outage at all. A fail-safe filter that blocks the request still has to write a clean "task ended in filter-error state" row to the durable store, so the user can retry, the team can see the outage in the trace, and the harness does not leave a half-completed conversation dangling. The checkpoint primitive from Module 1 Step 3 is what makes that clean termination possible. Without durability, a filter outage is a crash; with durability, a filter outage is a clean error.
Visibility is what makes the failure-mode policy enforceable. A fail-open filter that quietly stops working becomes "no filter at all" — and you would not know. The five golden metrics from Module 2 Step 4 (success rate, p95 ticks, p95 latency, dollars-per-task, policy-filter block rate) plus the per-filter availability and verdict-distribution metrics from this module are what turn fail-open from "filter unavailable, request passed" into a question the on-call can answer: was this an acceptable degradation, or has the filter been down for two days? Without observability, fail-open silently degrades into the absence of safety, and the team finds out from a customer complaint.
The honest summary is the one Module 2's closing argument made: the three layers compose in order. Durability gives you a system that survives. Observability gives you a system you can see. Safety gives you a system whose failures do not reach the user — and whose failures, when they occur, are accountable.
Where this module hands off
Each filter you have read about in Steps 2 through 4 costs real wall-clock time and real dollars. The input filter is fast and cheap (Step 2). The output filter is slower because it runs after the model produces its draft, and four output filters at 30 ms each is 120 ms serially or 30 ms in parallel (Step 3). The policy gate is fast on the happy path but expensive when a rule requires looking up state in a durable store (Step 4). The fail-safe / fail-open choice is itself a cost decision — fail-safe routes pay the cost of waiting for the filter to respond before the request can complete; fail-open routes accept worse-case unavailability of the filter as part of the latency budget. Every choice in this module has a cost shape, and the cost shape is what the next module is about.
Module 4 (Cost & Latency Engineering) picks up next. Filters cost real wall-clock time and real dollars, and the next module is about being deliberate about that cost — knowing where the money and the milliseconds go, where caching helps, where parallelization helps, where the right answer is to drop a filter on a route that does not need it. The four production agents I have seen with the best safety records all had policy-driven filter selection by route — not because they were stingy, but because cost discipline is what kept them able to add a new filter when a new threat class arrived. A team that has run out of latency budget for safety has effectively frozen its safety posture; a team that knows where its budget goes has the headroom to keep evolving.
Try it: Stage 5 of the right pane runs two routes side by side — Route A is the strict refund-processing route (every filter fail-safe), Route B is the relaxed FAQ chatbot (filters fail-open at the route level for clarity). Toggle individual filter health from "up" to "down" and watch each route's outcome. The same filter outage breaks Route A's response (clean error, request blocked) and passes through Route B (response continues, often unchanged). The point is to feel the compositionality of the choice — same harness, same filter, two different policies, two different user-facing outcomes. The sim simplifies to one decision per route; the per-cell table above is the granularity a real policy file would carry (the PII filter, for example, stays fail-safe on every customer-facing route — even the FAQ chatbot — because the regulatory cost of a leak is route-independent).