What is a guard model for agent actions?

A guard model is a small, dedicated classifier that sits inline in an agent's loop and screens each action — a tool call, a shell command, a code-execution request — as safe or risky before it runs. It is separate from the agent LLM doing the work; its only job is the allow/block decision. AgentDoG 1.5 trains such guards at 0.8B, 2B, 4B, and 8B parameters so the screen can run as a cheap sidecar rather than a heavyweight closed safety model.

Why does AgentDoG only need about 1,000 training samples?

It uses a taxonomy-guided data engine to synthesize candidate cases from a structured catalogue of agent risks (including code execution), then applies influence functions to keep only the examples that measurably improve the model and discard the rest. The result is a small, high-signal casebook of roughly 1,000 samples instead of millions, trained in an SFT-then-RL agentic-safety setup. Quality and coverage of the taxonomy matter more than raw sample count.

How does a guard model relate to the lethal trifecta?

The lethal trifecta is private-data access plus untrusted content plus an exfiltration channel — dangerous only when all three combine. A guard model is one way to cut a leg off that triangle: even when an agent holds private data and reads untrusted input, the guard can refuse the specific action that would leak it. Because a small guard is cheap to run on every action, it slots into a defense-in-depth stack alongside capability scoping and data-flow review rather than replacing them.

AgentDoG 1.5 — Small inline guard models for agent actions


learnaivisually.com/ai-explained/agentdog-1-5-inline-guard-models

Jargon

Guard model: A separate, dedicated classifier that screens an agent's inputs and actions for risk — distinct from the agent LLM doing the task. It lives inline in the loop and returns an allow / block decision per action. See Agent Engineering → Output Filters.
Lethal trifecta: The three ingredients that make an agent dangerous together: private data access, exposure to untrusted content, and an exfiltration channel. A guard model is one way to break the chain. See AI Agents → The Lethal Trifecta.
Influence functions: A method that estimates how much each training example actually helps the model. AgentDoG uses it to purge low-value samples so that roughly 1,000 high-signal cases remain — the "purification" step.
Taxonomy-guided data engine: A pipeline that synthesizes training examples from a structured catalogue of risk categories. AgentDoG's taxonomy is updated to explicitly cover code-execution risks, not just text-level prompt attacks.
SFT + RL: Supervised fine-tuning (learn from labeled examples) followed by reinforcement learning (learn from reward in an environment). AgentDoG trains its guards in an "agentic safety" SFT+RL setup so they see realistic action traces, not isolated prompts.
Defense-in-depth: Layering independent filters so no single failure is fatal — input filters, output filters, policy checks, fail-safe defaults. A cheap guard model makes it affordable to add more layers. See Agent Engineering → Defense-in-Depth.

The news. On May 29, 2026, researchers posted AgentDoG 1.5, a lightweight alignment framework for agent safety. It trains guard models at 0.8B, 2B, 4B, and 8B parameters on roughly 1,000 samples, using a taxonomy-guided data engine (now covering code-execution risk) with influence-function purification. The paper reports performance comparable to leading closed models such as GPT-5.4 on agent-risk screening, while cutting Docker-level deployment overhead by about two orders of magnitude.

Picture a doorway with a guard. Every visitor (a delivery, a contractor, someone who insists they were invited) stops at the door, and the guard makes one call: come in, or turn around. That is the job of a guard model in an agent system. The agent emits an action (call this tool, run this shell command, execute this code) and, before the action reaches the real world, a small dedicated model screens it. This guard is a rookie at the door, not the chief of security: a 0.8B-to-8B model whose only skill is clearing safe actions and stopping risky ones, fast. The expensive alternative is to put the veteran chief on the door: a large closed safety model that is just as sharp but costs far more to keep standing by for every action.

What makes the rookie good is the casebook, not the size of its brain. AgentDoG's guards are not trained on millions of examples; they are trained on roughly 1,000. A taxonomy-guided data engine synthesizes candidate cases from a structured list of agent risks — and crucially the taxonomy is extended to cover code execution, the place where a steered agent does the most damage. Then influence functions estimate which of those synthesized cases actually move the model and throw the rest away, the way you would keep the three case files that taught a real lesson and bin the hundred that were routine. The cleaned set trains the guard in an SFT + RL loop that shows it realistic action traces rather than isolated prompts.
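The purification idea can be sketched in a few lines. The block below is a minimal illustration and not AgentDoG's pipeline: it swaps in a first-order gradient-similarity proxy (it omits the inverse-Hessian term of classical influence functions) on a toy logistic-regression model. The names `loss_grad`, `influence_scores`, and `purify`, and all of the data, are hypothetical.

```python
import math
from typing import List, Tuple

Example = Tuple[List[float], int]  # (features, label in {0, 1})


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad(w: List[float], ex: Example) -> List[float]:
    """Gradient of the logistic loss for one example with respect to w."""
    x, y = ex
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def influence_scores(w: List[float], train: List[Example],
                     val: List[Example]) -> List[float]:
    """Proxy score per training example: dot(grad_train, mean grad_val).

    A positive score means a gradient step on that training example is
    expected, to first order, to lower the validation loss.
    """
    val_grads = [loss_grad(w, v) for v in val]
    mean_val = [sum(g[i] for g in val_grads) / len(val) for i in range(len(w))]
    return [sum(a * b for a, b in zip(loss_grad(w, t), mean_val)) for t in train]


def purify(w: List[float], train: List[Example],
           val: List[Example], keep: int) -> List[Example]:
    """Keep the `keep` highest-scoring training examples, drop the rest."""
    scores = influence_scores(w, train, val)
    ranked = sorted(range(len(train)), key=lambda i: scores[i], reverse=True)
    return [train[i] for i in ranked[:keep]]


# Toy data (assumption): one helpful, one mislabeled, one irrelevant example.
w0 = [0.0, 0.0]
val_set = [([1.0, 0.0], 1), ([-1.0, 0.0], 0)]
helpful = ([1.0, 0.0], 1)
mislabeled = ([1.0, 0.0], 0)
irrelevant = ([0.0, 1.0], 1)
kept = purify(w0, [mislabeled, irrelevant, helpful], val_set, keep=2)
```

With this toy data the mislabeled case scores negative and is the one discarded. A production system would score synthesized action traces against a held-out safety set with a proper influence estimator; this sketch only shows the keep-what-helps selection step.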

In the layered-defense picture, a guard model is an input and output filter standing in the agent loop. It is one concrete way to cut a leg off the lethal trifecta: even when an agent has private data and reads untrusted content, the guard can refuse the action that would exfiltrate it through a tool call. Because the guard is cheap, you can afford to run it on every action and still add other layers, which is the whole point of defense-in-depth. The design choice that remains yours is what the guard does when it is unsure: fail-safe (block and ask) is the conservative default, while fail-open (allow) trades safety for uptime.
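A minimal sketch of where that fail-safe versus fail-open choice lives in code. Everything here is an assumption made for illustration: the `Action` type, the `GuardFn` interface that returns a risk score or `None` when unsure, the 0.5 threshold, and the keyword-based `toy_guard` that stands in for a real guard model. None of it is AgentDoG's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"


@dataclass
class Action:
    kind: str     # e.g. "tool_call", "shell", "code_exec"
    payload: str  # the command, code, or serialized tool arguments


# Hypothetical guard interface: a risk score in [0, 1],
# or None when the guard cannot produce a confident score.
GuardFn = Callable[[Action], Optional[float]]


def screen(action: Action, guard: GuardFn, threshold: float = 0.5,
           fail_safe: bool = True) -> Verdict:
    """Screen one action before it runs."""
    try:
        risk = guard(action)
    except Exception:
        risk = None  # treat a guard crash the same as "unsure"
    if risk is None:
        # The policy choice: block when unsure (fail-safe) or allow (fail-open).
        return Verdict.BLOCK if fail_safe else Verdict.ALLOW
    return Verdict.BLOCK if risk >= threshold else Verdict.ALLOW


def toy_guard(action: Action) -> Optional[float]:
    """Stand-in for a real guard model; keyword rules for illustration only."""
    if "curl" in action.payload and "secrets" in action.payload:
        return 0.95
    if action.kind == "code_exec":
        return None  # pretend the guard is unsure about arbitrary code
    return 0.05


print(screen(Action("shell", "ls -la"), toy_guard))
print(screen(Action("code_exec", "print(1)"), toy_guard))
print(screen(Action("code_exec", "print(1)"), toy_guard, fail_safe=False))
```

The agent loop would call `screen` on every action and only execute those that come back `Verdict.ALLOW`. Keeping the unsure case explicit, rather than folding it into the threshold, makes the fail-safe default a visible configuration decision.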

How the guard sizes stack up

Guard variant	Params	Approx. weight footprint	Where it fits
AgentDoG-0.8B	~0.8B	~0.4 GB (4-bit, derived: 0.5 byte/param)	Sidecar on the same GPU, or even CPU
AgentDoG-2B	~2B	~1 GB (4-bit, derived)	Sidecar on the agent's GPU
AgentDoG-4B	~4B	~2 GB (4-bit, derived)	Sidecar on the agent's GPU
AgentDoG-8B	~8B	~4 GB (4-bit, derived)	Shares one GPU with the agent
Closed safety model	tens of billions (setup-dependent, illustrative)	~100+ GB (FP16, illustrative)	Its own Docker-sandboxed service

Where the ~100× actually comes from

Hold the catch rate fixed (the paper's claim is that the small guard matches the closed model there) and the win is in Docker-level deployment overhead: the standing service and sandbox that each screening call depends on. Walk through a back-of-envelope version (illustrative footprints; the preprint reports the overhead ratio, not these absolutes). A frontier-scale closed safety model runs to tens of billions of parameters, on the order of 100 GB in FP16, and is typically deployed as its own sandboxed service. AgentDoG-8B in 4-bit is about 8B × 0.5 byte ≈ 4 GB, and the 0.8B variant is about 0.4 GB, small enough to ride as an in-process sidecar next to the agent rather than as a separate service. That difference, a separate Docker-sandboxed service versus an in-process sidecar, is where the roughly two-orders-of-magnitude (~100×) deployment-overhead cut the paper reports comes from, and it is what makes screening every action affordable instead of sampling a few.
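The footprint arithmetic above can be reproduced in a few lines. This is a back-of-envelope sketch: it counts weights only (no KV cache, activations, or runtime overhead), and the 50B-parameter closed model is an assumed stand-in for "tens of billions", not a figure from the preprint.

```python
def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB, taking 1 GB as 1e9 bytes."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


FOUR_BIT = 0.5  # bytes per parameter at 4-bit quantization
FP16 = 2.0      # bytes per parameter at 16-bit precision

GUARD_SIZES_B = [0.8, 2, 4, 8]
guards = {f"AgentDoG-{s}B": weight_footprint_gb(s, FOUR_BIT) for s in GUARD_SIZES_B}

# Assumption: a 50B-parameter closed model served in FP16 (illustrative only).
closed_gb = weight_footprint_gb(50, FP16)

for name, gb in guards.items():
    print(f"{name}: {gb:.1f} GB, closed/guard weight ratio ~{closed_gb / gb:.0f}x")
```

Under these assumptions the weights-only ratio ranges from about 25× for the 8B guard to about 250× for the 0.8B guard. That is a different quantity from the deployment-overhead ratio the paper reports, which also reflects running as an in-process sidecar instead of a separate sandboxed service.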

The catch, and the reason a small guard is not a free win: a model trained on ~1k cases only knows the risks in its taxonomy, so coverage gaps are real, and a determined attacker can probe for the action it does not recognize — the same evasion pressure that the camouflage-injection detection gap explainer describes. The parity-with-GPT-5.4 number is the paper's reported result, not an independent reproduction, and "comparable on a benchmark taxonomy" is narrower than "as safe in the wild." Treat AgentDoG as a cheap, always-on layer — not a replacement for capability scoping and a real data-flow review.

Goes deeper in: Agent Engineering → Layered Guardrails → Output Filters

Related explainers

Camouflage-injection detection gap — the attack a guard model has to catch: prompt injection hidden where a screener does not look
Copilot/Cowork image-URL exfiltration — a concrete exfiltration channel an output filter is meant to block
Glasswing — detection in a saturated pipeline — what changes when the screener itself becomes the bottleneck


