AgentDoG 1.5 — Small inline guard models for agent actions

[Figure: Inline guard model: same catch rate, ~100× less deploy overhead. Labels: agent emits actions (tools, shell, code-exec in Docker); every agent action (tool call / shell) is screened by the guard; closed safety model shown for comparison; deploy overhead ~100× (Docker-scale); agent-risk caught ≈ closed model.]
learnaivisually.com/ai-explained/agentdog-1-5-inline-guard-models

The news. On May 29, 2026, researchers posted AgentDoG 1.5, a lightweight alignment framework for agent safety. It trains guard models at 0.8B, 2B, 4B, and 8B parameters on roughly 1,000 samples, using a taxonomy-guided data engine (now covering code-execution risk) with influence-function purification. The paper reports performance comparable to leading closed models such as GPT-5.4 on agent-risk screening, while cutting Docker-level deployment overhead by about two orders of magnitude.

Picture a doorway with a guard. Every visitor — a delivery, a contractor, someone who insists they were invited — stops at the door and the guard makes one call: come in, or turn around. That is exactly the job of a guard model in an agent system. The agent emits an action — call this tool, run this shell command, execute this code — and before the action reaches the real world, a small dedicated model screens it. The rookie at the door is not the chief of security; it is a 0.8B-to-8B model whose only skill is clearing safe actions and stopping risky ones, fast. The expensive alternative is to put the veteran chief on the door: a large closed safety model that is just as sharp but costs far more to keep standing by for every action.

What makes the rookie good is the casebook, not the size of its brain. AgentDoG's guards are not trained on millions of examples; they are trained on roughly 1,000. A taxonomy-guided data engine synthesizes candidate cases from a structured list of agent risks — and crucially the taxonomy is extended to cover code execution, the place where a steered agent does the most damage. Then influence functions estimate which of those synthesized cases actually move the model and throw the rest away, the way you would keep the three case files that taught a real lesson and bin the hundred that were routine. The cleaned set trains the guard in an SFT + RL loop that shows it realistic action traces rather than isolated prompts.
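The purification idea can be sketched in a few lines. This is a minimal illustration, not the paper's method: it swaps in a first-order gradient-similarity score (the Hessian term of a full influence function is dropped) on a toy logistic-regression model, and every name here (`loss_grad`, `influence_scores`, `purify`) is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_grad(w, x, y):
    """Gradient of the logistic loss for one (features, label) example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

def influence_scores(w, candidates, validation):
    """First-order proxy for influence: dot product between each candidate's
    gradient and the mean validation gradient. A positive score means a
    gradient step on that candidate should lower validation loss."""
    n = len(validation)
    val_grad = [0.0] * len(w)
    for x, y in validation:
        g = loss_grad(w, x, y)
        val_grad = [a + b / n for a, b in zip(val_grad, g)]
    return [
        sum(a * b for a, b in zip(loss_grad(w, x, y), val_grad))
        for x, y in candidates
    ]

def purify(w, candidates, validation, keep):
    """Return the indices of the `keep` highest-scoring candidates."""
    scores = influence_scores(w, candidates, validation)
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Toy data (assumed): candidate 1 carries the wrong label for its features.
weights = [0.0, 0.0]
validation_set = [([1.0, 0.0], 1)]
candidate_set = [([1.0, 0.0], 1), ([1.0, 0.0], 0), ([0.0, 1.0], 1)]
kept = purify(weights, candidate_set, validation_set, keep=2)
```

In this toy setup the mislabeled candidate gets a negative score and is dropped, which is the "bin the routine or misleading case files" step in miniature. A real pipeline would score synthesized action traces against a language-model guard rather than a two-weight classifier.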

In the layered-defense picture, a guard model is an input and output filter standing in the agent loop. It is one concrete way to cut a leg off the lethal trifecta (private data, untrusted content, and a channel to send data out): even when an agent has private data and reads untrusted content, the guard can refuse the action that would exfiltrate it through a tool call. Because the guard is cheap, you can afford to run it on every action and still add other layers, which is the whole point of defense-in-depth. The design choice that remains yours is what the guard does when it is unsure: fail-safe (block and ask) is the conservative default, while fail-open (allow) trades safety for uptime.
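A minimal sketch of that placement, assuming a guard that returns one of three verdicts. The `Action` and `Verdict` types, `toy_guard` (keyword rules standing in for a trained guard model), and `run_agent_step` are all hypothetical names for illustration, not AgentDoG's interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    UNSURE = "unsure"

@dataclass(frozen=True)
class Action:
    tool: str       # e.g. "shell", "http", "python"
    argument: str   # the command, URL, or code the agent wants to run

Guard = Callable[[Action], Verdict]

def toy_guard(action: Action) -> Verdict:
    """Hypothetical keyword rules standing in for a trained guard model."""
    text = action.argument.lower()
    if "rm -rf" in text or "/etc/shadow" in text:
        return Verdict.BLOCK
    if action.tool == "shell" and "curl" in text:
        return Verdict.UNSURE  # could be exfiltration, could be benign
    return Verdict.ALLOW

def may_execute(action: Action, guard: Guard, fail_safe: bool = True) -> bool:
    """Screen one action. An unsure verdict is blocked under fail-safe
    and allowed under fail-open."""
    try:
        verdict = guard(action)
    except Exception:
        verdict = Verdict.UNSURE  # a crashed guard is treated as unsure
    if verdict is Verdict.ALLOW:
        return True
    if verdict is Verdict.BLOCK:
        return False
    return not fail_safe

def run_agent_step(action: Action, guard: Guard,
                   execute: Callable[[Action], str],
                   fail_safe: bool = True) -> str:
    """Every action passes the guard before it reaches the real world."""
    if may_execute(action, guard, fail_safe):
        return execute(action)
    return f"blocked: {action.tool}"
```

The only policy knob is `fail_safe`. Treating a guard crash the same as an unsure verdict means an outage in the guard degrades to the same behavior you chose for ambiguity, rather than silently letting actions through.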

How the guard sizes stack up

| Guard variant | Params | ~4-bit footprint | Where it fits |
| --- | --- | --- | --- |
| AgentDoG-0.8B | ~0.8B | ~0.4 GB (derived: 0.5 byte/param) | Sidecar on the same GPU, or even CPU |
| AgentDoG-2B | ~2B | ~1 GB (derived) | Sidecar on the agent's GPU |
| AgentDoG-4B | ~4B | ~2 GB (derived) | Sidecar on the agent's GPU |
| AgentDoG-8B | ~8B | ~4 GB (derived) | Shares one GPU with the agent |
| Closed safety model | tens of billions (setup-dependent, illustrative) | ~100+ GB (FP16 rather than 4-bit, illustrative) | Its own Docker-sandboxed service |

Where the ~100× actually comes from

Hold the catch rate fixed (the paper's claim is that the small guard matches the closed model there), and the win is in Docker-level deployment overhead: the standing service and sandbox that every screening call depends on. Walk through a back-of-envelope version (illustrative footprints; the preprint reports the overhead ratio, not these absolutes). A frontier-scale closed safety model runs to tens of billions of parameters, roughly 100 GB in FP16 at 2 bytes per parameter, and is typically deployed as its own sandboxed service. AgentDoG-8B in 4-bit is about 8B × 0.5 byte ≈ 4 GB, and the 0.8B variant is under 0.5 GB, small enough to ride as an in-process sidecar next to the agent rather than as a separate service. That difference, a separate Docker-sandboxed service versus an in-process sidecar, is the roughly two-orders-of-magnitude (~100×) deployment-overhead cut the paper reports, and it is what makes screening every action affordable instead of sampling a few.
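The footprint arithmetic above is just parameters times bytes per parameter. A small sketch reproduces it; the 50B figure for the closed model is an assumed stand-in for "tens of billions", and none of these absolutes come from the preprint.

```python
# Bytes per parameter at each precision (weights only, no KV cache or runtime overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "int4": 0.5}

def footprint_gb(params_billions: float, precision: str) -> float:
    """Weight footprint in GB: billions of params x bytes per param."""
    return params_billions * BYTES_PER_PARAM[precision]

# Guard sizes from the article, quantized to 4-bit.
guards = {
    name: footprint_gb(size, "int4")
    for name, size in [("AgentDoG-0.8B", 0.8), ("AgentDoG-2B", 2.0),
                       ("AgentDoG-4B", 4.0), ("AgentDoG-8B", 8.0)]
}

# Assumption: a 50B-parameter closed model served in FP16.
closed_model_gb = footprint_gb(50.0, "fp16")

for name, gb in guards.items():
    print(f"{name}: ~{gb:g} GB vs closed model ~{closed_model_gb:g} GB")
```

Note that these weight footprints alone do not multiply out to exactly 100×. The reported ratio concerns deployment overhead (a separate sandboxed service versus a sidecar), so the numbers here only show why the small guards can share the agent's hardware at all.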

The catch, and the reason a small guard is not a free win: a model trained on ~1k cases only knows the risks in its taxonomy, so coverage gaps are real, and a determined attacker can probe for the action it does not recognize — the same evasion pressure that the camouflage-injection detection gap explainer describes. The parity-with-GPT-5.4 number is the paper's reported result, not an independent reproduction, and "comparable on a benchmark taxonomy" is narrower than "as safe in the wild." Treat AgentDoG as a cheap, always-on layer — not a replacement for capability scoping and a real data-flow review.
