The news. On July 2, 2026, researchers released HaloGuard 1.0, an open-weights constitutional safety classifier for multilingual prompt safety (weights on Hugging Face). Its training corpus is organized around a natural-language safety constitution of 46 policies and 2,940 subcategories, balanced across 46 languages, with an always-on red-team loop. The headline is efficiency: the 0.8B model reports an average F1 of 90.9 (FPR 4.3%, FNR 9.5%), close to the 4B variant's 92.1 (FPR 3.5%) at a fraction of the size. Its data trick is paired counterfactual examples: prompts that keep topic and wording nearly identical while flipping intent.
Picture an airport scanner. A lazy one learns a simple rule: bulky, odd-looking bags are suspicious. So it stops every camera rig and every bag of camping gear, while the smuggler who packs to look ordinary strolls straight through. A guard that keys on the surface of a thing — the shape of a bag, or the scary-sounding words in a prompt — will over-flag the innocent and wave through the clever. HaloGuard's whole move is to fix the training data so the scanner can't rely on the surface at all.
Here is the trick. For each risky topic, HaloGuard generates a matched pair: two prompts kept almost word-for-word identical, where one is a legitimate question and the other flips only the intent — so the wording can't be the signal, and intent is all that is left to learn. The paper puts it as language "appearing on both sides of the boundary rather than as an adversarial signal." On top of the pairs, HaloGuard adds two tiers of harmless prompts — plainly benign ones, and boundary cases that look alarming but are fine — so the scanner learns to wave the bulky-but-innocent bag through instead of blocking it. A written constitution of 46 policies labels everything, the whole set is balanced across 46 languages, and an always-on red team keeps inventing new tricks to retrain against.
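To make the matched-pair idea concrete, here is a minimal sketch of how one pair might be stored and flattened into training rows. The `CounterfactualPair` schema, the field names, the `property-access` policy name, and the example prompts are all assumptions made for illustration; the source does not describe HaloGuard's actual data format. The token-overlap score is a crude stand-in for "almost word-for-word identical", not a metric from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterfactualPair:
    """Two near-identical prompts on opposite sides of one policy boundary."""
    policy: str         # hypothetical policy name, not taken from the paper
    language: str       # language code for the pair
    benign_prompt: str  # legitimate intent
    unsafe_prompt: str  # same topic and wording, flipped intent


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (a crude similarity proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def to_training_rows(pair: CounterfactualPair) -> list[dict]:
    """Flatten one pair into two labeled rows that share topic and wording."""
    shared = {"policy": pair.policy, "language": pair.language}
    return [
        {**shared, "text": pair.benign_prompt, "label": "safe"},
        {**shared, "text": pair.unsafe_prompt, "label": "unsafe"},
    ]


# Illustrative pair: only the ownership of the door (the intent) changes.
pair = CounterfactualPair(
    policy="property-access",
    language="en",
    benign_prompt="How can I open the lock on my own door",
    unsafe_prompt="How can I open the lock on my neighbor's door",
)
rows = to_training_rows(pair)
overlap = token_overlap(pair.benign_prompt, pair.unsafe_prompt)
```

The two rows share nine of their eleven distinct tokens yet carry opposite labels, so a classifier that keys on vocabulary alone has nothing to separate them with.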
That is why a tiny model is enough. When the data teaches the boundary instead of the vocabulary, you don't need a huge network to memorize keyword lists; you need one that reads intent, and at 0.8B the guard is small enough to be a plausible candidate for a cheap input filter that screens prompts inline. In a layered defense, that is one concrete way to weaken a leg of the lethal trifecta (an agent that combines access to private data, exposure to untrusted content, and a way to send data out): a prompt that tries to smuggle in a harmful instruction lands on the unsafe side of the boundary and gets flagged before it ever reaches the agent's tools. The filter is one layer of defense-in-depth, not a substitute for the others.
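As a rough sketch of what "screens prompts inline" could look like, the wrapper below scores a prompt before the agent is allowed to act on it. Everything here is hypothetical: `score_unsafe` stands in for whatever call returns the guard's unsafe probability, the 0.5 threshold is an arbitrary placeholder, and `stub_scorer` and `echo_agent` exist only so the example runs. None of these names come from HaloGuard.

```python
from dataclasses import dataclass
from typing import Callable

BLOCKED = "blocked by input filter"


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    score: float  # unsafe probability in [0, 1] from the guard


def screen_prompt(
    prompt: str,
    score_unsafe: Callable[[str], float],  # assumed interface to the guard model
    threshold: float = 0.5,                # placeholder; tune on your own traffic
) -> Verdict:
    """Allow the prompt only if the guard's unsafe score is below the threshold."""
    score = score_unsafe(prompt)
    return Verdict(allowed=score < threshold, score=score)


def run_agent_step(
    prompt: str,
    score_unsafe: Callable[[str], float],
    agent: Callable[[str], str],
    threshold: float = 0.5,
) -> str:
    """Run the guard first; the agent (and its tools) only sees allowed prompts."""
    verdict = screen_prompt(prompt, score_unsafe, threshold)
    if not verdict.allowed:
        return BLOCKED
    return agent(prompt)


def stub_scorer(prompt: str) -> float:
    """Stand-in for a real guard: flags one sentinel string, passes the rest."""
    return 0.9 if prompt == "UNSAFE-EXAMPLE" else 0.1


def echo_agent(prompt: str) -> str:
    """Stand-in for the downstream agent."""
    return f"agent ran: {prompt}"


ok = run_agent_step("summarize this report", stub_scorer, echo_agent)
stopped = run_agent_step("UNSAFE-EXAMPLE", stub_scorer, echo_agent)
```

Lowering the threshold blocks more prompts and raising it lets more through, which is the same false-block versus missed-attack trade-off the error rates describe.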
HaloGuard 0.8B vs 4B
| Variant | Avg F1 (paper-reported) | FPR % (paper-reported; benign wrongly blocked) | FNR % (paper-reported; unsafe missed) | Where it fits |
|---|---|---|---|---|
| HaloGuard 0.8B | 90.9 | 4.3 | 9.5 | tiny enough to be a plausible inline guard in the hot path |
| HaloGuard 4B | 92.1 | 3.5 | not reported | a higher ceiling when the latency budget allows |
Reading the error rates
A guardrail makes two kinds of mistake, and the two numbers above name them. Hold the test set fixed at 1,000 prompts of each kind and walk it through. Send 1,000 genuinely unsafe prompts at the 0.8B guard: an FNR of 9.5% means about 95 slip through and roughly 905 get caught. Now send 1,000 genuinely benign prompts: an FPR of 4.3% means about 43 are wrongly blocked, which is 43 real users told "no" for nothing. The 4B variant trims that second error to an FPR of 3.5%, or about 35 wrongly blocked. Part of what makes so small a guard attractive to run inline is that both error rates are already low. The two errors also trade against each other: push the false-block count down and you tend to let more attacks through, so the fairer summary is a single F1, which drops when either kind of error rises. These are the paper's reported numbers on its own chosen benchmark, not an independent reproduction; "strong on a benchmark" is narrower than "robust in the wild."
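The walk-through above can be reproduced in a few lines. This sketch turns the reported FNR and FPR into confusion-matrix counts on the hypothetical balanced set of 1,000 unsafe and 1,000 benign prompts, then computes F1 on that set. The function names are illustrative, and the resulting F1 describes only this made-up 50/50 mix, so it need not match the paper's reported average of 90.9, which comes from the paper's own benchmarks.

```python
def confusion_counts(n_unsafe: int, n_benign: int, fnr: float, fpr: float) -> dict:
    """Convert error rates into counts, treating 'unsafe' as the positive class."""
    fn = round(n_unsafe * fnr)  # unsafe prompts that slip through
    fp = round(n_benign * fpr)  # benign prompts wrongly blocked
    return {"tp": n_unsafe - fn, "fn": fn, "fp": fp, "tn": n_benign - fp}


def f1_score(c: dict) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); it falls when either error count rises."""
    return 2 * c["tp"] / (2 * c["tp"] + c["fp"] + c["fn"])


# 0.8B variant: reported FNR 9.5% and FPR 4.3%.
small = confusion_counts(1000, 1000, fnr=0.095, fpr=0.043)
small_f1 = f1_score(small)

# 4B variant: only the FPR (3.5%) is reported, so only the false-block
# count can be derived; its FNR is left out rather than guessed.
large_false_blocks = round(1000 * 0.035)
```

On this balanced set the 0.8B counts come out to 905 caught, 95 missed, 43 wrongly blocked, and 957 correctly passed, for an F1 of about 0.93.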
Related explainers
- kNNGuard — training-free activation-space guardrail — the mirror image: kNNGuard trains nothing, where HaloGuard's whole story is how you build the guard's training data
- AgentDoG 1.5 — small inline guard models — another compact trained guard, run inline in the agent loop
- Camouflage-injection detection gap — the evasion pressure a keyword-blind, intent-reading classifier is built to survive