The news. On July 2, 2026, researchers posted kNNGuard, a training-free safety guardrail built on an off-the-shelf LLM's activation space rather than a fine-tuned classifier. Given a small bank of 50 safe and unsafe prompts, it reads hidden activations from multiple layers of a frozen model and runs multi-layer kNN, fusing activation-space and embedding-space distances to flag unsafe, off-topic, or adversarial prompts. Across six topical and security domains it reports matching or beating fine-tuned state-of-the-art guardrails while running 2.7× faster than the best comparable guardrail and 10× faster than a fine-tuned safety classifier; the domain bank is buildable in under 10 seconds.

Picture a triage desk. A patient walks up, and instead of sending them to a specialist who trained for years, the nurse glances at the vitals monitor and flips through a slim binder of past cases. Whichever few cases the readings most resemble — that is the call. kNNGuard runs an agent's prompt screening exactly this way: it never trains a dedicated classifier; it reads signals the model already produces and matches them to a handful of labeled examples. The monitor here is a frozen, off-the-shelf LLM, and the "vital signs" are its hidden activations — the internal vectors it computes as it reads the prompt.

Here is why that works. When a capable model reads a prompt, its mid-layer activations already pull safe and unsafe prompts into separate regions of a high-dimensional space — the geometry is there before anyone asks the model to output a word. kNNGuard exploits it directly: it stores the activations (and a meaning-based embedding) for its 50 labeled examples, then for each new prompt asks one question — which stored cases does this sit nearest to? The nearest neighbors vote, safe or unsafe, and to make the vote sturdier it fuses the activation-space distance with the embedding-space distance. No gradients, no fine-tuning, no dedicated safety model to train.
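
The lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the cosine metric, the averaging of per-layer distances, the fusion weight `alpha`, and `k=5` are all hypothetical choices, and the function expects activations and embeddings that have already been extracted from the frozen model.

```python
import numpy as np

def cosine_distances(query, bank):
    """Cosine distance from one query vector to every row of `bank`."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return 1.0 - b @ q

def knn_guard_vote(layer_acts, embedding, bank_layer_acts, bank_embeddings,
                   bank_labels, k=5, alpha=0.5):
    """Return (is_unsafe, unsafe_fraction) for one prompt.

    layer_acts:       list of 1-D arrays, one per probed layer (the new prompt)
    bank_layer_acts:  list of 2-D arrays (n_bank, dim), same layer order
    bank_embeddings:  2-D array (n_bank, emb_dim)
    bank_labels:      1-D array, 1 = unsafe, 0 = safe
    alpha:            assumed fusion weight between the two spaces
    """
    # Assumption: combine layers by averaging their per-layer distances.
    act_dist = np.mean(
        [cosine_distances(q, b) for q, b in zip(layer_acts, bank_layer_acts)],
        axis=0,
    )
    emb_dist = cosine_distances(embedding, bank_embeddings)

    # Fuse activation-space and embedding-space distances into one score.
    fused = alpha * act_dist + (1.0 - alpha) * emb_dist

    # The k nearest stored cases vote; a majority of unsafe labels flags the prompt.
    nearest = np.argsort(fused)[:k]
    unsafe_fraction = float(np.mean(bank_labels[nearest]))
    return unsafe_fraction > 0.5, unsafe_fraction
```

Nothing here touches model weights: the only state is the stored vectors and their labels, which is why no gradient step is involved.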

In the layered-defense picture, this is an input filter standing at the front of the agent loop. It is one concrete way to cut a leg off the lethal trifecta: a prompt that tries to smuggle in untrusted instructions lands near the "unsafe" cases and gets flagged before it ever reaches the model's tools. Because the check is so cheap, you can afford to run it on every prompt and still stack other layers behind it — which is the entire point of defense-in-depth. The design choice that stays yours is what to do on a flag: fail-safe (block and escalate) is the conservative default.
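
The fail-safe default can be sketched as a thin wrapper around whatever guard you plug in. The names `ScreenResult` and `screen_prompt` are hypothetical, and treating a guard error the same as a flag (failing closed) is an assumption layered on top of the block-and-escalate default described above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScreenResult:
    allowed: bool
    reason: str

def screen_prompt(prompt: str, guard: Callable[[str], bool]) -> ScreenResult:
    """Fail-safe input filter: block on a flag and on any guard error.

    `guard` is any callable that returns True when the prompt looks unsafe.
    """
    try:
        flagged = guard(prompt)
    except Exception as exc:
        # Assumption: if the guard itself fails, fail closed rather than open.
        return ScreenResult(False, f"guard error: {exc}")
    if flagged:
        return ScreenResult(False, "flagged as unsafe; escalate for review")
    return ScreenResult(True, "passed input filter")
```

Only a prompt with `allowed=True` should go on to the model and its tools; later layers in the stack still run after this check.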

Fine-tuned classifier vs kNNGuard

| Property | Fine-tuned safety classifier | kNNGuard |
| --- | --- | --- |
| How it decides | Trained weights output a label | kNN over stored activations + embeddings |
| Build cost | Collect labeled data, run a training job | Encode a 50-prompt bank (under 10s) |
| Per-prompt latency | Full forward pass of a dedicated model (baseline) | 2.7× faster than best comparable guard; 10× faster than a fine-tuned classifier |
| New domain | Gather data and re-train | Swap the 50-prompt bank |
| Reported quality | State of the art | Competitive or superior F1 across six domains (paper-reported) |

Where the speed and the seconds come from

The win splits into two costs: one paid once, one paid on every prompt. Take the build cost first. To guard a new domain, a fine-tuned classifier needs labeled data and a full training run, repeated each time the domain shifts. kNNGuard needs only to encode the 50 labeled prompts into vectors, which the paper clocks at under 10 seconds, with no gradient step at all. Now the per-prompt cost: the fine-tuned guard runs a full forward pass of its own dedicated model for every incoming prompt. kNNGuard reads activations from a model that may already be in the loop, then does a kNN lookup over 50 stored vectors, a distance comparison that is trivially cheap next to a forward pass. The per-prompt saving is what the paper reports as 2.7× faster than the best comparable guardrail and 10× faster than a fine-tuned safety classifier, at a competitive catch rate; the under-10-second bank build is the separate, one-time saving.
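
The split between the one-time build and the per-prompt lookup can be sketched as a small registry of banks. Everything named here (`BankRegistry`, `build`, `nearest_labels`, the `encode` callable) is hypothetical: `encode` stands in for whatever turns a prompt into a vector, and plain Euclidean distance over a single vector space is a simplifying assumption.

```python
import numpy as np
from typing import Callable, Dict, Sequence, Tuple

Encoder = Callable[[str], np.ndarray]

class BankRegistry:
    """Holds one small labeled bank per domain."""

    def __init__(self, encode: Encoder):
        self.encode = encode
        self.banks: Dict[str, Tuple[np.ndarray, np.ndarray]] = {}

    def build(self, domain: str, prompts: Sequence[str], labels: Sequence[int]) -> None:
        """One-time cost: encode each labeled prompt. No gradient step."""
        vectors = np.stack([self.encode(p) for p in prompts])
        # Rebuilding under the same domain name swaps the bank in place.
        self.banks[domain] = (vectors, np.asarray(labels))

    def nearest_labels(self, domain: str, prompt: str, k: int = 5) -> np.ndarray:
        """Per-prompt cost: one encode plus a distance to each stored vector."""
        vectors, labels = self.banks[domain]
        dists = np.linalg.norm(vectors - self.encode(prompt), axis=1)
        return labels[np.argsort(dists)[:k]]
```

Moving to a new domain means calling `build` with a different set of labeled prompts; nothing is retrained, and the per-prompt work stays proportional to the size of the bank.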

The catch, and the reason a lookup is not a free lunch: a bank of 50 examples only recognizes the risks those examples cover, so a determined attacker can probe for the phrasing that sits between the stored cases — the same evasion pressure the camouflage-injection detection gap explainer describes. The parity numbers are the paper's reported results across a chosen set of six domains, not an independent reproduction, and "competitive on a benchmark" is narrower than "robust in the wild." Treat kNNGuard as a cheap, instantly-adaptable layer — not a replacement for capability scoping and a real data-flow review.
