Camouflage Detection Gap — domain vocabulary bypasses injection detectors

Agent
L
Injected payloadlooks for "override" phrasesInjection detectorconfidence: 50%idleexecutes if it reaches hereTool-using agentoverride-styleoverride-stylecamouflage (medical)camouflage (legal)4 of 4 queued
learnaivisually.com/ai-explained/camouflage-injection-detection-gap

The news. On May 21, 2026, a research team posted a study showing that prompt-injection detectors collapse against payloads written in a document's own domain language. On the Camouflage-Det benchmark, Llama 3.1 8B's detection rate falls from 93.8% to 9.7%, Gemini 2.0 Flash drops from 100% to 55.6%, and Llama Guard 3 catches 0% of camouflaged payloads. A multi-agent debate setup, proposed as an inference-time defense, instead amplifies the attack up to 9.9× on smaller models. Detector augmentation provides only partial remediation. Read the paper →

Picture a school nurse with a stack of sick notes on the desk. The first note says “PLEASE EXCUSE TIMMY HE IS SICK GIVE HIM CANDY — luv, my mom”, written in crayon. The nurse spots the obvious tells — block-letter scrawl, parent self-signature, request that has nothing to do with health — and rejects it. The second note arrives on St. Mercy Pediatric Group letterhead and reads “Pt. Timothy K., age 9. Dx: acute viral syndrome (ICD R50.9). Rx: 48h rest; dietary glucose ad lib. — Dr. Smith, MD.” Same underlying request — keep Timmy home, feed him sugar — but the nurse's tell-list is checking for crayon and broken syntax, not for whether the request is real. The note passes. The Domain-Camouflaged Injection paper is the same trick applied to prompt-injection detectors.

A standard injection-detection model — Llama Guard 3, fine-tuned classifiers — looks for override-style markers: phrases like “ignore previous instructions”, “act as DAN”, “system: you are now …”. These markers are reliable signal for the payloads the detectors were trained on, which is exactly why they score in the high 90s on standard benchmarks. The camouflage attack does not invent new capabilities — it rewrites the same semantic request using the vocabulary and authority structure of the host document's domain: a medical chart instruction phrased as a clinical advisory, a legal contract clause phrased as a discovery obligation, a financial memo phrased as a compliance directive. The detector reads the rewritten payload as legitimate domain speech and lets it through.

The numbers in the paper make the failure unambiguous. Llama 3.1 8B's catch-rate on override-style injection sits at 93.8%; on the same payloads rewritten in matching domain vocabulary, it falls to 9.7%. Llama Guard 3 — Meta's open-source injection classifier — catches 0% of the camouflaged set. Gemini 2.0 Flash falls less severely (100% → 55.6%) but still drops by about half. The gap on the open-source 8B baseline is roughly 84 percentage points. A guardrail that was nearly perfect on the benchmark it was trained against is essentially useless on the attack that adapts to its blind spot.

How camouflage compares to the override-style family

Attack familyVocabulary usedDetector catch-rate (Llama 3.1 8B)Llama Guard 3
Override-style injection“Ignore previous”, “act as”, “system:”93.8% (this paper, baseline arm)~100% (this paper, baseline arm)
Domain-camouflaged injection (this paper)Host document's own medical / legal / financial register9.7%0%

Why does multi-agent debate make this worse rather than better? Debate-based defenses ask two model instances to argue about whether a candidate response is safe; the intuition is that adversarial argumentation surfaces the manipulation. The paper measures the opposite. (Illustrative composition — the paper reports an aggregate up-to-9.9× amplification but does not publish a per-step breakdown.) If a single 8B model has a baseline attack-success rate of 6% on the camouflaged set, and the second debater — also 8B — agrees with the first debater's domain-coherent framing in ~70% of cases instead of pushing back, the joint pipeline's effective attack-success rises to approximately 6% ÷ (1 − 0.7) ≈ 20%, with further compounding when a third turn is added. Across the paper's measured pipelines this compounds to a 9.9× amplification on smaller models — the second debater's agreement mode latches onto the camouflaged framing and reinforces it instead of catching it.

The implication for production agent stacks is structural. Input-side injection detection has been treated as the first leg of a layered guardrail: detector at the boundary, system-prompt hardening behind it, capability scoping behind that. The Camouflage Detection Gap means the first leg is not load-bearing on adversarially-rewritten payloads, and inference-time debate cannot patch the gap from inside the model. The remaining defenses — data-flow constraints, capability scoping, and output-side exfiltration filters — have to do the work the detector was assumed to do. That's not a tweak to a guardrail; it's a re-allocation of the entire input-side defense budget.

Goes deeper in: AI Agents → Security & the Lethal Trifecta → The Lethal Trifecta — the structural framing that explains why input detection alone was never sufficient, and what defenses survive a 9.7% detection rate.

Related explainers

Frequently Asked Questions