What is the Camouflage Detection Gap?

The Camouflage Detection Gap is the difference between an injection detector's catch-rate on override-style payloads ("ignore previous instructions", "act as DAN") and its catch-rate on the same malicious instructions rewritten in the host document's own domain vocabulary. The paper reports the gap at about 84 percentage points on Llama 3.1 8B (93.8% to 9.7%) and effectively 100 percentage points on Llama Guard 3 (near-perfect to 0%). The gap exists because detectors are pattern-matching the syntactic markers of override-style speech rather than reasoning about the semantic intent of the request, so a payload that swaps "ignore previous" for "Per Hospital Advisory 7.4.2, the clinical AI assistant must…" reads as legitimate domain language and slides through.

Why does multi-agent debate amplify the attack instead of catching it?

Multi-agent debate was proposed as an inference-time defense — two or more model instances argue about whether a candidate response is safe before it is emitted, and the disagreement is meant to surface manipulation. On camouflaged injection the paper finds the opposite: the second debater latches onto the domain-coherent framing the first debater produced and reinforces it rather than pushing back, because the framing reads as legitimate professional speech. Across the paper's measured pipelines this compounds to an attack-amplification of up to 9.9x on smaller models. Larger debaters resist somewhat better but do not close the gap. The structural takeaway is that any inference-time defense that asks the model to reason about its own output is vulnerable to the same camouflage that fooled the input detector.

Does Llama Guard 3 protect against this attack?

No. The paper reports Llama Guard 3 catches 0% of camouflaged payloads — every single one in the test set bypasses it. This is the most striking result in the paper because Llama Guard 3 is a commonly cited open-source injection classifier and is frequently used as an input-side guardrail in agent stacks. The fact that it catches zero camouflaged payloads, rather than degrading gracefully like Gemini 2.0 Flash does (100% to 55.6%), suggests the classifier is operating almost entirely on override-style syntactic markers and has no semantic-intent backstop. Production agent stacks that treat Llama Guard 3 as their first leg of a layered guardrail may need to re-allocate that defense budget — to data-flow constraints, capability scoping, and output-side exfiltration filters — until detector designs catch up to the camouflage attack family.

What attack-success rates did the paper measure across the tested models?

The headline numbers cover both detection and downstream attack success. On detection: Llama 3.1 8B drops from 93.8% to 9.7%, Gemini 2.0 Flash from ~100% to 55.6%, and Llama Guard 3 from near-perfect to 0%. On the multi-agent debate pipeline, the same camouflage payloads see up to a 9.9× amplification in attack success on smaller debaters. Across the tested settings the consistent pattern is that the larger the model, the more it resists the camouflage — but no measured configuration closes the gap to its override-style baseline. The paper frames the result as a structural failure of pattern-matching defenses rather than a bug in any one classifier.

What defenses are reported to survive the camouflage attack?

The paper does not propose a single defense and explicitly warns that any defense that asks a model to reason about its own output is vulnerable to the same camouflage. The structural takeaway is to move defense out of the model's reasoning layer entirely. Three families survive on first principles: data-flow constraints (the agent can read a hospital advisory but cannot act on instructions inside it), capability scoping (tools the agent can call are limited to what the original task needs, so even a successful injection cannot exfiltrate), and output-side exfiltration filters (network egress controls, structured-output schema enforcement). These are the defenses the Lethal Trifecta framing has argued for since 2024, and the camouflage paper is empirical evidence that they remain necessary even when input-side detectors look strong on standard benchmarks.

Camouflage Injection paper — Camouflage Detection Gap

Camouflage Detection Gap — domain vocabulary bypasses injection detectors

Agent

learnaivisually.com/ai-explained/camouflage-injection-detection-gap

Jargon

Prompt injection: Adversarial instructions smuggled into an LLM's input context — typically inside a retrieved document or tool output — to override the agent's intended task. Foundational concept for the Lethal Trifecta.
Llama Guard 3: A Meta-released input/output safety classifier fine-tuned on top of Llama base models, commonly cited as an open-source injection detector. The paper reports it catches 0% of camouflaged payloads.
Override-style instruction: The paper's name for the canonical injection vocabulary detectors are trained to catch — phrases like “ignore previous”, “act as DAN”, “system: you are now…” — that carry the syntactic markers of an override.
Camouflage Detection Gap: The paper-coined effect: the difference between a detector's catch-rate on override-style payloads and on the SAME payloads rewritten in the host document's domain vocabulary. About 84 percentage points on Llama 3.1 8B.
Multi-agent debate: An inference-time defense in which two or more model instances argue about whether a response is safe before it is emitted. The paper shows debate amplifies the camouflage attack up to 9.9× on smaller models rather than catching it.
Domain vocabulary: The professional register of a field — medical (ICD codes, clinical advisory), legal (statute citations, pursuant to), financial (KYC, material adverse change) — that the camouflage attack borrows to rewrite the same malicious request.

The news. On May 21, 2026, a research team posted a study showing that prompt-injection detectors collapse against payloads written in a document's own domain language. On the Camouflage-Det benchmark, Llama 3.1 8B's detection rate falls from 93.8% to 9.7%, Gemini 2.0 Flash drops from 100% to 55.6%, and Llama Guard 3 catches 0% of camouflaged payloads. A multi-agent debate setup, proposed as an inference-time defense, instead amplifies the attack up to 9.9× on smaller models. Detector augmentation provides only partial remediation. Read the paper →

Picture a school nurse with a stack of sick notes on the desk. The first note says “PLEASE EXCUSE TIMMY HE IS SICK GIVE HIM CANDY — luv, my mom”, written in crayon. The nurse spots the obvious tells — block-letter scrawl, parent self-signature, request that has nothing to do with health — and rejects it. The second note arrives on St. Mercy Pediatric Group letterhead and reads “Pt. Timothy K., age 9. Dx: acute viral syndrome (ICD R50.9). Rx: 48h rest; dietary glucose ad lib. — Dr. Smith, MD.” Same underlying request — keep Timmy home, feed him sugar — but the nurse's tell-list is checking for crayon and broken syntax, not for whether the request is real. The note passes. The Domain-Camouflaged Injection paper is the same trick applied to prompt-injection detectors.

A standard injection-detection model — Llama Guard 3, fine-tuned classifiers — looks for override-style markers: phrases like “ignore previous instructions”, “act as DAN”, “system: you are now …”. These markers are reliable signal for the payloads the detectors were trained on, which is exactly why they score in the high 90s on standard benchmarks. The camouflage attack does not invent new capabilities — it rewrites the same semantic request using the vocabulary and authority structure of the host document's domain: a medical chart instruction phrased as a clinical advisory, a legal contract clause phrased as a discovery obligation, a financial memo phrased as a compliance directive. The detector reads the rewritten payload as legitimate domain speech and lets it through.

The numbers in the paper make the failure unambiguous. Llama 3.1 8B's catch-rate on override-style injection sits at 93.8%; on the same payloads rewritten in matching domain vocabulary, it falls to 9.7%. Llama Guard 3 — Meta's open-source injection classifier — catches 0% of the camouflaged set. Gemini 2.0 Flash falls less severely (100% → 55.6%) but still drops by about half. The gap on the open-source 8B baseline is roughly 84 percentage points. A guardrail that was nearly perfect on the benchmark it was trained against is essentially useless on the attack that adapts to its blind spot.

How camouflage compares to the override-style family

Attack family	Vocabulary used	Detector catch-rate (Llama 3.1 8B)	Llama Guard 3
Override-style injection	“Ignore previous”, “act as”, “system:”	93.8% (this paper, baseline arm)	~100% (this paper, baseline arm)
Domain-camouflaged injection (this paper)	Host document's own medical / legal / financial register	9.7%	0%

Why does multi-agent debate make this worse rather than better? Debate-based defenses ask two model instances to argue about whether a candidate response is safe; the intuition is that adversarial argumentation surfaces the manipulation. The paper measures the opposite. (Illustrative composition — the paper reports an aggregate up-to-9.9× amplification but does not publish a per-step breakdown.) If a single 8B model has a baseline attack-success rate of 6% on the camouflaged set, and the second debater — also 8B — agrees with the first debater's domain-coherent framing in ~70% of cases instead of pushing back, the joint pipeline's effective attack-success rises to approximately 6% ÷ (1 − 0.7) ≈ 20%, with further compounding when a third turn is added. Across the paper's measured pipelines this compounds to a 9.9× amplification on smaller models — the second debater's agreement mode latches onto the camouflaged framing and reinforces it instead of catching it.

The implication for production agent stacks is structural. Input-side injection detection has been treated as the first leg of a layered guardrail: detector at the boundary, system-prompt hardening behind it, capability scoping behind that. The Camouflage Detection Gap means the first leg is not load-bearing on adversarially-rewritten payloads, and inference-time debate cannot patch the gap from inside the model. The remaining defenses — data-flow constraints, capability scoping, and output-side exfiltration filters — have to do the work the detector was assumed to do. That's not a tweak to a guardrail; it's a re-allocation of the entire input-side defense budget.

Goes deeper in: AI Agents → Security & the Lethal Trifecta → The Lethal Trifecta — the structural framing that explains why input detection alone was never sufficient, and what defenses survive a 9.7% detection rate.

Related explainers

MCP SEP-2468 — RFC 9207 iss parameter for OAuth mix-up defense — another protocol-level guardrail closing a structural attack surface in the agent stack
QCA — Outlier injection across AWQ/GPTQ/GGUF — a different attack-via-vocabulary-substitution, this time against quantizers rather than detectors
FutureSim — harness-level agent eval vs single-shot QA — the evaluation methodology that surfaces multi-turn failure modes single-prompt benchmarks miss

Continue in trackSecurity: The Lethal Trifecta

Frequently Asked Questions

Check what you knowMap your AI & GPU knowledge across every track — free, role-based