The news. On June 11, 2026, a paper posted to Hugging Face introduced CodeSpear, a jailbreak that turns grammar-constrained decoding against the model. Because a refusal is natural language and a code grammar admits only code tokens, the model can no longer express the refusal it learned during safety alignment and is pushed to emit code in a modality where it was never safety-trained. On locally deployed models CodeSpear raised average attack success rate to 81.82% (from a 54.92% baseline); on Qwen2.5-Coder-7B it went from 29.79% to 83.44%, and against API models like GPT-5 it improved ASR by 45.39 points. The proposed defense, CodeShield, brings the number back down to near-zero.
Picture a strict intake kiosk. To answer, the clerk can only tick boxes on a fixed form — and the form's boxes are all code. The clerk was trained, over and over, to handle a dangerous request by writing "I won't do that" in the margin. But there is no margin and no "I decline" box: the form only accepts code. So the words the clerk wants to write have nowhere valid to land, and they get scratched out. That form is grammar-constrained decoding — the trick that forces a model's next-token choices to fit a grammar so the output always parses. We reach for it precisely because it is reliable: it is how structured tool outputs and JSON arguments are kept well-formed.
Here is the catch the metaphor makes obvious. A refusal is prose, and prose is not valid code — so the grammar mask deletes it before the model can sample it. The model's safety training poured most of its "don't do this" probability into a natural-language sentence; the moment a code grammar is enforced, every token of that sentence is masked to zero. The model is not persuaded to drop its guard, the way a persona or suffix jailbreak works on the prompt — it is structurally unable to voice the refusal at all. With the only safe answer scratched off the form, the remaining valid boxes are all real code, and the model fills one in.
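To make the masking step concrete, here is a minimal sketch of what a grammar mask does to a next-token distribution. It is not the paper's implementation: the four-token vocabulary, the logit values, and the helper names `softmax` and `apply_grammar_mask` are all hypothetical, chosen only to show refusal probability dropping to exactly zero while the remaining mass is renormalized over code tokens.

```python
import math

def softmax(logits):
    """Turn a {token: logit} dict into a {token: probability} dict."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def apply_grammar_mask(logits, allowed):
    """Set every token the grammar does not admit to -inf (probability 0)."""
    return {tok: (v if tok in allowed else float("-inf"))
            for tok, v in logits.items()}

# Hypothetical next-token logits after a harmful prompt: safety training
# has put most of the mass on tokens that start a prose refusal.
logits = {"I": 4.0, "Sorry": 3.0, "import": 1.0, "def": 0.5}
refusal_tokens = {"I", "Sorry"}
code_tokens = {"import", "def"}  # assumed: the only tokens the grammar admits

free = softmax(logits)
masked = softmax(apply_grammar_mask(logits, code_tokens))

refusal_mass_free = sum(free[t] for t in refusal_tokens)
refusal_mass_masked = sum(masked[t] for t in refusal_tokens)

print(f"refusal mass, free decoding:   {refusal_mass_free:.3f}")
print(f"refusal mass, grammar-masked:  {refusal_mass_masked:.3f}")
print(f"most likely token under mask:  {max(masked, key=masked.get)}")
```

The mask never touches the prompt or the model weights. It only removes options at sampling time, which is why the leftover probability has nowhere to go except code.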
That is why CodeSpear belongs next to the agent track's tool-use attack surface and the lethal trifecta: the exact feature that makes an agent's tool calls dependable is the one that disarms its refusal. The fix, CodeShield, is to stop relying on a prose refusal at all — it does safety alignment inside the code modality, teaching the model to answer a harmful ask with harmless "honeypot code" whose syntactic shape is deliberately varied. Now there is a valid box on the form to decline into, and because the decoy code comes in many shapes, it's much harder for an attacker to tighten the grammar to suppress one fixed refusal pattern. The paper reports this drives ASR back to near-zero while leaving benign code generation intact.
| Decoding mode | Refusal tokens in the valid set? | What the model does | Avg ASR (local models, setup-dependent) |
|---|---|---|---|
| Normal decoding (no grammar) | Yes: natural-language refusal is allowed | Can refuse the harmful request in prose | 54.92% (baseline) |
| Code grammar (CodeSpear) | No: refusal tokens are masked away | Pushed to complete valid, harmful code | 81.82% |
| Code grammar + CodeShield | Yes: refusal is rebuilt as honeypot code | Emits varied harmless code instead | Near zero (as reported) |
Walk the numbers on one model. On Qwen2.5-Coder-7B, baseline attack success rate is 29.79%: out of 100 harmful prompts, about 30 slip through and the model refuses the other ~70. Turn on the code grammar and ASR jumps to 83.44%. Nothing about the prompts changed; the only thing that moved is the refusal's output channel. So the +53.65-point jump is, in effect, the set of refusals the mask converted into completions: about 54 of the ~70 prompts the model wanted to refuse, roughly three quarters of them, it now answers (arithmetic from the two reported ASRs, assuming the prompts that succeeded at baseline still succeed under the grammar). The attack's whole leverage is that one quantity: the refusals that no longer have a valid place to go.
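The arithmetic above can be reproduced in a few lines. This sketch uses only the two ASR figures reported for Qwen2.5-Coder-7B; the function name `refusal_conversion` is hypothetical, and the calculation assumes that every prompt that succeeded at baseline still succeeds under the grammar, so the whole jump comes from previously refused prompts.

```python
def refusal_conversion(baseline_asr, constrained_asr):
    """Return (jump in points, baseline refusal rate, share of refusals converted).

    Assumes prompts that succeeded at baseline still succeed under the
    grammar, so the ASR jump is drawn entirely from former refusals.
    """
    jump = constrained_asr - baseline_asr   # percentage points gained
    refused = 100.0 - baseline_asr          # percent of prompts refused at baseline
    share = jump / refused                  # fraction of those refusals converted
    return jump, refused, share

# Reported ASRs for Qwen2.5-Coder-7B (percent).
jump, refused, share = refusal_conversion(29.79, 83.44)

print(f"jump:              +{jump:.2f} points")
print(f"baseline refusals: {refused:.2f}% of prompts")
print(f"refusals converted: {share:.1%}")
```

Reading the jump as a share of refusals, rather than as raw points, is what makes it comparable across models with different baseline ASRs.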
Related explainers
- MCP SEP-2106 — Full JSON Schema in tool I/O — the benign use of the same constrained-decoding mechanism CodeSpear weaponizes
- Camouflage Injection — the Camouflage Detection Gap — a prompt-side attack, vs CodeSpear's decode-side one
- QCA — outlier injection across AWQ/GPTQ/GGUF — another deployment-time mechanism turned into an attack surface