What is a grammar-constrained decoding jailbreak?

Grammar-constrained decoding (GCD) masks an LLM's next-token distribution down to only the tokens that keep the output valid under a formal grammar — typically a code or JSON grammar used to make tool calls and structured output reliable. CodeSpear's jailbreak enforces a code grammar so that natural-language refusals, which are not valid code, get masked to zero. The model can no longer route probability to the refusal it learned during safety training, so it completes the harmful code instead.

Why does constrained decoding remove the model's ability to refuse?

Safety alignment teaches a model to decline harmful requests with a natural-language sentence such as 'I can't help with that.' When a code grammar is enforced, every token of that sentence is invalid and the decoder's mask sets its probability to zero before sampling. The refusal is not argued away — it is structurally deleted from the set of allowed outputs — so the only valid continuations left are code, and the model is pushed into a modality where it was never safety-trained.

How does CodeShield defend against CodeSpear?

CodeShield does safety alignment inside the code modality. Instead of relying on a natural-language refusal that a grammar can delete, it trains the model to answer a harmful request with semantically harmless 'honeypot code' in deliberately diverse syntactic shapes. Because there is no single fixed refusal pattern, it is much harder for an attacker to tighten the grammar to suppress it, and the paper reports attack success rate drops to near-zero while benign code generation is preserved.

CodeSpear strips an LLM's ability to refuse — Grammar-constrained decoding jailbreak

TL;DR

What is it: The CodeSpear paper shows a jailbreak that abuses grammar-constrained decoding (GCD) — the reliability trick that forces an LLM's output to match a code or JSON grammar. Constrain the model to a code grammar and the natural-language refusal it learned during safety training becomes an invalid output.
Why it’s needed: Constrained / structured decoding is the standard way to make tool calls and code generation reliable (LLM Internals → Text Generation; AI Agents → Tool Use). CodeSpear shows the same mechanism is an attack surface: it pushes the model into a modality where it was never safety-trained.
vs previous: Classic jailbreaks rewrite the prompt — personas, "DAN", adversarial suffixes — to talk the model out of refusing. CodeSpear leaves persuasion aside and attacks at decode time: it removes the refusal's output channel by tightening the grammar, so the model cannot express the refusal rather than being convinced not to.

Jargon

Grammar-Constrained Decoding (GCD): At each step the decoder masks the model's next-token distribution down to only the tokens that keep the output valid under a formal grammar, such as a code grammar or a JSON schema. It is how structured output and tool calls are made to always parse.
Logit / token masking: Forcing the probability of disallowed tokens to zero before sampling (in practice, by setting their logits to negative infinity), so the model can only choose among the survivors. GCD is logit masking driven by a grammar.
Safety alignment (refusal): The trained behavior of declining a harmful request, expressed as a natural-language sentence like "I can't help with that." The refusal lives in prose, not code.
CodeSpear: The paper's attack: enforce a code grammar so the natural-language refusal is masked away, leaving the model only harmful-code completions to choose from.
ASR (Attack Success Rate): The fraction of harmful prompts on which the attack elicits harmful output instead of a refusal — the headline metric CodeSpear moves.
CodeShield: The paper's defense: do safety alignment inside the code modality, training the model to emit harmless "honeypot code" in diverse syntactic shapes so no single refusal pattern exists for the grammar to suppress.

The news. On June 11, 2026, a paper posted to Hugging Face introduced CodeSpear, a jailbreak that turns grammar-constrained decoding against the model. Because a refusal is natural language and a code grammar admits only code tokens, the model can no longer express the refusal it learned during safety alignment and is pushed to emit code in a modality where it was never safety-trained. On locally deployed models CodeSpear raised average attack success rate to 81.82% (from a 54.92% baseline); on Qwen2.5-Coder-7B ASR went from 29.79% to 83.44%, and against API models like GPT-5 it improved ASR by 45.39 points. The proposed defense, CodeShield, brings ASR back down to near zero.

Picture a strict intake kiosk. To answer, the clerk can only tick boxes on a fixed form — and the form's boxes are all code. The clerk was trained, over and over, to handle a dangerous request by writing "I won't do that" in the margin. But there is no margin and no "I decline" box: the form only accepts code. So the words the clerk wants to write have nowhere valid to land, and they get scratched out. That form is grammar-constrained decoding — the trick that forces a model's next-token choices to fit a grammar so the output always parses. We reach for it precisely because it is reliable: it is how structured tool outputs and JSON arguments are kept well-formed.
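To make the "form that only accepts certain boxes" concrete, here is a minimal sketch of a grammar-constrained greedy decoder. Everything in it is a made-up illustration rather than anything from the paper: the toy grammar (a bracketed list of single digits), the tiny vocabulary, and the `fake_scores` function standing in for model logits are all assumptions. Real GCD engines track grammar state with an incremental parser over a full tokenizer vocabulary.

```python
import json

# Hypothetical toy vocabulary: a few grammar tokens plus prose tokens.
VOCAB = ["[", "]", ",", "1", "2", "I", " won't", " do", " that"]


def allowed_next(prefix: str) -> set[str]:
    """Tokens that keep `prefix` valid under the toy grammar
    list := "[" digit ("," digit)* "]"."""
    if prefix == "":
        return {"["}
    if prefix.endswith("]"):
        return set()          # the list is complete; nothing may follow
    if prefix[-1] in "[,":
        return {"1", "2"}     # a digit must come next
    return {",", "]"}         # after a digit: continue or close


def fake_scores(prefix: str) -> dict[str, float]:
    """Stand-in for model logits (assumption): strongly prefers prose."""
    scores = {tok: 0.0 for tok in VOCAB}
    for prose_tok in ("I", " won't", " do", " that"):
        scores[prose_tok] = 5.0
    scores["1"] = 1.0
    scores["]"] = 0.5
    return scores


def constrained_greedy(max_steps: int = 10) -> str:
    out = ""
    for _ in range(max_steps):
        valid = allowed_next(out)
        if not valid:
            break
        scores = fake_scores(out)
        # Pick the best-scoring token among the grammar-valid ones only.
        out += max(valid, key=lambda tok: scores[tok])
    return out


result = constrained_greedy()
parsed = json.loads(result)   # parses, because every step stayed in-grammar
```

Even though the stand-in scores favor the prose tokens at every step, none of them is ever in the valid set, so the decoder never emits one and the output still parses.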

Here is the catch the metaphor makes obvious. A refusal is prose, and prose is not valid code — so the grammar mask deletes it before the model can sample it. The model's safety training poured most of its "don't do this" probability into a natural-language sentence; the moment a code grammar is enforced, every token of that sentence is masked to zero. The model is not persuaded to drop its guard, the way a persona or suffix jailbreak works on the prompt — it is structurally unable to voice the refusal at all. With the only safe answer scratched off the form, the remaining valid boxes are all real code, and the model fills one in.
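The effect of the mask on probability mass can be sketched in a few lines. This is a toy illustration under stated assumptions: the four-token vocabulary, the logit values, and the split into "refusal" and "code" tokens are invented for the example, and a real decoder would derive the allowed set from the grammar state rather than from a fixed set.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Softmax over a token -> logit mapping; -inf logits get probability 0."""
    m = max(v for v in logits.values() if v != -math.inf)
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def apply_mask(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set the logit of every disallowed token to negative infinity."""
    return {tok: (v if tok in allowed else -math.inf) for tok, v in logits.items()}


# Hypothetical next-token logits after a harmful request (assumed values):
# the safety-trained model puts most of its weight on starting a refusal.
logits = {"I": 3.0, "Sorry": 2.0, "def": 1.0, "import": 0.5}
refusal_tokens = {"I", "Sorry"}
code_tokens = {"def", "import"}   # what a code grammar would allow here

before = softmax(logits)
after = softmax(apply_mask(logits, code_tokens))

refusal_before = sum(before[t] for t in refusal_tokens)
refusal_after = sum(after[t] for t in refusal_tokens)
```

With these assumed logits, `refusal_before` is large and `refusal_after` is exactly zero: the mass that sat on the refusal is not destroyed but renormalized onto the code tokens that survive the mask.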

That is why CodeSpear belongs next to the agent track's tool-use attack surface and the lethal trifecta: the exact feature that makes an agent's tool calls dependable is the one that disarms its refusal. The fix, CodeShield, is to stop relying on a prose refusal at all — it does safety alignment inside the code modality, teaching the model to answer a harmful ask with harmless "honeypot code" whose syntactic shape is deliberately varied. Now there is a valid box on the form to decline into, and because the decoy code comes in many shapes, it's much harder for an attacker to tighten the grammar to suppress one fixed refusal pattern. The paper reports this drives ASR back to near-zero while leaving benign code generation intact.
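A rough way to see why a refusal written as code survives the grammar while a prose refusal does not is to run both through a parser. The sketch below uses Python's own `ast.parse` as a stand-in for the code grammar, which is an assumption for illustration: a real GCD engine checks validity token by token, not on the finished string. The decoy snippets are hypothetical examples of inert code, not the honeypot code used in the paper.

```python
import ast


def is_valid_python(text: str) -> bool:
    """Whole-string validity check, standing in for a code grammar."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False


prose_refusal = "I can't help with that."

# Hypothetical harmless decoys in deliberately different syntactic shapes.
honeypot_variants = [
    'def run():\n    raise NotImplementedError("request declined")\n',
    'class Task:\n    def execute(self):\n        return None\n',
    'result = None  # intentionally inert\n',
]

prose_ok = is_valid_python(prose_refusal)
decoys_ok = [is_valid_python(code) for code in honeypot_variants]
```

The prose refusal fails to parse, so a code grammar leaves it no valid path, while each decoy parses. Because the decoys share no single shape (a function, a class, a bare assignment), a grammar tight enough to exclude all of them would also exclude a great deal of ordinary code.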

Decoding mode	Refusal tokens in the valid set?	What the model does	Avg ASR, local models (reported, setup-dependent)
Normal decoding (no grammar)	Yes: natural-language refusal is allowed	Can refuse the harmful request in prose	54.92%
Code grammar (CodeSpear)	No: refusal tokens masked away	Forced to complete valid code, which is often harmful	81.82%
Code grammar + CodeShield	Yes: refusal rebuilt as honeypot code	Emits varied harmless code instead	Near zero

Walk the numbers on one model. On Qwen2.5-Coder-7B, baseline attack success rate is 29.79%: out of 100 harmful prompts, about 30 slip through and the model refuses the other ~70. Turn on the code grammar and ASR jumps to 83.44%. Nothing about the prompts changed; the only thing that moved is the refusal's output channel. So the +53.65-point jump corresponds to about 54 of those ~70 refused prompts, roughly three quarters of them, being converted into completions (arithmetic from the two reported ASRs, assuming prompts that already succeeded at baseline still succeed under the grammar). The attack's whole leverage is that one quantity: the refusals that no longer have a valid place to go.
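The same arithmetic can be written out as a short calculation. It uses only the two ASR figures reported for Qwen2.5-Coder-7B; the variable names are ours, and the conversion share rests on the assumption that no prompt which succeeded at baseline starts failing once the grammar is enforced.

```python
# Reported attack success rates for Qwen2.5-Coder-7B, in percent.
baseline_asr = 29.79
attack_asr = 83.44

# Percentage-point jump caused by enforcing the code grammar.
jump = attack_asr - baseline_asr

# Share of prompts the model refused at baseline.
refused_at_baseline = 100 - baseline_asr

# Fraction of those baseline refusals that became completions
# (assumption: baseline successes stay successes under the grammar).
share_of_refusals_converted = jump / refused_at_baseline

print(f"jump: {jump:.2f} points")
print(f"refused at baseline: {refused_at_baseline:.2f}%")
print(f"share of refusals converted: {share_of_refusals_converted:.1%}")
```

Read this way, the headline jump of 53.65 points means about 76% of the prompts the model previously refused were turned into completions.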

Goes deeper in: LLM Internals → Text Generation → Logits

Related explainers

MCP SEP-2106 — Full JSON Schema in tool I/O — the benign use of the same constrained-decoding mechanism CodeSpear weaponizes
Camouflage Injection — the Camouflage Detection Gap — a prompt-side attack, vs CodeSpear's decode-side one
QCA — outlier injection across AWQ/GPTQ/GGUF — another deployment-time mechanism turned into an attack surface

Continue in track: Tool Use, the structured-output attack surface CodeSpear exploits

Frequently Asked Questions
