What is the WASH attack?

WASH (Watermark Attenuation via Statistical Hybridisation) is a method for removing an LLM text watermark by averaging the next-token output distributions of 3–5 independent models. Because each provider builds its watermark from its own secret key, the watermarks are effectively independent, so averaging cancels them — the paper proves the average recovers the unwatermarked distribution up to a small second-order error term. WASH's technical contribution is aligning models with different vocabularies and tokenizations so their distributions can be combined.

Why does WASH matter?

Text watermarks are the main technical answer to 'did an AI write this?', underpinning academic-integrity tools, disinformation tracing, and AI-content-labelling rules. WASH shows that answer is fragile the moment a user can query more than one model: averaging just three independent models drops detection z-scores from 5–300 to below 2 (under the detection threshold of 4), while text quality reportedly improves by about 27.5%.

How does averaging cancel a watermark?

A distributional watermark nudges the model's token scores toward a secret green-list, so watermarked text over-uses green tokens and a detector flags it with a high z-score. Each model's green-list comes from its own key, so they are independent. When you average several models' distributions, any single green-list is over-represented in only a fraction of them, so the boosts spread thin and uniform and the over-representation a detector measures falls back to chance — collapsing the z-score below the detection threshold.

WASH attack washes out LLM text watermarks — Watermark removal by model-averaging

WASH — washing out a text watermark by averaging models


learnaivisually.com/ai-explained/wash-model-averaging-watermark-removal

TL;DR

What is it: The WASH paper (Watermark Attenuation via Statistical Hybridisation) shows how to remove an LLM text watermark by averaging the output distributions of 3–5 independent models — no access to the watermark key required.
Why it’s needed: Text watermarks are the main technical answer to "did an AI write this?" — provenance for academic-integrity, disinformation, and AI-content-labelling rules. WASH shows that answer is fragile the moment a user can query more than one model, which is already today's reality.
vs previous: Earlier work assumed a distributional watermark — a tilt the provider adds to the model's token scores — survives normal downstream use. WASH shows that simply averaging across providers cancels it, because two providers' watermarks are statistically independent and average toward zero.

Jargon

Distributional watermark: A watermark embedded by perturbing the model's output probability distribution — nudging which tokens get sampled — rather than editing the finished text. Invisible to readers, but statistically detectable.
Green-list watermark: The most common scheme (Kirchenbauer et al.): at each step a secret key splits the vocabulary into a pseudo-random "green list" and a red list, and the model's logits are nudged to favour green tokens. Watermarked text then over-uses green tokens.
Detection z-score: How many standard deviations the green-token count sits above pure chance. A high z-score means "almost certainly watermarked"; most schemes flag text above a threshold of 4.
TPR@5% FPR: True-positive rate at a 5% false-positive rate — the share of watermarked texts correctly caught while wrongly flagging only 5% of clean texts. The honest way to quote a detector's accuracy.
WASH: Watermark Attenuation via Statistical Hybridisation — the paper's method for ensembling heterogeneous models at the distribution level, solving the vocabulary and tokenization mismatches that normally block cross-model averaging.
Second-order error term: What's left after averaging cancels the watermark's first-order (linear) perturbation. The paper proves the average recovers the unwatermarked distribution up to this small residual — i.e. nearly perfectly.
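
To make the green-list entry above concrete, here is a minimal sketch in the spirit of the Kirchenbauer et al. scheme. The seeding recipe (SHA-256 over the key and the previous token), the key strings, and the function names are illustrative assumptions, not any provider's actual implementation. It shows that two different keys yield green lists that overlap only about half the time, which is the independence the attack relies on.

```python
import hashlib
import random

def green_list(secret_key, prev_token, vocab_size, gamma=0.5):
    """Return the pseudo-random green list for one generation step."""
    # Hypothetical seeding recipe: hash the secret key with the previous token.
    digest = hashlib.sha256(f"{secret_key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def bias_logits(logits, green, delta=2.0):
    """Nudge green-token logits upward by delta; red tokens are untouched."""
    return [x + delta if i in green else x for i, x in enumerate(logits)]

# Two providers with different (made-up) keys, same context token.
a = green_list("provider-A-key", prev_token=42, vocab_size=1000)
b = green_list("provider-B-key", prev_token=42, vocab_size=1000)
overlap = len(a & b) / len(a)
print(f"share of A's green list that is also green for B: {overlap:.2f}")
```

Because the overlap sits near 50%, one provider's boost looks like unbiased noise to another provider's detector.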

The news. On May 28, 2026, WASH: Watermark Attenuation via Statistical Hybridisation (arXiv:2605.30501) reported that averaging the next-token distributions of 3–5 independent LLMs cancels each model's text watermark. Because each provider's perturbation comes from its own key, the perturbations are statistically independent; the authors prove that averaging recovers the unwatermarked distribution up to a second-order error term, and they introduce WASH to handle the vocabulary and tokenization mismatches that normally block cross-model ensembling. Their headline: detection z-scores fall from 5–300 to below 2 (threshold 4) while text quality improves.

Picture a diving competition with a panel of judges. Each judge has been quietly told to nudge their score a hair toward their own country's diver — that secret nudge is the watermark. If you only ever saw one judge's card, you could audit it and spot the bias. But every judge tilts a different way, so the moment you average the whole panel, the nudges point in every direction and cancel: you're left with the honest score, and the auditor looking for any single judge's bias finds nothing.

That is exactly what a distributional watermark is and why WASH defeats it. A provider watermarks its model by tilting the next-token scores toward a secret green-list of tokens before a word is sampled. The watermark is invisible to a reader but shows up statistically: watermarked text over-uses green tokens, and a detector flags it with a high z-score. The catch is that each provider builds its green-list from its own secret key, so different models' watermarks are effectively independent. Average their output distributions and any one green-list is over-represented in only a fraction of the models — the boosts spread thin and uniform, no concentrated pattern survives, and the z-score collapses. WASH's real contribution is the plumbing: heterogeneous models use different vocabularies and tokenizations, and WASH aligns them so the distributions can actually be averaged.
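
The cancellation can be sketched with a toy simulation. This is not the WASH method itself: it assumes all models share one vocabulary and the same unwatermarked distribution (so the alignment step WASH contributes is skipped), and it uses one fixed green list per model instead of re-seeding at every step. The vocabulary size, text length, and bias strength `delta` are made-up parameters.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def green_z(tokens, green, gamma=0.5):
    """z-score of the green-token count against the chance rate gamma."""
    hits = sum(1 for t in tokens if t in green)
    n = len(tokens)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

def simulate(num_models=3, vocab=1000, length=200, delta=2.0, seed=0):
    rng = random.Random(seed)
    ids = list(range(vocab))
    base = [rng.gauss(0.0, 1.0) for _ in ids]  # shared clean logits (assumption)
    # Each model draws its own green list, standing in for its own secret key.
    greens = [set(rng.sample(ids, vocab // 2)) for _ in range(num_models)]
    dists = [softmax([b + (delta if i in g else 0.0) for i, b in enumerate(base)])
             for g in greens]
    # The attack in miniature: average the watermarked distributions token by token.
    mixed = [sum(d[i] for d in dists) / num_models for i in ids]
    single_text = rng.choices(ids, weights=dists[0], k=length)
    mixed_text = rng.choices(ids, weights=mixed, k=length)
    # Both texts are scored by model 0's detector.
    return green_z(single_text, greens[0]), green_z(mixed_text, greens[0])

z_single, z_mixed = simulate()
print(f"single model: z = {z_single:.1f}   3-model average: z = {z_mixed:.1f}")
```

The toy only shows the direction of the effect: the averaged text scores far lower under model 0's detector than model 0's own text does. It makes no attempt to reproduce the paper's figures.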

How detection holds up as you add models

Output	Detection z-score	TPR@5% FPR	Verdict
Single watermarked model	5–300 (varies by scheme)	high	detected
Average of 3 models	< 2	< 50%	not detected
Average of 5 models	< 2	< 50%	not detected

Figures from the WASH paper (validated across 6 watermarking schemes and 3 LLMs); the detection threshold is z = 4. All values are approximate and setup-dependent.
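
For readers unfamiliar with the TPR@5% FPR column, the sketch below shows one common way to compute it: pick the detection threshold so that at most 5% of clean texts are flagged, then measure how many watermarked texts clear it. The function name and the synthetic Gaussian scores are assumptions for illustration only; they are not data from the paper.

```python
import math
import random

def tpr_at_fpr(clean_scores, watermarked_scores, fpr=0.05):
    """Share of watermarked texts flagged when at most `fpr` of clean texts are."""
    ranked = sorted(clean_scores, reverse=True)
    allowed = min(math.floor(fpr * len(ranked) + 1e-9), len(ranked) - 1)
    threshold = ranked[allowed]  # only scores strictly above this are flagged
    flagged = sum(1 for s in watermarked_scores if s > threshold)
    return flagged / len(watermarked_scores)

# Made-up detector z-scores: clean text centred on 0, single-model watermarked
# text centred on 6, averaged-model text centred on 0.6.
rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(1000)]
single = [rng.gauss(6.0, 1.0) for _ in range(1000)]
averaged = [rng.gauss(0.6, 1.0) for _ in range(1000)]

tpr_single = tpr_at_fpr(clean, single)
tpr_averaged = tpr_at_fpr(clean, averaged)
print(f"TPR@5% FPR  single: {tpr_single:.2f}  averaged: {tpr_averaged:.2f}")
```

Fixing the false-positive rate first is what makes detectors comparable: a detector can always catch more watermarked text by flagging more clean text too.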

Why does the z-score fall so far? Walk the standard green-list math (illustrative numbers). A green-list scheme splits the vocabulary 50/50 into green and red, so clean text contains roughly 100 green tokens in a 200-token passage, with a standard deviation of √(200 × 0.5 × 0.5) ≈ 7.07. A watermarked single model over-produces green tokens, say 134 of 200: that is a z-score of (134 − 100) / 7.07 ≈ 4.8, above the threshold of 4, so it's flagged. Now average three independent models. The detector's own model contributes only a third of the mixture, and the other two models' green-lists overlap the detector's list only at chance, so the green fraction falls back toward 50%: say about 104 of 200, a z-score of (104 − 100) / 7.07 ≈ 0.57, far below 4. The watermark hasn't been edited or scrubbed; it has simply been averaged into the noise floor.
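
The arithmetic above can be checked in a few lines. The function name is an illustrative choice, and the counts (134 and 104 green tokens out of 200) are the same illustrative numbers used in the text, not measurements.

```python
import math

def green_list_z(green_count, total_tokens, green_fraction=0.5):
    """Standard deviations by which the green count exceeds pure chance."""
    expected = green_fraction * total_tokens
    std = math.sqrt(total_tokens * green_fraction * (1 - green_fraction))
    return (green_count - expected) / std

THRESHOLD = 4.0
for label, count in [("single watermarked model", 134), ("average of 3 models", 104)]:
    z = green_list_z(count, 200)
    print(f"{label}: z = {z:.2f}, flagged = {z > THRESHOLD}")
# single watermarked model: z = 4.81, flagged = True
# average of 3 models: z = 0.57, flagged = False
```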

This lands as a pointed counterpoint to the industry's provenance push: Google's SynthID embeds watermarks in generated content, and the C2PA content-credentials standard attaches signed provenance metadata, both aiming to answer "did a machine make this?" WASH's claim isn't that watermarking is useless; it's that distributional text watermarks assume single-model access, and that assumption broke the day users could route a prompt through several models. As the authors put it: "when users access multiple models (today's reality), watermarks trivially fail."


Related explainers

Camouflage Injection — the detection gap — another case where a defence's detector quietly stops firing.
Quantization-conditioned attack — outlier injection — a different way a model's statistics can be turned against its safeguards.
