WASH — washing out a text watermark by averaging models

[Figure: Averaging a few models washes out each one's text watermark. Models A, B, and C each carry a watermark boost toward secret tokens (shown in amber). Averaging them (÷ 3) gives an output distribution in which the boosts are spread thin and uniform, recovering the clean distribution. The detection z-score falls from 5–300 (watermark detected) to below 2 (washed out), under the detection threshold of 4, by averaging just 3 models, and text quality improves by about 27.5%.]
learnaivisually.com/ai-explained/wash-model-averaging-watermark-removal

The news. On May 28, 2026, WASH: Watermark Attenuation via Statistical Hybridisation (arXiv:2605.30501) reported that averaging the next-token distributions of 3–5 independent LLMs cancels each model's text watermark. Each provider's perturbation is derived from its own key, which makes the perturbations statistically independent; the authors prove that averaging therefore recovers the unwatermarked distribution up to a second-order error term. They also introduce WASH to handle the vocabulary and tokenization mismatches that normally block cross-model ensembling. Their headline: detection z-scores fall from 5–300 to below 2 (threshold 4) while text quality improves.

Picture a diving competition with a panel of judges. Each judge has been quietly told to nudge their score a hair toward their own country's diver — that secret nudge is the watermark. If you only ever saw one judge's card, you could audit it and spot the bias. But every judge tilts a different way, so the moment you average the whole panel, the nudges point in every direction and cancel: you're left with the honest score, and the auditor looking for any single judge's bias finds nothing.

That is exactly what a distributional watermark is and why WASH defeats it. A provider watermarks its model by tilting the next-token scores toward a secret green-list of tokens before a word is sampled. The watermark is invisible to a reader but shows up statistically: watermarked text over-uses green tokens, and a detector flags it with a high z-score. The catch is that each provider builds its green-list from its own secret key, so different models' watermarks are effectively independent. Average their output distributions and any one green-list is over-represented in only a fraction of the models — the boosts spread thin and uniform, no concentrated pattern survives, and the z-score collapses. WASH's real contribution is the plumbing: heterogeneous models use different vocabularies and tokenizations, and WASH aligns them so the distributions can actually be averaged.
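The dilution effect can be sketched in a few lines of Python. This is a toy model under loud assumptions, not the WASH method: all models share one vocabulary and the same clean logits (so the alignment step WASH contributes is skipped), each green-list is a fixed pseudo-random half of the vocabulary seeded by a key (real schemes typically re-seed per context), and `VOCAB_SIZE`, `DELTA`, and the key values are made-up illustrative parameters.

```python
import math
import random

VOCAB_SIZE = 1000  # toy vocabulary size (assumption)
DELTA = 2.0        # logit boost added to green tokens (assumption)
GAMMA = 0.5        # fraction of the vocabulary on each green-list


def green_list(key):
    """Pseudo-random half of the vocabulary, determined by a secret key."""
    rng = random.Random(key)
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def watermarked_probs(logits, key):
    """Tilt the logits toward this key's green-list, then normalise."""
    green = green_list(key)
    return softmax([x + DELTA if i in green else x for i, x in enumerate(logits)])


def average(dists):
    """Element-wise mean of several next-token distributions."""
    return [sum(column) / len(dists) for column in zip(*dists)]


def green_mass(probs, key):
    """Probability mass that falls on one key's green-list."""
    green = green_list(key)
    return sum(p for i, p in enumerate(probs) if i in green)


# One shared set of clean logits stands in for every model (toy simplification).
rng = random.Random(0)
clean_logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
keys = [101, 202, 303]  # hypothetical provider keys

single = watermarked_probs(clean_logits, keys[0])
mixed = average([watermarked_probs(clean_logits, k) for k in keys])

# Measure everything from the point of view of provider 0's detector.
mass_clean = green_mass(softmax(clean_logits), keys[0])
mass_single = green_mass(single, keys[0])
mass_mixed = green_mass(mixed, keys[0])

print(f"clean:  {mass_clean:.3f}")
print(f"single: {mass_single:.3f}")
print(f"mixed:  {mass_mixed:.3f}")
```

In this toy, only one of the three averaged models tilts toward provider 0's green-list, so the green mass that provider's detector sees moves from the single-model value back toward the clean baseline. It shows the direction of the effect only and does not reproduce the paper's z-score figures.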

How detection holds up as you add models

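| Output                   | Detection z-score         | TPR at 5% FPR | Verdict      |
|--------------------------|---------------------------|---------------|--------------|
| Single watermarked model | ~5–300 (varies by scheme) | high          | detected     |
| Average of 3 models      | below ~2                  | below ~50%    | not detected |
| Average of 5 models      | below ~2                  | below ~50%    | not detected |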

Figures from the WASH paper (validated across 6 watermarking schemes and 3 LLMs); the detection threshold is z = 4. All values are approximate and setup-dependent.

Why does the z-score fall so far? Walk the standard green-list math (illustrative numbers). A green-list scheme splits the vocabulary 50/50 into green and red, so clean text lands on roughly 100 green tokens in a 200-token passage, with a standard deviation of √(200 × 0.5 × 0.5) ≈ 7.07. A watermarked single model over-produces green tokens, say 134 of 200: that is a z-score of (134 − 100) / 7.07 ≈ 4.8, above the threshold of 4, so it's flagged. Now average three independent models. The other models' boosts are unrelated to any one detector's green-list, so that detector's signal is diluted and the green fraction falls back toward 50%: about 104 of 200, a z-score of (104 − 100) / 7.07 ≈ 0.57, far below 4. The watermark hasn't been edited or scrubbed; it has simply been averaged into the noise floor.
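The arithmetic above can be checked with a short Python sketch. The helper name `green_z_score` and the constant `THRESHOLD` are illustrative choices, not taken from the paper; the formula is the usual one-proportion z-test, with `gamma` as the green fraction of the vocabulary.

```python
import math

THRESHOLD = 4.0  # detection threshold quoted in the article


def green_z_score(green_count, total, gamma=0.5):
    """z-score of the observed green-token count against the no-watermark null."""
    expected = gamma * total                        # 100 for 200 tokens at 50/50
    std = math.sqrt(total * gamma * (1.0 - gamma))  # about 7.07 for 200 tokens
    return (green_count - expected) / std


z_single = green_z_score(134, 200)  # illustrative single watermarked model
z_avg = green_z_score(104, 200)     # illustrative average of three models

print(f"single model: z = {z_single:.2f}, flagged = {z_single > THRESHOLD}")
print(f"3-model avg:  z = {z_avg:.2f}, flagged = {z_avg > THRESHOLD}")
```

The counts 134 and 104 are the article's illustrative numbers, so the sketch confirms the arithmetic rather than any measured result.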

This lands as a pointed counterpoint to the industry's provenance push — Google's SynthID and the C2PA content-credentials standard both lean on watermarks to answer "did a machine make this?" WASH's claim isn't that watermarking is useless; it's that distributional text watermarks assume single-model access, and that assumption broke the day users could route a prompt through several models. As the authors put it: "when users access multiple models (today's reality), watermarks trivially fail."
