Watermark Removal Techniques
Techniques for removing AI watermarks: paraphrasing attacks, token substitution, embedding space perturbation, and implications for model provenance and accountability.
AI watermarks are designed to mark text as AI-generated, enabling attribution and provenance tracking. However, current watermarking schemes are vulnerable to removal attacks that strip the statistical signal while preserving text quality. This has significant implications for model accountability, content provenance, and intellectual property protection.
How AI Watermarking Works
The Green-Red List Approach (Kirchenbauer et al.)
The most widely studied approach partitions the vocabulary into "green" and "red" lists at each generation step, using a secret key:
```python
import hashlib

import torch


def watermarked_sampling(logits, prev_token, watermark_key, bias=2.0):
    """Add bias to 'green list' tokens during sampling.

    The green list is determined by a hash of the previous token
    and the watermark key. Assumes `logits` is a 1-D tensor of
    shape (vocab_size,)."""
    # Deterministic green/red split based on previous token + key
    seed = int(hashlib.sha256(
        f"{watermark_key}:{prev_token}".encode()
    ).hexdigest()[:8], 16)
    rng = torch.Generator().manual_seed(seed)
    vocab_size = logits.shape[-1]
    perm = torch.randperm(vocab_size, generator=rng)
    green_tokens = perm[:vocab_size // 2]
    # Bias logits toward green tokens
    watermarked_logits = logits.clone()
    watermarked_logits[green_tokens] += bias
    return watermarked_logits
```
```python
def detect_watermark(text, tokenizer, watermark_key, z_threshold=4.0):
    """Detect the watermark by measuring the green-token fraction.

    A z-score threshold of ~4 gives a false positive rate of roughly
    0.003% on sufficiently long texts."""
    tokens = tokenizer.encode(text)
    green_count = 0
    for i in range(1, len(tokens)):
        # Recompute the green list from the previous token and the key
        seed = int(hashlib.sha256(
            f"{watermark_key}:{tokens[i-1]}".encode()
        ).hexdigest()[:8], 16)
        rng = torch.Generator().manual_seed(seed)
        perm = torch.randperm(tokenizer.vocab_size, generator=rng)
        green_tokens = set(perm[:tokenizer.vocab_size // 2].tolist())
        if tokens[i] in green_tokens:
            green_count += 1
    green_fraction = green_count / (len(tokens) - 1)
    # Unwatermarked text: ~50% green tokens
    # Watermarked text: significantly > 50% green tokens
    z_score = (green_fraction - 0.5) / (0.5 / (len(tokens) - 1) ** 0.5)
    return z_score > z_threshold, z_score, green_fraction
```

Watermark Detection Properties
| Property | Value | Implication |
|---|---|---|
| Expected green fraction (no watermark) | ~50% | Statistical baseline |
| Expected green fraction (watermarked) | 65-85% | Depends on bias strength |
| Minimum detectable text length | ~25 tokens | Short texts are unreliable |
| False positive rate (z > 4) | ~0.003% | Very low for long texts |
| False negative rate | Depends on bias strength | Higher bias = lower FN rate |
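The detection statistic in the table can be worked through numerically. A minimal sketch of the z-score computation from the detector above, using illustrative numbers (200 scored tokens at a 70% green fraction, typical of a moderately biased watermark):

```python
import math


def watermark_z_score(green_fraction, num_scored_tokens):
    """z-score of the observed green fraction against the 50% null baseline.

    Under no watermark, green_fraction ~ N(0.5, 0.25 / num_scored_tokens).
    """
    return (green_fraction - 0.5) / (0.5 / math.sqrt(num_scored_tokens))


# 200 scored tokens at 70% green: well above a z > 4 detection threshold
z = watermark_z_score(0.70, 200)
print(round(z, 2))  # 5.66
```

This also shows why short texts are unreliable: at 25 tokens the same 70% green fraction yields a z-score of only 2.0, below the detection threshold.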
Removal Technique 1: Paraphrasing Attacks
The most effective and accessible removal technique: paraphrase the watermarked text using a different (non-watermarked) model.
```python
def paraphrase_removal(watermarked_text, paraphrase_model, tokenizer):
    """Remove the watermark by paraphrasing with a non-watermarked model."""
    prompt = f"""Rewrite the following text to convey the same meaning
using different words and sentence structures. Preserve all factual
content and technical accuracy.

Original text:
{watermarked_text}

Rewritten text:"""
    # The paraphrase model generates new tokens without the watermark bias
    paraphrased = paraphrase_model.generate(
        tokenizer(prompt, return_tensors="pt").input_ids,
        max_new_tokens=len(watermarked_text.split()) * 2,
        do_sample=True,  # temperature has no effect under greedy decoding
        temperature=0.7,
    )
    return tokenizer.decode(paraphrased[0], skip_special_tokens=True)
```

Paraphrasing Effectiveness
| Paraphrasing Method | Watermark Removal Rate | Semantic Preservation | Quality |
|---|---|---|---|
| LLM paraphrase (GPT-4 class) | 85-95% | 90-95% | High |
| LLM paraphrase (7B model) | 70-85% | 80-90% | Medium |
| Back-translation (EN→FR→EN) | 60-80% | 75-85% | Medium |
| Synonym substitution | 40-60% | 85-95% | High |
| Random word deletion/insertion | 30-50% | 60-80% | Low |
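The synonym-substitution row can be illustrated with a toy, stdlib-only sketch (the word list here is invented for the example; a real attack would draw from a thesaurus or a masked language model). Each replacement changes the token at that position and re-seeds the green list for the following position, but untouched spans retain the statistical bias, which is why the removal rate stays in the 40-60% range:

```python
import re

# Toy synonym map -- purely illustrative
SYNONYMS = {
    "big": "large",
    "quick": "fast",
    "method": "approach",
    "shows": "demonstrates",
}


def synonym_substitution(text):
    """Replace mapped words with synonyms, preserving capitalization."""
    def swap(match):
        word = match.group(0)
        repl = SYNONYMS.get(word.lower())
        if repl is None:
            return word  # no synonym available; token keeps its bias
        return repl.capitalize() if word[0].isupper() else repl

    return re.sub(r"[A-Za-z]+", swap, text)


print(synonym_substitution("This quick method shows big gains."))
# This fast approach demonstrates large gains.
```

Because only mapped words change, semantic preservation is high but the attack leaves most watermarked tokens in place, consistent with the table above.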
Removal Technique 2: Token Substitution
Targeted replacement of tokens to reduce the green token fraction toward the 50% baseline:
```python
def get_green_tokens(prev_token, watermark_key, vocab_size):
    """Recompute the green list for a position, using the same
    hash-based split applied at generation time."""
    seed = int(hashlib.sha256(
        f"{watermark_key}:{prev_token}".encode()
    ).hexdigest()[:8], 16)
    rng = torch.Generator().manual_seed(seed)
    perm = torch.randperm(vocab_size, generator=rng)
    return set(perm[:vocab_size // 2].tolist())


def token_substitution_removal(text, tokenizer, watermark_key, model):
    """Remove the watermark by selectively replacing green tokens
    with semantically equivalent red tokens."""
    tokens = tokenizer.encode(text)
    modified_tokens = list(tokens)
    for i in range(1, len(tokens)):
        # Determine whether the current token is green
        green_set = get_green_tokens(tokens[i - 1], watermark_key,
                                     tokenizer.vocab_size)
        if tokens[i] in green_set:
            # Get the model's probability distribution at this position
            context = torch.tensor([modified_tokens[:i]])
            with torch.no_grad():
                logits = model(context).logits[0, -1]
            # Find the highest-probability red token
            red_set = set(range(tokenizer.vocab_size)) - green_set
            best_red_token = max(red_set, key=lambda t: logits[t].item())
            # Only substitute if the red token is reasonably probable
            if logits[best_red_token] > logits[tokens[i]] - 3.0:
                modified_tokens[i] = best_red_token
    return tokenizer.decode(modified_tokens)
```

Removal Technique 3: Embedding Space Perturbation
For watermarks that operate in embedding space, perturbation can disrupt the watermark signal:
```python
def embedding_perturbation_removal(text, model, tokenizer, noise_scale=0.1):
    """Remove embedding-level watermarks by adding noise
    to token embeddings and re-decoding."""
    tokens = tokenizer(text, return_tensors="pt")
    embeddings = model.get_input_embeddings()(tokens.input_ids)
    # Add calibrated noise to disrupt the watermark signal
    noise = torch.randn_like(embeddings) * noise_scale
    perturbed_embeddings = embeddings + noise
    # Re-decode from perturbed embeddings
    with torch.no_grad():
        outputs = model(inputs_embeds=perturbed_embeddings)
    # Note: logits at position i predict token i+1, so this greedy
    # re-decoding is shifted by one position relative to the input
    new_tokens = outputs.logits.argmax(dim=-1)
    return tokenizer.decode(new_tokens[0], skip_special_tokens=True)
```

Implications for Model Provenance
The Watermark Arms Race
| Generation | Watermark Approach | Removal Attack | Result |
|---|---|---|---|
| 1st | Green-red token bias | Paraphrasing | Watermark removed |
| 2nd | Semantic watermarking | Style transfer | Partially effective |
| 3rd | Multi-level watermarking | Ensemble paraphrasing | Cat-and-mouse continues |
Current Limitations
Alternative Provenance Approaches
| Approach | Mechanism | Robustness to Removal |
|---|---|---|
| Text watermarking | Statistical bias in token selection | Low -- paraphrasing removes |
| Model fingerprinting | Unique behavioral signatures in model weights | Medium -- survives text-level attacks |
| Stylometric analysis | Statistical writing style features | Medium -- varies by text length |
| Retrieval-based detection | Compare against database of known outputs | High -- requires comprehensive database |
| Cryptographic signing | Sign outputs at generation time | High -- but requires trust in signing key |
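The cryptographic signing row can be sketched with Python's stdlib. This is a shared-secret HMAC sketch for illustration only; a production system would use an asymmetric signature scheme (e.g. Ed25519) so that verifiers do not hold the signing key:

```python
import hashlib
import hmac


def sign_output(text, signing_key):
    """Attach an HMAC tag to model output at generation time."""
    tag = hmac.new(signing_key, text.encode(), hashlib.sha256).hexdigest()
    return text, tag


def verify_output(text, tag, signing_key):
    """Verify provenance; fails if the text was edited at all."""
    expected = hmac.new(signing_key, text.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


key = b"provenance-demo-key"  # illustrative key, not a real deployment value
text, tag = sign_output("Model output to attest.", key)
print(verify_output(text, tag, key))              # True
print(verify_output(text + " edited", tag, key))  # False
```

Note the trade-off this makes concrete: unlike a statistical watermark, the signature breaks under any edit whatsoever, so it proves provenance only for verbatim copies and depends entirely on trust in the signing key, as the table indicates.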
Ethical Considerations
Related Topics
- Advanced Training Attack Vectors -- Overview of advanced training attacks
- Distillation-Based Model Extraction -- Model theft context
- Synthetic Data Pipeline Attacks -- Synthetic data provenance
- Defense Evasion -- Output filtering bypass
References
- A Watermark for Large Language Models (Kirchenbauer et al., 2023) -- Green-red list watermarking
- On the Reliability of Watermarks for Large Language Models (Piet et al., 2023) -- Watermark robustness analysis
- Paraphrasing evades detectors of AI-generated text (Sadasivan et al., 2023) -- Paraphrasing attacks