# AI Watermarking and Attacks
Current AI watermarking schemes for model outputs and training data, their security properties, and known attacks that remove, forge, or evade watermarks.
AI watermarking aims to solve two critical problems: detecting whether content was generated by AI and attributing that content to a specific model or provider. As AI-generated content becomes harder to distinguish from human-created content, watermarking has emerged as a primary technical approach for maintaining provenance and trust. However, every watermarking scheme must contend with fundamental trade-offs among detectability, robustness, and output quality, and adversaries actively exploit those trade-offs.
## Text Watermarking Schemes
### Green-Red Token Watermarking (Kirchenbauer et al.)
The most influential text watermarking approach partitions the vocabulary into "green" and "red" lists based on a hash of the preceding tokens, then biases generation toward green tokens:
```python
import hashlib

import torch


class GreenRedWatermark:
    """Kirchenbauer et al. green-red list watermarking."""

    def __init__(self, vocab_size, green_fraction=0.5,
                 watermark_strength=2.0, secret_key="key"):
        self.vocab_size = vocab_size
        self.green_fraction = green_fraction
        self.strength = watermark_strength  # delta parameter
        self.key = secret_key

    def get_green_tokens(self, prefix_token_id):
        """Compute the green list from a keyed hash of the preceding token."""
        seed = hashlib.sha256(
            f"{self.key}:{prefix_token_id}".encode()
        ).digest()
        rng = torch.Generator().manual_seed(
            int.from_bytes(seed[:4], "big")
        )
        # Random permutation of the vocabulary
        perm = torch.randperm(self.vocab_size, generator=rng)
        # The first green_fraction of the permutation are green tokens
        green_count = int(self.vocab_size * self.green_fraction)
        return set(perm[:green_count].tolist())

    def apply_watermark(self, logits, prefix_token_id):
        """Bias logits toward green tokens."""
        green_tokens = self.get_green_tokens(prefix_token_id)
        watermarked_logits = logits.clone()
        watermarked_logits[list(green_tokens)] += self.strength
        return watermarked_logits

    def detect_watermark(self, token_ids, threshold=4.0):
        """Detect the watermark by counting green-token frequency."""
        green_count = 0
        total = 0
        for i in range(1, len(token_ids)):
            green_tokens = self.get_green_tokens(token_ids[i - 1])
            if token_ids[i] in green_tokens:
                green_count += 1
            total += 1
        if total == 0:
            return {"detected": False, "z_score": 0.0}
        # z-score under the null hypothesis that unwatermarked
        # text hits the green list at rate green_fraction
        expected = total * self.green_fraction
        std = (total * self.green_fraction * (1 - self.green_fraction)) ** 0.5
        z_score = (green_count - expected) / std
        return {
            "detected": z_score > threshold,
            "z_score": z_score,
            "green_fraction_observed": green_count / total,
            "green_fraction_expected": self.green_fraction,
        }
```

### Distortion-Free Watermarking (Christ et al.)
This approach embeds watermarks without any quality degradation by using rejection sampling from the model's original distribution:
```python
import hashlib

import torch


class DistortionFreeWatermark:
    """
    Watermark that provably does not alter the output distribution.
    Relies on pseudorandomness shared between generator and detector.
    """

    def __init__(self, secret_key):
        self.key = secret_key

    def position_seed(self, step, prior_tokens):
        """Derive a reproducible seed from the key, position, and context."""
        material = f"{self.key}:{step}:{prior_tokens}".encode()
        return int.from_bytes(hashlib.sha256(material).digest()[:4], "big")

    def sample_gumbel(self, shape, seed):
        """Keyed Gumbel(0, 1) noise: -log(-log(U)) with seeded uniforms."""
        rng = torch.Generator().manual_seed(seed)
        uniform = torch.rand(shape, generator=rng).clamp_min(1e-12)
        return -torch.log(-torch.log(uniform))

    def generate_watermarked(self, model, prompt, max_tokens=100):
        """Generate watermarked text using the Gumbel-max trick."""
        tokens = []
        current_input = prompt
        for step in range(max_tokens):
            # Model's next-token distribution
            logits = model(current_input)
            probs = torch.softmax(logits[-1], dim=-1)
            # Pseudorandom Gumbel noise, keyed by position and context
            seed = self.position_seed(step, tokens)
            gumbel_noise = self.sample_gumbel(probs.shape, seed)
            # argmax(log probs + Gumbel noise) is an exact sample from
            # probs (Gumbel-max trick), but reproducible with the key
            selected = (probs.log() + gumbel_noise).argmax()
            tokens.append(selected.item())
            current_input = torch.cat([current_input, selected.unsqueeze(0)])
        return tokens

    def detect(self, tokens, model, prompt):
        """Detect the watermark by replaying the keyed noise and checking
        how often the observed tokens match the argmax choices."""
        matches = 0
        current_input = prompt
        for step, token in enumerate(tokens):
            # Re-run the model on the growing prefix, not just the prompt
            logits = model(current_input)
            probs = torch.softmax(logits[-1], dim=-1)
            seed = self.position_seed(step, tokens[:step])
            gumbel_noise = self.sample_gumbel(probs.shape, seed)
            expected_token = (probs.log() + gumbel_noise).argmax()
            if token == expected_token.item():
                matches += 1
            current_input = torch.cat([current_input, torch.tensor([token])])
        return matches / len(tokens)
```

## Watermark Attacks
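Attack papers typically report how far a transformation drops the detector's z-score. The harness below is a self-contained sketch of that evaluation: it swaps the permutation-based green list for a keyed per-bigram hash (my simplification, not Kirchenbauer et al.'s exact construction) so the whole pipeline runs without a language model, and the `simulate_watermarked` generator is likewise illustrative.

```python
import hashlib
import random


def is_green(prev_token, token, key="key", green_fraction=0.5):
    """A token is 'green' after prev_token if a keyed hash of the pair
    falls below green_fraction (a hash-based stand-in for the
    permutation-based green list)."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2 ** 32 < green_fraction


def z_score(tokens, key="key", green_fraction=0.5):
    """z-statistic of the observed green-token count against the null
    hypothesis that unwatermarked text is green at rate green_fraction."""
    total = len(tokens) - 1
    hits = sum(is_green(tokens[i - 1], tokens[i], key, green_fraction)
               for i in range(1, len(tokens)))
    expected = total * green_fraction
    std = (total * green_fraction * (1 - green_fraction)) ** 0.5
    return (hits - expected) / std


def simulate_watermarked(length=200, seed=0):
    """Simulate strongly watermarked text: at each step, pick a green
    continuation from 20 random candidate tokens when one exists."""
    rng = random.Random(seed)
    tokens = [0]
    for _ in range(length):
        candidates = [t for t in rng.sample(range(1000), 20)
                      if is_green(tokens[-1], t)]
        tokens.append(candidates[0] if candidates else rng.choice(range(1000)))
    return tokens


if __name__ == "__main__":
    tokens = simulate_watermarked()
    rng = random.Random(1)
    # Crude substitution attack: replace half the tokens at random
    attacked = [rng.randrange(1000) if rng.random() < 0.5 else t
                for t in tokens]
    print(f"z before attack: {z_score(tokens):.1f}, "
          f"after: {z_score(attacked):.1f}")
```

With these settings the simulated watermark sits around z ≈ 14, and a 50% substitution attack pulls it down toward the detection threshold, mirroring the partial robustness of soft green-red watermarks against the attacks below.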
### Paraphrase Attack
The simplest and most effective attack: rewrite the watermarked text using another model:
```python
def paraphrase_attack(watermarked_text, paraphraser_model):
    """
    Remove a watermark by paraphrasing with a non-watermarked model.
    Effectiveness depends on paraphrase quality.
    """
    prompt = f"""Rewrite the following text to convey the exact same
meaning using completely different wording and sentence structure.
Maintain the same level of detail and technical accuracy.

Original: {watermarked_text}

Rewritten version:"""
    return paraphraser_model.generate(prompt)
```

### Token Substitution Attack
Systematically replace tokens to reduce green token frequency while preserving meaning:
```python
import random


def token_substitution_attack(text, tokenizer, synonym_dict,
                              substitution_rate=0.3):
    """
    Replace a fraction of tokens with synonyms to disrupt the watermark.
    Lower substitution rates preserve quality but may leave a
    detectable watermark signal.
    """
    tokens = tokenizer.tokenize(text)
    modified = []
    for token in tokens:
        if random.random() < substitution_rate and token in synonym_dict:
            # Replace with a randomly chosen synonym
            modified.append(random.choice(synonym_dict[token]))
        else:
            modified.append(token)
    return tokenizer.detokenize(modified)
```

### Emoji and Unicode Attack
Insert zero-width characters or Unicode variations to disrupt token-level watermark detection:
```python
import random


def unicode_disruption_attack(text):
    """
    Insert zero-width characters inside words to disrupt watermark
    detection that depends on token boundaries.
    """
    zero_width_chars = [
        '\u200b',  # Zero-width space
        '\u200c',  # Zero-width non-joiner
        '\u200d',  # Zero-width joiner
        '\ufeff',  # Zero-width no-break space
    ]
    disrupted = []
    for word in text.split():
        # Randomly insert a zero-width character within the word
        if len(word) > 3 and random.random() > 0.5:
            pos = random.randint(1, len(word) - 1)
            zwc = random.choice(zero_width_chars)
            word = word[:pos] + zwc + word[pos:]
        disrupted.append(word)
    return ' '.join(disrupted)
```

### Spoofing Attack (Watermark Forgery)
If the watermarking key or algorithm is discovered, an attacker can embed false watermarks in human-written text:
```python
import random


def watermark_spoofing(human_text, watermark_key, tokenizer,
                       green_fraction=0.5):
    """
    Add a false watermark to human-written text to frame it as
    AI-generated, or to attribute it to a specific model.
    Assumes a find_synonyms(token, tokenizer) helper that returns
    candidate replacement token ids.
    """
    tokens = tokenizer.encode(human_text)
    watermarker = GreenRedWatermark(
        tokenizer.vocab_size, green_fraction,
        secret_key=watermark_key
    )
    spoofed_tokens = []
    for i, token in enumerate(tokens):
        if i == 0:
            spoofed_tokens.append(token)
            continue
        green_tokens = watermarker.get_green_tokens(spoofed_tokens[-1])
        if token in green_tokens:
            # Already green -- keep it
            spoofed_tokens.append(token)
        else:
            # Try to swap in a green synonym
            synonyms = find_synonyms(token, tokenizer)
            green_synonyms = [s for s in synonyms if s in green_tokens]
            if green_synonyms:
                spoofed_tokens.append(random.choice(green_synonyms))
            else:
                spoofed_tokens.append(token)
    return tokenizer.decode(spoofed_tokens)
```

### Recursive Watermark Removal
Use multiple models in sequence to progressively remove watermark signal:
```python
def recursive_removal(text, models, rounds=3):
    """
    Pass text through multiple non-watermarked models to
    progressively destroy the watermark signal.
    """
    current = text
    for round_num in range(rounds):
        model = models[round_num % len(models)]
        current = model.generate(
            f"Faithfully reproduce the following text with minor "
            f"stylistic improvements:\n\n{current}"
        )
    return current
```

## Image Watermarking and Attacks
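Image-watermark attacks are judged on two axes: whether the detector still fires, and how much visible damage the attack causes. Quality loss is usually reported as PSNR between the watermarked and attacked images; the pure-Python sketch below (the `psnr` helper and flat-list image representation are illustrative, not from the Stable Signature paper) shows the computation.

```python
import math


def psnr(original, attacked, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images given as
    equal-length flat lists of pixel values in [0, max_value]."""
    mse = sum((a - b) ** 2 for a, b in zip(original, attacked)) / len(original)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

An L-infinity perturbation of epsilon = 0.01 on [0, 1] pixels bounds the MSE at 1e-4, i.e. a PSNR of at least 40 dB, which is why the adversarial perturbation attack below can stay imperceptible.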
### Stable Signature (Fernandez et al.)
Embeds watermarks by fine-tuning the decoder of a latent diffusion model:
```python
import torch


class StableSignatureAttack:
    """Attacks against Stable Signature image watermarking."""

    def regeneration_attack(self, watermarked_image, clean_model):
        """
        Remove the watermark by encoding the image to latent space
        with a clean (non-watermarked) model, then decoding.
        """
        # Encode with any VAE encoder
        latent = clean_model.encoder(watermarked_image)
        # Add small noise to disrupt the watermark in latent space
        noisy_latent = latent + torch.randn_like(latent) * 0.05
        # Decode with a clean decoder (no watermark)
        return clean_model.decoder(noisy_latent)

    def adversarial_perturbation_attack(self, watermarked_image,
                                        detector, epsilon=0.01):
        """
        Add an imperceptible perturbation that causes the detector
        to miss the watermark.
        """
        image = watermarked_image.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([image], lr=0.001)
        for _ in range(100):
            optimizer.zero_grad()
            # Minimize detector confidence
            loss = detector(image)
            loss.backward()
            optimizer.step()
            # Project back into the L-infinity epsilon ball
            delta = torch.clamp(image.data - watermarked_image,
                                -epsilon, epsilon)
            image.data = watermarked_image + delta
        return image.detach()
```

## Watermark Robustness Analysis
A summary of watermark robustness across attack types:
| Watermark Type | Paraphrase | Token Sub. | Spoofing | Quality Loss |
|---|---|---|---|---|
| Green-Red (soft) | Vulnerable | Partially robust | Vulnerable if key leaked | Minimal |
| Green-Red (hard) | Vulnerable | Vulnerable | Vulnerable if key leaked | Moderate |
| Distortion-free | Vulnerable | Vulnerable | Resistant | None |
| Semantic watermark | Partially robust | Robust | Hard to spoof | Variable |
| Image (Stable Sig.) | N/A | N/A | Requires model access | Minimal |
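The table's qualitative ratings can be grounded quantitatively. For green-red detection, rearranging the z-score formula z = (p − γ)·√(T / (γ(1−γ))) gives the minimum text length T at which an observed green-token rate p clears a detection threshold; the sketch below is my own rearrangement of that formula, not taken from a specific paper.

```python
import math


def min_tokens_for_detection(observed_green_rate, green_fraction=0.5,
                             z_threshold=4.0):
    """Minimum token count T for an observed green rate p to exceed
    z_threshold, from:
        z = (p - gamma) * sqrt(T / (gamma * (1 - gamma)))
        T = gamma * (1 - gamma) * (z / (p - gamma)) ** 2
    """
    boost = observed_green_rate - green_fraction
    if boost <= 0:
        return math.inf  # no signal above the null rate
    return green_fraction * (1 - green_fraction) * (z_threshold / boost) ** 2
```

If watermarking lifts the green rate to 0.75, about 64 tokens suffice for z > 4; a post-attack rate of 0.55 needs roughly 1,600 tokens. This is why partial attacks that merely dilute the signal still defeat detection on short texts.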
## Implications for Trust and Governance
Watermarking occupies a central role in proposed AI governance frameworks, but its limitations have significant policy implications:
- False sense of security: Watermarks can be removed, creating false negatives (AI content not detected as AI)
- False accusations: Watermarks can be spoofed, creating false positives (human content flagged as AI)
- Arms race dynamics: As watermarking improves, so do attacks, creating an unstable equilibrium
- Coverage gaps: Not all AI providers watermark their outputs, and open-source models cannot be forced to implement watermarking
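The false-accusation risk is directly computable for a z-threshold detector: under the null hypothesis, long human-written text yields an approximately Gaussian z-statistic, so the per-document false positive rate is the Gaussian upper-tail probability.

```python
import math


def false_positive_rate(z_threshold):
    """Probability that unwatermarked text exceeds z_threshold by
    chance (Gaussian upper tail; a good approximation for long texts)."""
    return 0.5 * math.erfc(z_threshold / math.sqrt(2))
```

At a threshold of z = 4 this is about 3.2e-5, i.e. one in roughly 31,600 human documents flagged by chance. Applied at web scale that is still many false accusations in absolute terms, before any deliberate spoofing.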
## Related Topics
- Watermark & Fingerprint Evasion — Basic watermark evasion concepts
- Data Provenance — Tracking data through ML pipelines
- Model Extraction — Extracting models and bypassing IP protections
Review question: A regulatory agency proposes requiring all AI-generated text to carry watermarks. An adversary uses a non-watermarked open-source model to paraphrase watermarked text. What is the primary implication?
## References
- Kirchenbauer et al., "A Watermark for Large Language Models" (2023)
- Christ et al., "Undetectable Watermarks for Language Models" (2023)
- Fernandez et al., "The Stable Signature: Rooting Watermarks in Latent Diffusion Models" (2023)
- Zhao et al., "Provable Robust Watermarking for AI-Generated Text" (2023)
- Sadasivan et al., "Can AI-Generated Text be Reliably Detected?" (2023)