AI Watermarking and Attacks
Current AI watermarking schemes for model outputs and training data, their security properties, and known attacks that remove, forge, or evade watermarks.
AI watermarking aims to solve two critical problems: detecting whether content was generated by AI, and attributing that content to a specific model or provider. As AI-generated content becomes indistinguishable from human-created content, watermarking has emerged as a primary technical approach for maintaining provenance and trust. However, every watermarking scheme must contend with fundamental trade-offs between detectability, robustness, and quality -- and adversaries actively exploit these trade-offs.
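The detectability/quality trade-off can be made concrete with a toy calculation: biasing half the vocabulary by a logit offset delta concentrates probability mass on "green" tokens (which is what the detector measures) while visibly distorting the model's original distribution. The vocabulary size, delta value, and random logits below are illustrative, not taken from any particular model.

```python
import math
import random

random.seed(0)

VOCAB, DELTA = 1000, 2.0
logits = [random.gauss(0.0, 1.0) for _ in range(VOCAB)]
green = set(range(VOCAB // 2))  # pretend the first half is the green list

def green_mass(bias):
    """Probability mass on green tokens after adding `bias` to their logits."""
    weights = [math.exp(l + (bias if i in green else 0.0))
               for i, l in enumerate(logits)]
    return sum(weights[i] for i in green) / sum(weights)

p0 = green_mass(0.0)    # no watermark: roughly half the mass is green
p1 = green_mass(DELTA)  # the bias pushes most of the mass onto the green list
print(f"green mass without bias: {p0:.3f}, with delta={DELTA}: {p1:.3f}")
```

The larger the gap between the two numbers, the easier detection becomes -- and the further the sampled text drifts from what the unwatermarked model would have produced.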
Text Watermarking Schemes
Green-Red Token Watermarking (Kirchenbauer et al.)
The most influential text watermarking approach partitions the vocabulary into "green" and "red" lists based on a hash of the preceding token, then biases generation toward green tokens:
```python
import hashlib

import torch


class GreenRedWatermark:
    """Kirchenbauer et al. green-red list watermarking."""

    def __init__(self, vocab_size, green_fraction=0.5,
                 watermark_strength=2.0, secret_key="key"):
        self.vocab_size = vocab_size
        self.green_fraction = green_fraction
        self.strength = watermark_strength  # delta parameter
        self.key = secret_key

    def get_green_tokens(self, prefix_token_id):
        """Compute the green list from the preceding token."""
        seed = hashlib.sha256(
            f"{self.key}:{prefix_token_id}".encode()
        ).digest()
        rng = torch.Generator().manual_seed(
            int.from_bytes(seed[:4], "big")
        )
        # Random permutation of the vocabulary
        perm = torch.randperm(self.vocab_size, generator=rng)
        # The first green_fraction of the permutation is the green list
        green_count = int(self.vocab_size * self.green_fraction)
        return set(perm[:green_count].tolist())

    def apply_watermark(self, logits, prefix_token_id):
        """Bias logits toward green tokens."""
        green_tokens = self.get_green_tokens(prefix_token_id)
        watermarked_logits = logits.clone()
        green_idx = torch.tensor(sorted(green_tokens))
        watermarked_logits[green_idx] += self.strength
        return watermarked_logits

    def detect_watermark(self, token_ids, threshold=4.0):
        """Detect the watermark by counting green-token frequency."""
        green_count = 0
        total = 0
        for i in range(1, len(token_ids)):
            green_tokens = self.get_green_tokens(token_ids[i - 1])
            if token_ids[i] in green_tokens:
                green_count += 1
            total += 1
        if total == 0:
            return {"detected": False, "z_score": 0.0}
        # z-score under the null hypothesis that random text lands
        # on the green list a green_fraction of the time
        expected = total * self.green_fraction
        std = (total * self.green_fraction * (1 - self.green_fraction)) ** 0.5
        z_score = (green_count - expected) / std
        return {
            "detected": z_score > threshold,
            "z_score": z_score,
            "green_fraction_observed": green_count / total,
            "green_fraction_expected": self.green_fraction,
        }
```

Distortion-Free Watermarking (Christ et al.)
This approach embeds watermarks without any quality degradation by replacing the sampling randomness with keyed pseudorandomness drawn from the model's original distribution:
```python
import torch


class DistortionFreeWatermark:
    """
    Watermark that provably does not alter the output distribution.
    Uses shared randomness between encoder and decoder.
    """

    def __init__(self, secret_key):
        self.key = secret_key

    def generate_watermarked(self, model, prompt, max_tokens=100):
        """Generate watermarked text using the Gumbel-max trick."""
        tokens = []
        current_input = prompt
        for step in range(max_tokens):
            # Get the model's probability distribution
            logits = model(current_input)
            probs = torch.softmax(logits[-1], dim=-1)
            # Generate pseudorandom Gumbel noise keyed by position
            seed = self.position_seed(step, tokens)
            gumbel_noise = self.sample_gumbel(probs.shape, seed)
            # Argmax of log(probs) + gumbel_noise is equivalent to
            # sampling from probs (Gumbel-max trick), but the noise
            # is reproducible by anyone holding the key
            selected = (probs.log() + gumbel_noise).argmax()
            tokens.append(selected.item())
            current_input = torch.cat([current_input, selected.unsqueeze(0)])
        return tokens

    def detect(self, tokens, model, prompt):
        """Detect the watermark by checking tokens against the keyed noise."""
        matches = 0
        current_input = prompt
        for step, token in enumerate(tokens):
            # Re-run the model on the same prefix seen at generation time
            logits = model(current_input)
            probs = torch.softmax(logits[-1], dim=-1)
            seed = self.position_seed(step, tokens[:step])
            gumbel_noise = self.sample_gumbel(probs.shape, seed)
            expected_token = (probs.log() + gumbel_noise).argmax()
            if token == expected_token.item():
                matches += 1
            current_input = torch.cat([current_input, torch.tensor([token])])
        return matches / len(tokens)
```

Watermark Attacks
Paraphrase Attacks
The simplest and most effective attack: rewrite the watermarked text using another model:
```python
def paraphrase_attack(watermarked_text, paraphraser_model):
    """
    Remove a watermark by paraphrasing with a non-watermarked model.
    Effectiveness depends on paraphrase quality.
    """
    prompt = f"""Rewrite the following text to convey the exact same
meaning using completely different wording and sentence structure.
Maintain the same level of detail and technical accuracy.

Original: {watermarked_text}

Rewritten version:"""
    return paraphraser_model.generate(prompt)
```

Token Substitution Attacks
Systematically replace tokens to reduce green-token frequency while preserving meaning:
```python
import random


def token_substitution_attack(text, tokenizer, synonym_dict,
                              substitution_rate=0.3):
    """
    Replace a fraction of tokens with synonyms to disrupt the watermark.
    Lower substitution rates preserve quality but may leave a
    detectable watermark signal.
    """
    tokens = tokenizer.tokenize(text)
    modified = []
    for token in tokens:
        if random.random() < substitution_rate and token in synonym_dict:
            # Replace with a random synonym
            modified.append(random.choice(synonym_dict[token]))
        else:
            modified.append(token)
    return tokenizer.detokenize(modified)
```

Emoji and Unicode Attacks
Insert zero-width characters or Unicode variations to disrupt token-level watermark detection:
```python
import random


def unicode_disruption_attack(text):
    """
    Insert zero-width characters inside words to disrupt watermark
    detection that depends on token boundaries.
    """
    zero_width_chars = [
        '\u200b',  # Zero-width space
        '\u200c',  # Zero-width non-joiner
        '\u200d',  # Zero-width joiner
        '\ufeff',  # Zero-width no-break space
    ]
    disrupted = []
    for word in text.split():
        # Randomly insert a zero-width character within the word
        if len(word) > 3 and random.random() > 0.5:
            pos = random.randint(1, len(word) - 1)
            zwc = random.choice(zero_width_chars)
            word = word[:pos] + zwc + word[pos:]
        disrupted.append(word)
    return ' '.join(disrupted)
```

Spoofing Attacks (Watermark Forgery)
If the watermarking key or algorithm is discovered, attackers can embed false watermarks in human-written text:
```python
import random


def watermark_spoofing(human_text, watermark_key, tokenizer,
                       green_fraction=0.5):
    """
    Add a false watermark to human-written text to frame it as
    AI-generated, or to attribute it to a specific model.
    """
    tokens = tokenizer.encode(human_text)
    watermarker = GreenRedWatermark(
        tokenizer.vocab_size, green_fraction,
        secret_key=watermark_key
    )
    spoofed_tokens = []
    for i, token in enumerate(tokens):
        if i == 0:
            spoofed_tokens.append(token)
            continue
        green_tokens = watermarker.get_green_tokens(spoofed_tokens[-1])
        if token in green_tokens:
            # Already green -- keep it
            spoofed_tokens.append(token)
        else:
            # Find a green synonym (find_synonyms is assumed to return
            # token ids with similar meaning)
            synonyms = find_synonyms(token, tokenizer)
            green_synonyms = [s for s in synonyms if s in green_tokens]
            if green_synonyms:
                spoofed_tokens.append(random.choice(green_synonyms))
            else:
                spoofed_tokens.append(token)
    return tokenizer.decode(spoofed_tokens)
```

Recursive Watermark Removal
Use multiple models in sequence to progressively remove watermark signal:
```python
def recursive_removal(text, models, rounds=3):
    """
    Pass text through multiple non-watermarked models to
    progressively destroy the watermark signal.
    """
    current = text
    for round_num in range(rounds):
        model = models[round_num % len(models)]
        current = model.generate(
            f"Faithfully reproduce the following text with minor "
            f"stylistic improvements:\n\n{current}"
        )
    return current
```

Image Watermarking and Attacks
Stable Signature (Fernandez et al.)
Embeds watermarks by fine-tuning the decoder of a latent diffusion model:
```python
import torch


class StableSignatureAttack:
    """Attacks against Stable Signature image watermarking."""

    def regeneration_attack(self, watermarked_image, clean_model):
        """
        Remove the watermark by encoding the image to latent space
        with a clean (non-watermarked) model, then decoding.
        """
        # Encode with any VAE encoder
        latent = clean_model.encoder(watermarked_image)
        # Add small noise to disrupt the watermark in latent space
        noisy_latent = latent + torch.randn_like(latent) * 0.05
        # Decode with the clean decoder (no watermark)
        return clean_model.decoder(noisy_latent)

    def adversarial_perturbation_attack(self, watermarked_image,
                                        detector, epsilon=0.01):
        """
        Add an imperceptible perturbation that causes the detector
        to miss the watermark.
        """
        image = watermarked_image.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([image], lr=0.001)
        for _ in range(100):
            optimizer.zero_grad()
            # Minimize detector confidence
            loss = detector(image)
            loss.backward()
            optimizer.step()
            # Project back into the epsilon ball around the original
            delta = image.data - watermarked_image
            delta = torch.clamp(delta, -epsilon, epsilon)
            image.data = watermarked_image + delta
        return image.detach()
```

Watermark Robustness Analysis
Comprehensive analysis of watermark robustness across attack types:
| Watermark Type | Paraphrase | Token Sub. | Spoofing | Quality Loss |
|---|---|---|---|---|
| Green-Red (soft) | Vulnerable | Partially robust | Vulnerable if key leaked | Minimal |
| Green-Red (hard) | Vulnerable | Vulnerable | Vulnerable if key leaked | Moderate |
| Distortion-free | Vulnerable | Vulnerable | Resistant | None |
| Semantic watermark | Partially robust | Robust | Hard to spoof | Variable |
| Image (Stable Sig.) | N/A | N/A | Requires model access | Minimal |
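The table's "partially robust" entry for soft green-red watermarks under token substitution can be spot-checked empirically. The sketch below is a simplified, self-contained stand-in for the scheme above: green-list membership is decided by a keyed hash of each (previous, current) token pair rather than a full vocabulary permutation, a synthetic "watermarked" sequence greedily picks green tokens, and a 30% random substitution plays the role of a synonym attack. The vocabulary size, sequence length, and rates are illustrative.

```python
import hashlib
import random

random.seed(1)
KEY, GREEN_FRAC, VOCAB = "key", 0.5, 5000

def is_green(prev_tok, tok):
    """Token is 'green' iff a keyed hash of (prev, tok) falls below GREEN_FRAC."""
    h = hashlib.sha256(f"{KEY}:{prev_tok}:{tok}".encode()).digest()
    return h[0] / 256 < GREEN_FRAC

def z_score(tokens):
    """Green-red detection statistic over consecutive token pairs."""
    hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - n * GREEN_FRAC) / (n * GREEN_FRAC * (1 - GREEN_FRAC)) ** 0.5

# Build a "watermarked" sequence by greedily picking green tokens
tokens = [0]
for _ in range(300):
    tokens.append(next(t for t in random.sample(range(VOCAB), 50)
                       if is_green(tokens[-1], t)))

# Substitute 30% of tokens at random, as a synonym attack would
attacked = [random.randrange(VOCAB) if random.random() < 0.3 else t
            for t in tokens]

print(f"watermarked z={z_score(tokens):.1f}, after attack z={z_score(attacked):.1f}")
```

Because each substitution only breaks the two transitions touching it, a 30% substitution rate weakens the statistic substantially but typically leaves it above common detection thresholds, which is what "partially robust" means in practice.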
Implications for Trust and Governance
Watermarking occupies a central role in proposed AI governance frameworks, but its limitations have significant policy implications:
- False sense of security: Watermarks can be removed, creating false negatives (AI content not detected as AI)
- False accusations: Watermarks can be spoofed, creating false positives (human content flagged as AI)
- Arms race dynamics: As watermarking improves, so do attacks, creating an unstable equilibrium
- Coverage gaps: Not all AI providers watermark their outputs, and open-source models cannot be forced to implement watermarking
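The false-accusation risk has a statistical floor even before spoofing is considered: under the green-red scheme, the z-score of ordinary human text is approximately standard normal, so the baseline false positive rate at threshold z is the Gaussian upper tail (about 3 × 10⁻⁵ at z = 4, the threshold used in the detector above). A short calculation, with illustrative thresholds:

```python
import math

def false_positive_rate(z_threshold):
    """Gaussian upper-tail probability: chance human text exceeds the threshold."""
    return 0.5 * math.erfc(z_threshold / math.sqrt(2))

for z in (2.0, 4.0, 6.0):
    print(f"z > {z}: baseline FPR = {false_positive_rate(z):.2e}")
```

Spoofing attacks add false positives on top of this baseline, which is why key secrecy matters as much as threshold choice.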
Related Topics
- Watermark & Fingerprint Evasion — Basic watermark evasion concepts
- Data Provenance — Tracking data through ML pipelines
- Model Extraction — Extracting models and bypassing IP protections
A regulatory agency proposes requiring all AI-generated text to carry watermarks. An adversary uses a non-watermarked open-source model to paraphrase watermarked text. What is the primary implication?
References
- Kirchenbauer et al., "A Watermark for Large Language Models" (2023)
- Christ et al., "Undetectable Watermarks for Language Models" (2023)
- Fernandez et al., "The Stable Signature: Rooting Watermarks in Latent Diffusion Models" (2023)
- Zhao et al., "Provable Robust Watermarking for AI-Generated Text" (2023)
- Sadasivan et al., "Can AI-Generated Text be Reliably Detected?" (2023)