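AI Watermarking and Attacks
Advanced, 6 min read, updated 2026-03-15
Current AI watermarking schemes for model outputs and training data, their security properties, and known attacks that remove, forge, or evade watermarks.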
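AI watermarking aims to solve two key problems: detecting whether content was generated by AI, and attributing that content to a specific model or provider. As AI-generated content becomes harder to distinguish from human work, watermarking has become a primary technical route to preserving provenance and trust. However, every watermarking scheme must trade off detectability, robustness, and output quality against one another, and attackers actively exploit those trade-offs.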
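Text Watermarking Schemes
Green-Red Token Watermarking (Kirchenbauer et al.)
The most influential text watermarking method hashes the preceding token to partition the vocabulary into a "green" list and a "red" list, then biases generation toward green tokens. The GreenRedWatermark class in this section covers both embedding and detection.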
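To see what biasing toward green tokens means numerically, the following sketch computes how much probability mass lands on the green list once the bias is added to the logits. The function name `green_mass_after_bias` is hypothetical, and the value 2.0 simply mirrors the default `watermark_strength` used in this section. The closed form follows from how softmax treats an additive logit shift; it is an illustration, not code from the cited paper.

```python
import math


def green_mass_after_bias(green_mass, delta):
    """Probability mass on green tokens after adding `delta` to their logits.

    Adding delta to a logit multiplies that token's unnormalized probability
    by exp(delta), so a green mass g becomes g*e^delta / (g*e^delta + 1 - g).
    """
    boosted = green_mass * math.exp(delta)
    return boosted / (boosted + (1.0 - green_mass))


if __name__ == "__main__":
    # Illustrative starting masses: balanced, skewed, and near-deterministic.
    for g in (0.5, 0.1, 0.01):
        print(f"green mass {g:.2f} -> {green_mass_after_bias(g, 2.0):.3f}")
```

Under these assumptions an evenly split position moves to roughly 88% green, while a position where green tokens held only 1% of the mass moves to roughly 7%. That is why low-entropy passages such as code or fixed phrases carry little watermark signal.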
import hashlib

import torch


class GreenRedWatermark:
    """Kirchenbauer et al. green-red list watermarking."""

    def __init__(self, vocab_size, green_fraction=0.5,
                 watermark_strength=2.0, secret_key="key"):
        self.vocab_size = vocab_size
        self.green_fraction = green_fraction
        self.strength = watermark_strength  # delta parameter
        self.key = secret_key

    def get_green_tokens(self, prefix_token_id):
        """Compute green list based on preceding token."""
        # int() keeps the hash input identical for Python ints and 0-d tensors
        seed = hashlib.sha256(
            f"{self.key}:{int(prefix_token_id)}".encode()
        ).digest()
        rng = torch.Generator().manual_seed(
            int.from_bytes(seed[:4], "big")
        )
        # Random permutation of vocabulary
        perm = torch.randperm(self.vocab_size, generator=rng)
        # First green_fraction of permutation are green tokens
        green_count = int(self.vocab_size * self.green_fraction)
        green_tokens = set(perm[:green_count].tolist())
        return green_tokens

    def apply_watermark(self, logits, prefix_token_id):
        """Bias logits (1-D, one entry per vocabulary token) toward green tokens."""
        green_tokens = self.get_green_tokens(prefix_token_id)
        watermarked_logits = logits.clone()
        # Index every green position at once instead of looping in Python
        green_index = torch.tensor(sorted(green_tokens), dtype=torch.long,
                                   device=logits.device)
        watermarked_logits[green_index] += self.strength
        return watermarked_logits

    def detect_watermark(self, token_ids, threshold=4.0):
        """Detect watermark by counting green token frequency."""
        green_count = 0
        total = 0
        for i in range(1, len(token_ids)):
            green_tokens = self.get_green_tokens(token_ids[i - 1])
            if token_ids[i] in green_tokens:
                green_count += 1
            total += 1
        if total == 0:
            return {"detected": False, "z_score": 0.0}
        # Compute z-score under null hypothesis
        # (random text has green_fraction green tokens)
        expected = total * self.green_fraction
        std = (total * self.green_fraction * (1 - self.green_fraction)) ** 0.5
        z_score = (green_count - expected) / std
        return {
            "detected": z_score > threshold,
            "z_score": z_score,
            "green_fraction_observed": green_count / total,
            "green_fraction_expected": self.green_fraction
        }

Distortion-Free Watermarking (Christ et al.)
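This family of methods embeds the watermark by replacing the sampler's randomness with pseudorandomness derived from a secret key, so that, averaged over keys, the output distribution is provably unchanged and no quality is lost. The DistortionFreeWatermark class in this section illustrates the idea with the Gumbel-max trick; it is a simplified sketch rather than the exact construction of Christ et al.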
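The claim that such a watermark does not change the output distribution can be checked empirically for the Gumbel-max trick. The sketch below is a standard-library illustration and not part of any cited scheme: `gumbel_max_sample` and `empirical_frequencies` are hypothetical names, and a seeded `random.Random` stands in for a keyed pseudorandom function.

```python
import math
import random
from collections import Counter


def gumbel_max_sample(probs, rng):
    """Draw one index from `probs` via argmax(log p_i + Gumbel noise)."""
    best_index, best_score = -1, -math.inf
    for index, p in enumerate(probs):
        if p <= 0.0:
            continue  # zero-probability entries can never win the argmax
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        score = math.log(p) + gumbel
        if score > best_score:
            best_index, best_score = index, score
    return best_index


def empirical_frequencies(probs, seed, draws=20000):
    """Frequency of each index over `draws` samples driven by one seed."""
    rng = random.Random(seed)  # the seed plays the role of the secret key
    counts = Counter(gumbel_max_sample(probs, rng) for _ in range(draws))
    return [counts[i] / draws for i in range(len(probs))]


if __name__ == "__main__":
    target = [0.5, 0.3, 0.2]
    print(empirical_frequencies(target, seed=1234))
```

The printed frequencies should sit close to the target distribution, and rerunning with the same seed reproduces the same choices exactly. That reproducibility is what a detector holding the key relies on.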
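import hashlib

import torch


class DistortionFreeWatermark:
    """
    Watermark that provably does not alter output distribution.
    Uses shared randomness between encoder and decoder.
    """

    def __init__(self, secret_key):
        self.key = secret_key

    def position_seed(self, step, previous_tokens):
        """Assumed helper: derive a seed from the key, step, and prior tokens."""
        material = f"{self.key}:{step}:{list(previous_tokens)}".encode()
        return int.from_bytes(hashlib.sha256(material).digest()[:4], "big")

    def sample_gumbel(self, shape, seed):
        """Assumed helper: reproducible standard Gumbel noise for a given seed."""
        rng = torch.Generator().manual_seed(seed)
        uniform = torch.rand(shape, generator=rng).clamp_min(1e-20)
        return -torch.log(-torch.log(uniform))

    def generate_watermarked(self, model, prompt, max_tokens=100):
        """Generate watermarked text using the Gumbel-max trick.

        Assumes `prompt` is a 1-D tensor of token IDs and `model` returns
        logits of shape (sequence_length, vocab_size).
        """
        tokens = []
        current_input = prompt
        for step in range(max_tokens):
            # Get model's log-probabilities for the next token
            with torch.no_grad():
                logits = model(current_input)
            log_probs = torch.log_softmax(logits[-1], dim=-1)
            # Generate pseudorandom Gumbel noise (keyed by position)
            seed = self.position_seed(step, tokens)
            gumbel_noise = self.sample_gumbel(log_probs.shape, seed)
            # Argmax of log(probs) + gumbel_noise is equivalent
            # to sampling from probs (Gumbel-max trick)
            # But the specific noise is reproducible with the key
            selected = (log_probs + gumbel_noise).argmax()
            tokens.append(selected.item())
            current_input = torch.cat([current_input, selected.unsqueeze(0)])
        return tokens

    def detect(self, tokens, model, prompt):
        """Return the fraction of tokens that match the keyed Gumbel-max choice."""
        if not tokens:
            return 0.0
        matches = 0
        current_input = prompt
        for step, token in enumerate(tokens):
            # Re-run the model on the prompt plus the tokens seen so far
            with torch.no_grad():
                logits = model(current_input)
            log_probs = torch.log_softmax(logits[-1], dim=-1)
            seed = self.position_seed(step, tokens[:step])
            gumbel_noise = self.sample_gumbel(log_probs.shape, seed)
            expected_token = (log_probs + gumbel_noise).argmax()
            if token == expected_token.item():
                matches += 1
            next_token = torch.tensor([token], dtype=current_input.dtype,
                                      device=current_input.device)
            current_input = torch.cat([current_input, next_token])
        return matches / len(tokens)

Watermark Attacks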
Paraphrase Attacks
The simplest and most effective attack is to have a different model rewrite the watermarked text:
def paraphrase_attack(watermarked_text, paraphraser_model):
    """
    Remove watermark by paraphrasing with a non-watermarked model.
    Effectiveness depends on paraphrase quality.
    """
    prompt = f"""Rewrite the following text to convey the exact same
meaning using completely different wording and sentence structure.
Maintain the same level of detail and technical accuracy.

Original: {watermarked_text}

Rewritten version:"""
    # paraphraser_model is assumed to expose a generate(prompt) -> str method
    return paraphraser_model.generate(prompt)

Token Substitution Attacks
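Systematically replace tokens to lower the green-token frequency while preserving meaning. Because the attacker does not know the key, each substitution is effectively random with respect to the green list, which pulls the observed green fraction back toward its chance level: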
import random


def token_substitution_attack(text, tokenizer, synonym_dict,
                              substitution_rate=0.3):
    """
    Replace a fraction of tokens with synonyms to disrupt watermark.
    Lower substitution rates preserve quality but may leave
    detectable watermark signal.
    """
    # tokenizer is assumed to expose tokenize()/detokenize(); with Hugging Face
    # tokenizers the counterpart of detokenize is convert_tokens_to_string().
    tokens = tokenizer.tokenize(text)
    modified = []
    for token in tokens:
        if (random.random() < substitution_rate and
                token in synonym_dict):
            # Replace with random synonym
            synonym = random.choice(synonym_dict[token])
            modified.append(synonym)
        else:
            modified.append(token)
    return tokenizer.detokenize(modified)

Emoji and Unicode Attacks
Insert zero-width characters or Unicode variants to disrupt token-level watermark detection:
import random


def unicode_disruption_attack(text):
    """
    Insert zero-width characters inside words to disrupt
    watermark detection that depends on token boundaries.
    """
    zero_width_chars = [
        '\u200b',  # Zero-width space
        '\u200c',  # Zero-width non-joiner
        '\u200d',  # Zero-width joiner
        '\ufeff',  # Zero-width no-break space
    ]
    # Note: split()/join() also collapses runs of whitespace to single spaces
    words = text.split()
    disrupted = []
    for word in words:
        # Randomly insert zero-width character within word
        if len(word) > 3 and random.random() > 0.5:
            pos = random.randint(1, len(word) - 1)
            zwc = random.choice(zero_width_chars)
            word = word[:pos] + zwc + word[pos:]
        disrupted.append(word)
    return ' '.join(disrupted)

Spoofing Attacks (Watermark Forgery)
If the watermark key or algorithm is discovered, an attacker can embed a false watermark in human-written text:
import random


def watermark_spoofing(human_text, watermark_key, tokenizer,
                       find_synonyms, green_fraction=0.5):
    """
    Add a false watermark to human-written text to frame it
    as AI-generated, or to attribute it to a specific model.

    `find_synonyms(token_id, tokenizer)` is a caller-supplied helper
    (hypothetical, not defined here) that returns candidate token IDs.
    """
    tokens = tokenizer.encode(human_text)
    watermarker = GreenRedWatermark(
        tokenizer.vocab_size, green_fraction,
        secret_key=watermark_key
    )
    spoofed_tokens = []
    for i, token in enumerate(tokens):
        if i == 0:
            spoofed_tokens.append(token)
            continue
        # The green list depends on the token actually emitted before this one
        green_tokens = watermarker.get_green_tokens(spoofed_tokens[-1])
        if token in green_tokens:
            # Already green -- keep it
            spoofed_tokens.append(token)
        else:
            # Find a green synonym
            synonyms = find_synonyms(token, tokenizer)
            green_synonyms = [s for s in synonyms if s in green_tokens]
            if green_synonyms:
                spoofed_tokens.append(random.choice(green_synonyms))
            else:
                spoofed_tokens.append(token)
    return tokenizer.decode(spoofed_tokens)

Recursive Watermark Removal
Pass the text through several models in sequence to erode the watermark signal progressively:
def recursive_removal(text, models, rounds=3):
    """
    Pass text through multiple non-watermarked models
    to progressively destroy watermark signal.
    """
    current = text
    for round_num in range(rounds):
        # Cycle through the available models (each assumed to expose generate())
        model = models[round_num % len(models)]
        current = model.generate(
            "Faithfully reproduce the following text with minor "
            f"stylistic improvements:\n\n{current}"
        )
    return current

Image Watermarking and Attacks
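Stable Signature (Fernandez et al.)
Stable Signature embeds the watermark by fine-tuning the decoder of a latent diffusion model, so every image the decoder produces carries the signature. The class below sketches two attacks against it: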
import torch


class StableSignatureAttack:
    """Attacks against Stable Signature image watermarking."""

    def regeneration_attack(self, watermarked_image, clean_model):
        """
        Remove watermark by encoding image to latent space
        with a clean (non-watermarked) model, then decoding.
        `clean_model.encoder` / `.decoder` are an assumed autoencoder interface.
        """
        with torch.no_grad():
            # Encode with any VAE encoder
            latent = clean_model.encoder(watermarked_image)
            # Add small noise to disrupt watermark in latent space
            noisy_latent = latent + torch.randn_like(latent) * 0.05
            # Decode with clean decoder (no watermark)
            cleaned_image = clean_model.decoder(noisy_latent)
        return cleaned_image

    def adversarial_perturbation_attack(self, watermarked_image,
                                        detector, epsilon=0.01):
        """
        Add imperceptible perturbation that causes detector
        to miss the watermark. Assumes `detector` is differentiable.
        """
        original = watermarked_image.detach()
        image = original.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([image], lr=0.001)
        for _ in range(100):
            optimizer.zero_grad()
            # Minimize detector confidence (reduced to a scalar for backward())
            loss = detector(image).mean()
            loss.backward()
            optimizer.step()
            # Project back into the epsilon ball around the original image
            with torch.no_grad():
                delta = torch.clamp(image - original, -epsilon, epsilon)
                image.copy_(original + delta)
        return image.detach()

Watermark Robustness Analysis
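Summary of how each watermark type holds up against each class of attack:
| Watermark type | Paraphrase | Token substitution | Spoofing | Quality loss |
|---|---|---|---|---|
| Green-red (soft) | Vulnerable | Partially robust | Vulnerable if key leaks | Minimal |
| Green-red (hard) | Vulnerable | Vulnerable | Vulnerable if key leaks | Moderate |
| Distortion-free | Vulnerable | Vulnerable | Resistant | None |
| Semantic watermark | Partially robust | Robust | Hard to forge | Varies |
| Image (Stable Sig.) | N/A | N/A | Requires model access | Minimal |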
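The "partially robust" rating for token substitution against the soft green-red scheme can be motivated with a back-of-the-envelope model. The sketch below assumes, as a simplification, that tokens are substituted independently at a fixed rate and that a scored token keeps its watermark signal only if neither it nor its predecessor was replaced; otherwise it is green with the chance probability. The function name and the 0.8 watermarked green rate are illustrative assumptions, not measurements.

```python
import math


def expected_z_after_substitution(num_tokens, watermarked_green_rate,
                                  substitution_rate, green_fraction=0.5):
    """Expected detector z-score after random token substitution.

    A scored (previous, current) pair survives with probability (1 - r)^2;
    a broken pair is green only with the chance probability green_fraction.
    """
    survive = (1.0 - substitution_rate) ** 2
    observed = green_fraction + (watermarked_green_rate - green_fraction) * survive
    std = math.sqrt(num_tokens * green_fraction * (1.0 - green_fraction))
    return (observed - green_fraction) * num_tokens / std


if __name__ == "__main__":
    for rate in (0.0, 0.3, 0.5):
        z = expected_z_after_substitution(200, 0.8, rate)
        print(f"substitution rate {rate:.1f}: expected z = {z:.2f}")
```

With these illustrative numbers the expected z-score falls from about 8.49 with no substitution to about 4.16 at a 30% rate and about 2.12 at 50%. Against a detection threshold of 4.0, moderate substitution leaves the watermark barely detectable while heavier substitution removes it.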
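Implications for Trust and Governance
Watermarking plays a central role in several proposed AI governance frameworks, but its limitations carry significant policy implications:
- False sense of security: watermarks can be removed, producing false negatives (AI content that is not detected as AI)
- False accusations: watermarks can be forged, producing false positives (human content flagged as AI)
- Arms-race dynamics: attacks advance alongside watermark improvements, producing an unstable equilibrium
- Coverage gaps: not every AI provider watermarks its outputs, and open-source models cannot be compelled to implement watermarking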
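Even without any forgery, a z-score detector flags some human text by chance, which matters once it runs at scale. The sketch below estimates that baseline rate for a given threshold using the normal approximation behind the z-test. It assumes human-written tokens land on the green list independently at the chance rate, which is an idealization, and both function names are hypothetical.

```python
import math


def false_positive_rate(z_threshold):
    """One-sided probability that a standard normal exceeds z_threshold."""
    return 0.5 * math.erfc(z_threshold / math.sqrt(2.0))


def expected_false_flags(num_human_documents, z_threshold):
    """Expected number of human documents flagged purely by chance."""
    return num_human_documents * false_positive_rate(z_threshold)


if __name__ == "__main__":
    for threshold in (2.0, 4.0):
        rate = false_positive_rate(threshold)
        flagged = expected_false_flags(1_000_000, threshold)
        print(f"z > {threshold}: rate {rate:.2e}, "
              f"about {flagged:.0f} per million human documents")
```

Under this approximation a threshold of 4.0 flags roughly 32 human documents per million, while a threshold of 2.0 flags tens of thousands. The choice of threshold is therefore a policy decision as much as a statistical one.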
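Knowledge Check
A regulator proposes requiring every piece of AI-generated text to carry a watermark. An attacker paraphrases watermarked text with a non-watermarked open-source model. What is the main impact?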
References
- Kirchenbauer et al., "A Watermark for Large Language Models" (2023)
- Christ et al., "Undetectable Watermarks for Language Models" (2023)
- Fernandez et al., "The Stable Signature: Rooting Watermarks in Latent Diffusion Models" (2023)
- Zhao et al., "Provable Robust Watermarking for AI-Generated Text" (2023)
- Sadasivan et al., "Can AI-Generated Text be Reliably Detected?" (2023)