Watermarking & AI-Generated Text Detection
Statistical watermarking schemes for LLM outputs, AI-generated text detectors, their cryptographic foundations, and systematic techniques for evading or removing watermarks.
Watermarking and AI-generated text detection represent the defender's attempt to maintain provenance over model outputs. For red teamers, understanding these mechanisms is essential -- both to evaluate their robustness and to assess whether downstream safety assumptions built on watermark presence are sound.
Watermarking Scheme Taxonomy
| Scheme Type | Mechanism | Robustness | Quality Impact | Detection Complexity |
|---|---|---|---|---|
| Token-level (Kirchenbauer et al.) | Green/red list partitioning per token position | Moderate -- vulnerable to paraphrasing | Low at moderate bias (delta ~1-2) | O(n) with secret key |
| Distribution-shift | Subtle bias across full output distribution | Higher -- survives light editing | Very low | Requires statistical tests |
| Semantic watermarking | Embeds signal in meaning rather than specific tokens | Highest -- survives paraphrasing | Moderate | Requires embedding model |
| Multi-bit | Encodes payload (user ID, timestamp) in watermark | Varies | Moderate -- more bits = more distortion | Requires key + decoder |
| Unbiased (Christ et al.) | Perfectly preserves output distribution | High theoretical guarantees | None (provably unbiased) | Requires full key |
How Token-Level Watermarking Works
The most widely deployed scheme (based on Kirchenbauer et al., 2023) operates during inference:
```python
import hashlib
import numpy as np

class WatermarkLogitsProcessor:
    """Simplified watermark injection during LLM inference."""

    def __init__(self, secret_key: bytes, gamma: float = 0.5, delta: float = 2.0):
        self.secret_key = secret_key
        self.gamma = gamma  # fraction of vocab in the green list
        self.delta = delta  # logit bias added to green-list tokens

    def get_green_list(self, prev_token_id: int, vocab_size: int) -> set:
        """Deterministically partition the vocabulary into green/red lists."""
        seed = hashlib.sha256(
            self.secret_key + prev_token_id.to_bytes(4, "big")
        ).digest()
        rng = np.random.RandomState(int.from_bytes(seed[:4], "big"))
        green_count = int(vocab_size * self.gamma)
        return set(rng.choice(vocab_size, green_count, replace=False))

    def __call__(self, prev_token_id: int, logits: np.ndarray) -> np.ndarray:
        """Add the delta bias to green-list token logits."""
        green_list = self.get_green_list(prev_token_id, len(logits))
        modified = logits.copy()
        modified[list(green_list)] += self.delta
        return modified
```

Detection Algorithm
```python
import numpy as np
from scipy import stats

def detect_watermark(
    text_token_ids: list[int],
    secret_key: bytes,
    vocab_size: int,
    gamma: float = 0.5,
) -> dict:
    """Test whether text carries the statistical watermark for `secret_key`."""
    green_count = 0
    total = len(text_token_ids) - 1  # skip first token (no predecessor)
    for i in range(1, len(text_token_ids)):
        prev_id = text_token_ids[i - 1]
        # Same partition the generator used (mirrors WatermarkLogitsProcessor.get_green_list)
        green_list = get_green_list(prev_id, vocab_size, secret_key, gamma)
        if text_token_ids[i] in green_list:
            green_count += 1
    # Under the null hypothesis (no watermark), green fraction ~ gamma
    z_score = (green_count - gamma * total) / np.sqrt(total * gamma * (1 - gamma))
    p_value = 1 - stats.norm.cdf(z_score)
    return {
        "green_fraction": green_count / total,
        "z_score": z_score,
        "p_value": p_value,
        "watermarked": z_score > 4.0,  # conservative threshold
    }
```

AI-Generated Text Detection (Non-Watermark)
Beyond watermarking, statistical detectors attempt to identify AI-generated text without cooperation from the generating model.
Detector Categories
| Detector Type | Approach | Strengths | Weaknesses |
|---|---|---|---|
| Perplexity-based | AI text has lower perplexity under a reference model | Simple, no training needed | High false-positive rate on formulaic human text |
| Trained classifiers | ML model trained on human vs. AI text pairs | Can capture subtle distributional patterns | Brittle to domain shift, new models |
| Zero-shot (DetectGPT) | Perturbation-based: AI text occupies local curvature maxima in log-probability space | No training data needed | Computationally expensive, less reliable on short text |
| Stylometric | Analyzes writing style features (sentence length variance, vocabulary richness) | Model-agnostic | Easily defeated by style-transfer |
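The perplexity-based row reduces to a threshold test on the reference model's average negative log-likelihood. A minimal sketch, assuming per-token log-probabilities have already been extracted from a reference model; the input lists and the threshold value below are illustrative, not calibrated:

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp(-mean log-likelihood) over the token sequence."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def classify_by_perplexity(token_log_probs: list[float], threshold: float = 20.0) -> bool:
    """Flag text as likely AI-generated when perplexity under the reference
    model falls below the threshold. In practice the threshold must be
    calibrated on held-out human text from the target domain."""
    return perplexity(token_log_probs) < threshold

# AI-like text: the reference model assigns high probability to each token.
ai_like = [-1.0, -0.8, -1.2, -0.9]      # low perplexity
human_like = [-4.5, -1.0, -6.2, -3.8]   # higher, more variable perplexity
```

This is also why formulaic human text (boilerplate, legal prose) triggers false positives: it too has low perplexity under the reference model.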
DetectGPT Mechanism
```python
import numpy as np

def detect_gpt_score(
    text: str,
    scoring_model,
    mask_model,
    n_perturbations: int = 100,
) -> float:
    """
    DetectGPT: AI text sits at local maxima of log-probability.
    Perturbations should decrease log-prob for AI text,
    but have a roughly random effect on human text.
    """
    original_ll = scoring_model.log_likelihood(text)
    perturbation_lls = []
    for _ in range(n_perturbations):
        # Mask random spans and fill with mask_model (e.g., T5)
        perturbed = mask_and_fill(text, mask_model)
        perturbation_lls.append(scoring_model.log_likelihood(perturbed))
    mean_perturbed_ll = np.mean(perturbation_lls)
    std_perturbed_ll = np.std(perturbation_lls)
    # Normalized perturbation discrepancy: high score -> likely AI-generated
    return (original_ll - mean_perturbed_ll) / std_perturbed_ll
```

Red-Team Evasion Techniques
Watermark Removal
Paraphrase Attack

The most effective general-purpose attack. A second (unwatermarked) model paraphrases the watermarked output, destroying the token-level statistical signal while preserving meaning.
```python
def paraphrase_attack(watermarked_text: str, paraphraser) -> str:
    """Use a separate model to remove the watermark via paraphrasing."""
    prompt = f"""Rewrite the following text to convey the same meaning
using different words and sentence structures. Preserve all
factual content but change the phrasing completely.

Text: {watermarked_text}

Rewritten:"""
    return paraphraser.generate(prompt)
```

Effectiveness: removes token-level watermarks with >95% success. Semantic watermarks are more resistant.
Synonym Substitution

Systematically replace tokens with synonyms at a rate sufficient to break the green-list bias without changing meaning.
```python
import random

def synonym_substitution_attack(
    text: str,
    substitution_rate: float = 0.25,
    synonym_dict: dict = None,
) -> str:
    """Replace ~25% of content words with synonyms."""
    if synonym_dict is None:  # no synonym source -> nothing to substitute
        return text
    words = text.split()
    for i, word in enumerate(words):
        if random.random() < substitution_rate and word.lower() in synonym_dict:
            words[i] = random.choice(synonym_dict[word.lower()])
    return " ".join(words)
```

Unicode Injection

Insert zero-width characters or homoglyphs between tokens to break the predecessor-based green-list computation.
```python
import random

ZERO_WIDTH_CHARS = ["\u200b", "\u200c", "\u200d", "\ufeff"]

def unicode_injection_attack(text: str, injection_rate: float = 0.15) -> str:
    """Insert zero-width characters to disrupt the token-chain watermark."""
    chars = list(text)
    positions = random.sample(range(len(chars)), int(len(chars) * injection_rate))
    for pos in sorted(positions, reverse=True):  # insert back-to-front so indices stay valid
        chars.insert(pos, random.choice(ZERO_WIDTH_CHARS))
    return "".join(chars)
```

Cross-Model Transfer

Use the watermarked text as context for a non-watermarked model to regenerate equivalent content.
```python
def cross_model_transfer(watermarked_text: str, clean_model) -> str:
    """Transfer content through a non-watermarked model."""
    prompt = f"""Based on the following information, write a comprehensive
response covering the same topics and conclusions:

{watermarked_text}"""
    return clean_model.generate(prompt)
```

Detector Evasion
| Evasion Technique | Target Detector | Mechanism |
|---|---|---|
| Recursive paraphrasing | All statistical | Multiple rounds of rewriting compound noise |
| Adversarial perturbation | Trained classifiers | Gradient-based token substitution to minimize classifier confidence |
| Human-AI co-writing | Perplexity-based | Mix human and AI text to raise perplexity variance |
| Style transfer | Stylometric | Rewrite in a specific human author's style |
| Back-translation | Token-level watermark | Translate to another language and back |
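Recursive paraphrasing from the table above is just repeated application of a rewriter; each round compounds the lexical noise and further erodes any token-level statistical signal. A minimal sketch -- `paraphraser` stands in for any str -> str rewriting callable (e.g., an unwatermarked model behind a rewrite prompt) and is an assumption, not a fixed API:

```python
def recursive_paraphrase(text: str, paraphraser, rounds: int = 3) -> str:
    """Apply several rounds of paraphrasing. Detection degrades with each
    round; semantic drift also accumulates, so verify meaning afterwards."""
    for _ in range(rounds):
        text = paraphraser(text)
    return text
```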
Robustness Evaluation Framework
When evaluating watermarking or detection systems during a red-team engagement, use this systematic approach:
Baseline Measurement
Generate 100+ watermarked samples and verify the detection rate. Establish the z-score distribution for confirmed watermarked text versus clean text.
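A minimal harness for this step, assuming detector z-scores have already been collected for known-watermarked and known-clean samples (the function name and the 4.0 threshold are illustrative):

```python
import statistics

def baseline_report(z_scores_watermarked: list[float],
                    z_scores_clean: list[float],
                    threshold: float = 4.0) -> dict:
    """Summarize detector behaviour on labeled samples: detection rate on
    watermarked text, false-positive rate on clean text, and the mean
    z-score of each population."""
    detection_rate = sum(z > threshold for z in z_scores_watermarked) / len(z_scores_watermarked)
    false_positive_rate = sum(z > threshold for z in z_scores_clean) / len(z_scores_clean)
    return {
        "detection_rate": detection_rate,
        "false_positive_rate": false_positive_rate,
        "watermarked_z_mean": statistics.mean(z_scores_watermarked),
        "clean_z_mean": statistics.mean(z_scores_clean),
    }
```

The gap between the two z-score means is the budget an evasion technique has to erase.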
Minimal Perturbation Testing
Apply increasing levels of perturbation (5%, 10%, 25%, 50% token replacement) and measure detection degradation. Find the threshold where detection fails.
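One way to sketch this sweep, assuming a `detect_fn` callable that maps a token sequence to a detection score (hypothetical; substitute the real detector from above):

```python
import random

def perturbation_sweep(token_ids: list[int], detect_fn,
                       rates=(0.05, 0.10, 0.25, 0.50),
                       vocab_size: int = 50000, seed: int = 0) -> dict:
    """Replace an increasing fraction of tokens with random vocabulary IDs
    and record the detector's score at each perturbation level."""
    rng = random.Random(seed)
    results = {}
    for rate in rates:
        perturbed = list(token_ids)
        n = int(len(perturbed) * rate)
        for i in rng.sample(range(len(perturbed)), n):
            perturbed[i] = rng.randrange(vocab_size)
        results[rate] = detect_fn(perturbed)
    return results
```

Plotting score against rate exposes the cliff where detection fails, which is the number to report.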
Semantic Preservation Verification
For each successful evasion technique, verify that the semantic content is preserved using embedding similarity (cosine similarity > 0.90 between the original and evaded text).
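The similarity check itself is a one-liner once embeddings exist; the sketch below assumes the vectors have already been computed by some embedding model, and the 0.90 threshold follows the text:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantically_preserved(emb_original: list[float],
                           emb_evaded: list[float],
                           threshold: float = 0.90) -> bool:
    """Accept an evasion only if meaning survived the transformation."""
    return cosine_similarity(emb_original, emb_evaded) > threshold
```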
False Positive Analysis
Run the detector on human-written text from the target domain. Report false positive rates by demographic and writing style, especially for non-native English speakers.
Adaptive Attacks
If the watermark scheme uses a known architecture, design a targeted attack that exploits its specific weakness (e.g., known green-list structure, fixed window size).
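As a schematic illustration of the key-compromise case: with full knowledge of the green-list function, an attacker can shift every token into its predecessor's green list, inflating the detector's z-score to forge a watermark (or, symmetrically, shift tokens out to strip it). Fluency and semantics are deliberately ignored here; `green_list_fn` is a stand-in for the recovered partition function, not a real API:

```python
def forge_watermark(token_ids: list[int], green_list_fn) -> list[int]:
    """Key-compromise forgery sketch: force each token (after the first)
    into the green list seeded by its predecessor. A real attack would
    choose a green token that also preserves fluency."""
    forged = list(token_ids)
    for i in range(1, len(forged)):
        green = green_list_fn(forged[i - 1])
        if forged[i] not in green:
            forged[i] = min(green)  # schematic: any green token works for the detector
    return forged
```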
Fundamental Limitations
| Limitation | Implication for Red Teams |
|---|---|
| Quality-robustness trade-off | Aggressive watermarks degrade output -- look for quality differences as a detection signal |
| Key management | Compromised keys allow forgery of watermarks or targeted removal |
| Short text unreliability | Watermarks require ~200+ tokens for reliable detection |
| Multi-model pipelines | Text passing through multiple models loses watermark provenance |
| Fine-tuning removal | Fine-tuning a watermarked model can remove the watermarking behavior |
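The short-text limitation follows directly from the detector's z-test: with expected green fraction p under the watermark, z = (p - gamma) * sqrt(n) / sqrt(gamma * (1 - gamma)), so solving for n gives the token count needed to reach a detection threshold. A small sketch (parameter values illustrative; a moderate bias yielding p around 0.6 needs on the order of a few hundred tokens, consistent with the table):

```python
import math

def min_tokens_for_detection(expected_green_frac: float,
                             gamma: float = 0.5,
                             z_threshold: float = 4.0) -> int:
    """Invert z = (p - gamma) * sqrt(n) / sqrt(gamma * (1 - gamma)):
    n = z^2 * gamma * (1 - gamma) / (p - gamma)^2, rounded up."""
    effect = expected_green_frac - gamma
    n = (z_threshold ** 2) * gamma * (1 - gamma) / (effect ** 2)
    return math.ceil(n)
```

Stronger bias (higher p) shrinks the required length quadratically, which is exactly the quality-robustness trade-off in the first row.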
For related defensive techniques, see Input/Output Filtering Systems and Guardrail Architecture.
Related Topics
- Advanced Defense Techniques - Other cutting-edge defense research including instruction hierarchy and representation engineering
- Input/Output Filtering Systems - Complementary output scanning approaches
- Guardrail & Safety Layer Architecture - Where watermarking fits in the defense pipeline
- Inference: Sampling, Temperature & Generation - Decoding mechanics that watermarking modifies
References
- "A Watermark for Large Language Models" - Kirchenbauer et al. (2023) - The foundational paper introducing token-level green/red list watermarking for LLMs
- "Unbiased Watermark for Large Language Models" - Christ et al. (2024) - Research on provably unbiased watermarking that preserves output distribution quality
- "DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature" - Mitchell et al. (2023) - The perturbation-based AI text detection method using log-probability analysis
- "On the Reliability of Watermarks for Large Language Models" - Kirchenbauer et al. (2023) - Critical analysis of watermark robustness against paraphrase and editing attacks
- "Can AI-Generated Text be Reliably Detected?" - Sadasivan et al. (2023) - Research on fundamental limitations of AI text detection including false positive rates
What is the most effective general-purpose attack against token-level watermarking?