Watermarking & AI-Generated Text Detection
Statistical watermarking schemes for LLM outputs, AI-generated text detectors, their cryptographic foundations, and systematic techniques for evading or removing watermarks.
Watermarking and AI-generated text detection represent the defender's attempt to maintain provenance over model outputs. For red teamers, understanding these mechanisms is essential -- both to assess their robustness and to evaluate whether downstream security assumptions built on watermark presence are sound.
Watermarking Scheme Taxonomy
| Scheme Type | Mechanism | Robustness | Quality Impact | Detection Complexity |
|---|---|---|---|---|
| Token-level (Kirchenbauer et al.) | Green/red list partitioning per token position | Moderate -- vulnerable to paraphrasing | Low at moderate bias (delta ~1-2) | O(n) with secret key |
| Distribution-shift | Subtle bias across full output distribution | Higher -- survives light editing | Very low | Requires statistical test |
| Semantic watermarking | Embeds signal in meaning rather than specific tokens | Highest -- survives paraphrasing | Moderate | Requires embedding model |
| Multi-bit | Encodes payload (user ID, timestamp) in watermark | Varies | Moderate -- more bits = more distortion | Requires key + decoder |
| Unbiased (Christ et al.) | Perfectly preserves output distribution | High theoretical guarantees | None (provably unbiased) | Requires full key |
How Token-Level Watermarking Works
The most widely deployed scheme (based on Kirchenbauer et al., 2023) operates during inference:
```python
import hashlib
import numpy as np

class WatermarkLogitsProcessor:
    """Simplified watermark injection during LLM inference."""

    def __init__(self, secret_key: bytes, gamma: float = 0.5, delta: float = 2.0):
        self.secret_key = secret_key
        self.gamma = gamma  # fraction of vocab in green list
        self.delta = delta  # logit bias added to green-list tokens

    def get_green_list(self, prev_token_id: int, vocab_size: int) -> set:
        """Deterministically partition vocab into green/red lists."""
        seed = hashlib.sha256(
            self.secret_key + prev_token_id.to_bytes(4, "big")
        ).digest()
        rng = np.random.RandomState(int.from_bytes(seed[:4], "big"))
        green_count = int(vocab_size * self.gamma)
        return set(rng.choice(vocab_size, green_count, replace=False))

    def __call__(self, prev_token_id: int, logits: np.ndarray) -> np.ndarray:
        """Add delta bias to green-list token logits."""
        green_list = self.get_green_list(prev_token_id, len(logits))
        modified = logits.copy()
        modified[list(green_list)] += self.delta
        return modified
```
Detection Algorithm
```python
import numpy as np
from scipy import stats

def detect_watermark(
    text_token_ids: list[int],
    secret_key: bytes,
    vocab_size: int,
    gamma: float = 0.5,
) -> dict:
    """Test whether text contains a statistical watermark."""
    processor = WatermarkLogitsProcessor(secret_key, gamma=gamma)
    green_count = 0
    total = len(text_token_ids) - 1  # skip first token (no predecessor)
    for i in range(1, len(text_token_ids)):
        prev_id = text_token_ids[i - 1]
        green_list = processor.get_green_list(prev_id, vocab_size)
        if text_token_ids[i] in green_list:
            green_count += 1
    # Under the null hypothesis (no watermark), green fraction ~ gamma
    z_score = (green_count - gamma * total) / np.sqrt(total * gamma * (1 - gamma))
    p_value = stats.norm.sf(z_score)  # upper-tail probability
    return {
        "green_fraction": green_count / total,
        "z_score": z_score,
        "p_value": p_value,
        "watermarked": z_score > 4.0,  # conservative threshold
    }
```
AI-Generated Text Detection (Non-Watermark)
Beyond watermarking, statistical detectors attempt to identify AI-generated text without cooperation from the generating model.
Detector Categories
| Detector Type | Approach | Strengths | Weaknesses |
|---|---|---|---|
| Perplexity-based | AI text has lower perplexity under a reference model | Simple, no training needed | High false-positive rate on formulaic human text |
| Trained classifiers | ML model trained on human vs. AI text pairs | Can capture subtle distributional patterns | Brittle to domain shift, new models |
| Zero-shot (DetectGPT) | Perturbation-based: AI text occupies local curvature maxima in log-probability space | No training data needed | Computationally expensive, less reliable on short text |
| Stylometric | Analyzes writing style features (sentence length variance, vocabulary richness) | Model-agnostic | Easily defeated by style-transfer |
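The perplexity-based approach in the table reduces to a few lines once a reference model supplies per-token log-probabilities. The scoring interface and the threshold value below are illustrative assumptions, not a specific detector's API:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-likelihood) over the token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_as_ai(token_logprobs: list[float], threshold: float = 15.0) -> bool:
    """Low perplexity under the reference model -> flag as likely AI-generated.
    The threshold is illustrative; real deployments calibrate it per domain."""
    return perplexity(token_logprobs) < threshold
```

The table's weakness column follows directly from this design: formulaic human text (boilerplate, legal prose) also scores low perplexity, which drives the false-positive rate.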
DetectGPT Mechanism
```python
import numpy as np

def detect_gpt_score(
    text: str,
    scoring_model,
    mask_model,
    n_perturbations: int = 100,
) -> float:
    """
    DetectGPT: AI text sits at local maxima of log-probability.
    Perturbations should decrease log-prob for AI text,
    but have a roughly random effect on human text.
    """
    original_ll = scoring_model.log_likelihood(text)
    perturbation_lls = []
    for _ in range(n_perturbations):
        # Mask random spans and fill with mask_model (e.g., T5)
        perturbed = mask_and_fill(text, mask_model)
        perturbation_lls.append(scoring_model.log_likelihood(perturbed))
    mean_perturbed_ll = np.mean(perturbation_lls)
    std_perturbed_ll = np.std(perturbation_lls)
    # Normalized perturbation discrepancy
    score = (original_ll - mean_perturbed_ll) / std_perturbed_ll
    return score  # high score -> likely AI-generated
```
Red Team Evasion Techniques
Watermark Removal
Paraphrase Attack

The most effective general-purpose attack: a second (unwatermarked) model paraphrases the watermarked output, destroying the token-level statistical signal while preserving meaning.

```python
def paraphrase_attack(watermarked_text: str, paraphraser) -> str:
    """Use a separate model to remove the watermark via paraphrasing."""
    prompt = f"""Rewrite the following text to convey the same meaning
using different words and sentence structures. Preserve all
factual content but change the phrasing completely.

Text: {watermarked_text}

Rewritten:"""
    return paraphraser.generate(prompt)
```

Effectiveness: removes token-level watermarks with >95% success. Semantic watermarks are more resistant.
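Single-pass paraphrasing sometimes leaves a residual signal; running multiple rounds compounds the statistical noise. A minimal wrapper, assuming the paraphraser is exposed as any text-to-text callable (the interface is an assumption, not a specific library):

```python
def recursive_paraphrase(text: str, paraphrase_fn, rounds: int = 3) -> str:
    """Apply a paraphraser repeatedly; each round further erodes the
    token-level watermark's green-list statistics."""
    for _ in range(rounds):
        text = paraphrase_fn(text)
    return text
```

The trade-off is semantic drift: more rounds remove more signal but risk failing the embedding-similarity check used to verify meaning preservation.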
Synonym Substitution

Systematically replace tokens with synonyms at a rate sufficient to break the green-list bias without changing meaning.

```python
import random

def synonym_substitution_attack(
    text: str,
    synonym_dict: dict[str, list[str]],
    substitution_rate: float = 0.25,
) -> str:
    """Replace ~25% of content words with synonyms."""
    words = text.split()
    for i, word in enumerate(words):
        if random.random() < substitution_rate and word.lower() in synonym_dict:
            words[i] = random.choice(synonym_dict[word.lower()])
    return " ".join(words)
```

Unicode Injection

Insert zero-width characters or homoglyphs between tokens to break the predecessor-based green-list computation.
```python
import random

ZERO_WIDTH_CHARS = ["\u200b", "\u200c", "\u200d", "\ufeff"]

def unicode_injection_attack(text: str, injection_rate: float = 0.15) -> str:
    """Insert zero-width characters to disrupt the token-chain watermark."""
    chars = list(text)
    positions = random.sample(range(len(chars)), int(len(chars) * injection_rate))
    # Insert from the end so earlier positions remain valid
    for pos in sorted(positions, reverse=True):
        chars.insert(pos, random.choice(ZERO_WIDTH_CHARS))
    return "".join(chars)
```

Cross-Model Transfer

Use the watermarked text as context for a non-watermarked model to regenerate equivalent content.
```python
def cross_model_transfer(watermarked_text: str, clean_model) -> str:
    """Transfer content through a non-watermarked model."""
    prompt = f"""Based on the following information, write a comprehensive
response covering the same topics and conclusions:

{watermarked_text}"""
    return clean_model.generate(prompt)
```
Detector Evasion
| Evasion Technique | Target Detector | Mechanism |
|---|---|---|
| Recursive paraphrasing | All statistical | Multiple rounds of rewriting compound noise |
| Adversarial perturbation | Trained classifiers | Gradient-based token substitution to minimize classifier confidence |
| Human-AI co-writing | Perplexity-based | Mix human and AI text to raise perplexity variance |
| Style transfer | Stylometric | Rewrite in a specific human author's style |
| Back-translation | Token-level watermark | Translate to another language and back |
Robustness Assessment Framework
When evaluating watermarking or detection systems during a red team engagement, use this systematic approach:
Baseline Measurement
Generate 100+ watermarked samples and verify detection rate. Establish the z-score distribution for confirmed watermarked text versus clean text.
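The conventional z > 4 flagging threshold implies a concrete false-positive budget for this baseline: under the null hypothesis the z-score is approximately standard normal, so the expected false-positive rate is the upper tail beyond the threshold. A quick sketch (numerically equivalent to `scipy.stats.norm.sf`):

```python
import math

def expected_fpr(z_threshold: float = 4.0) -> float:
    """Upper-tail probability of a standard normal beyond z_threshold:
    the false-positive rate implied by flagging text at z > threshold."""
    return 0.5 * math.erfc(z_threshold / math.sqrt(2))
```

At z > 4 this is about 3.2e-5, i.e. roughly one false flag per 30,000 clean documents; comparing the measured clean-text flag rate against this figure reveals whether the detector's null model actually holds in the target domain.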
Minimal Perturbation Testing
Apply increasing levels of perturbation (5%, 10%, 25%, 50% token replacement) and measure detection degradation. Find the threshold where detection fails.
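The sweep can be prototyped without a live model by simulating the green-token statistics directly: watermarked text is modeled as tokens that are green with some elevated probability, and each substituted token reverts to the baseline rate gamma. All parameters here (500 tokens, a 0.75 green fraction) are illustrative assumptions, not measurements:

```python
import numpy as np

def z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """Same statistic as the detector: deviation of the green-token
    count from its expectation under the no-watermark null."""
    return (green_count - gamma * total) / np.sqrt(total * gamma * (1 - gamma))

def perturbation_sweep(n_tokens: int = 500, gamma: float = 0.5,
                       green_frac: float = 0.75, seed: int = 0,
                       rates=(0.0, 0.05, 0.10, 0.25, 0.50)) -> dict:
    """Simulate token replacement: each replaced token is green only
    with probability gamma, dragging the statistic back toward 0."""
    rng = np.random.default_rng(seed)
    results = {}
    for r in rates:
        is_green = rng.random(n_tokens) < green_frac  # watermarked stream
        replaced = rng.random(n_tokens) < r           # attacker edits
        is_green[replaced] = rng.random(int(replaced.sum())) < gamma
        results[r] = float(z_score(int(is_green.sum()), n_tokens, gamma))
    return results
```

One lesson this simulation makes visible: at 500 tokens even 50% substitution can leave the z-score above threshold, so the failure point depends jointly on text length and perturbation rate, not on the rate alone.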
Semantic Preservation Verification
For each successful evasion technique, verify that the semantic content is preserved using embedding similarity (cosine similarity > 0.90 on the original vs. evaded text).
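The check itself is a few lines of NumPy; `orig_emb` and `evaded_emb` stand for vectors from whatever sentence-embedding model the engagement uses (the embedding model is assumed, not specified):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantics_preserved(orig_emb: np.ndarray, evaded_emb: np.ndarray,
                        threshold: float = 0.90) -> bool:
    """Accept an evasion only if the evaded text stays semantically
    close to the original (cosine similarity above the threshold)."""
    return cosine_similarity(orig_emb, evaded_emb) >= threshold
```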
False Positive Analysis
Run the detector on human-written text from the target domain. Report false positive rates by demographic and writing style, especially for non-native English speakers.
Adaptive Attack
If the watermark scheme uses a known architecture, design a targeted attack that exploits its specific weakness (e.g., known green-list structure, fixed window size).
Fundamental Limitations
| Limitation | Implication for Red Teams |
|---|---|
| Quality-robustness trade-off | Aggressive watermarks degrade output -- look for quality differences as detection signal |
| Key management | Compromised keys allow forgery of watermarks or targeted removal |
| Short text unreliability | Watermarks require ~200+ tokens for reliable detection |
| Multi-model pipelines | Text passing through multiple models loses watermark provenance |
| Fine-tuning removal | Fine-tuning a watermarked model can remove the watermarking behavior |
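The short-text limitation follows directly from the z-test the detector runs: with green-list fraction gamma and an observed green rate p in watermarked text, the token count needed to reach a detection threshold z has a closed form. The example green fraction of 0.65 below is a plausible value at moderate delta, not a measured one:

```python
import math

def min_tokens_for_detection(p_green: float, gamma: float = 0.5,
                             z_threshold: float = 4.0) -> int:
    """Solve (p - gamma) * sqrt(n) / sqrt(gamma * (1 - gamma)) >= z for n:
    n >= z**2 * gamma * (1 - gamma) / (p - gamma)**2."""
    n = z_threshold**2 * gamma * (1 - gamma) / (p_green - gamma) ** 2
    return math.ceil(n)
```

With gamma = 0.5 and a green fraction of 0.65 this gives 178 tokens, consistent with the ~200-token rule of thumb in the table; as the observed bias shrinks toward gamma, the requirement grows quadratically, which is why partially paraphrased text needs far more tokens to flag.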
For related defensive techniques, see Input/Output Filtering Systems and Guardrails Architecture.
Related Topics
- Advanced Defense Techniques - Other cutting-edge defense research including instruction hierarchy and representation engineering
- Input/Output Filtering Systems - Complementary output scanning approaches
- Guardrails & Safety Layer Architecture - Where watermarking fits in the defense pipeline
- Inference: Sampling, Temperature & Generation - Decoding mechanics that watermarking modifies
References
- "A Watermark for Large Language Models" - Kirchenbauer et al. (2023) - The foundational paper introducing token-level green/red list watermarking for LLMs
- "Unbiased Watermark for Large Language Models" - Christ et al. (2024) - Research on provably unbiased watermarking that preserves output distribution quality
- "DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature" - Mitchell et al. (2023) - The perturbation-based AI text detection method using log-probability analysis
- "On the Reliability of Watermarks for Large Language Models" - Sadasivan et al. (2023) - Critical analysis of watermark robustness against paraphrase and editing attacks
- "Can AI-Generated Text be Reliably Detected?" - Sadasivan et al. (2023) - Research on fundamental limitations of AI text detection including false positive rates