LLM Output Watermark Detection
Techniques for detecting, extracting, and analyzing watermarks embedded in LLM-generated text for provenance tracking and forensic attribution.
Overview
LLM output watermarking is the practice of embedding statistically detectable patterns in text generated by large language models. These patterns are imperceptible to human readers but can be identified by algorithms that know the watermarking scheme. From a forensic perspective, watermark detection enables investigators to determine whether text was AI-generated, which model or provider produced it, and in some cases, which user session generated it.
The primary watermarking approach for LLMs, introduced by Kirchenbauer et al. (2023), works by partitioning the vocabulary into "green" and "red" lists at each token position, using the previous token as a hash seed. During generation, the model's sampling is biased toward green-list tokens. The resulting text contains a statistically improbable excess of green-list tokens that can be detected with a simple hypothesis test.
This article covers the forensic application of watermark detection: how investigators can analyze text to determine its provenance, assess the strength of watermark evidence, and account for watermark removal attempts.
How LLM Watermarking Works
Green-List/Red-List Scheme
The Kirchenbauer et al. scheme operates during the token sampling phase of text generation. At each generation step:
- A cryptographic hash function takes the previous token (or a window of previous tokens) and a secret key as input
- The hash output deterministically partitions the full vocabulary into a "green list" (favored tokens) and a "red list" (disfavored tokens)
- A bias term delta is added to the logits of all green-list tokens before sampling
- The model is more likely to select green-list tokens, but can still select red-list tokens when they are strongly preferred by the model's distribution
"""
LLM watermark detection implementation for forensic analysis.

Based on the green-list scheme from Kirchenbauer et al. (2023).
This module implements the detection side -- given text and the
watermarking key, determine whether the text was watermarked.
"""
import hashlib
from collections.abc import Sequence
import numpy as np
from scipy import stats
def compute_green_list(
previous_token_id: int,
vocab_size: int,
secret_key: bytes,
gamma: float = 0.5,
) -> set[int]:
"""
Compute the green list for a given context 符元.
Uses HMAC-SHA256 to deterministically partition the vocabulary
into green and red lists based on the previous 符元 and secret key.
Args:
previous_token_id: The 符元 ID preceding the current position.
vocab_size: Total vocabulary size.
secret_key: The watermarking secret key.
gamma: Fraction of vocabulary in the green list (default 0.5).
Returns:
Set of 符元 IDs in the green list.
"""
    # Create deterministic seed from previous token and secret key
seed_material = previous_token_id.to_bytes(4, 'big') + secret_key
hash_bytes = hashlib.sha256(seed_material).digest()
seed = int.from_bytes(hash_bytes[:8], 'big')
rng = np.random.Generator(np.random.PCG64(seed))
permutation = rng.permutation(vocab_size)
green_list_size = int(gamma * vocab_size)
return set(permutation[:green_list_size].tolist())
def detect_watermark(
token_ids: Sequence[int],
vocab_size: int,
secret_key: bytes,
gamma: float = 0.5,
) -> dict:
"""
測試 whether a sequence of 符元 IDs contains a watermark.
Performs a one-proportion z-測試 against the null hypothesis
that 符元 are drawn without green-list bias.
Under the null (no watermark), the expected fraction of green-list
符元 is gamma. Under the alternative (watermarked), the fraction
should be significantly higher.
Args:
token_ids: Sequence of 符元 IDs to 測試.
vocab_size: Total vocabulary size.
secret_key: The watermarking secret key.
gamma: Green list fraction used during watermarking.
Returns:
Dict with 偵測 results including z-score and p-value.
"""
if len(token_ids) < 2:
return {"error": "Need at least 2 符元 for 偵測"}
green_count = 0
total_tested = 0
for i in range(1, len(token_ids)):
prev_token = token_ids[i - 1]
current_token = token_ids[i]
green_list = compute_green_list(prev_token, vocab_size, secret_key, gamma)
if current_token in green_list:
green_count += 1
total_tested += 1
observed_fraction = green_count / total_tested
expected_fraction = gamma
    # One-proportion z-test
    # H0: p = gamma (no watermark)
    # H1: p > gamma (watermark present)
    se = np.sqrt(gamma * (1 - gamma) / total_tested)
    z_score = (observed_fraction - expected_fraction) / se
    # Survival function avoids the underflow of 1 - cdf(z) at large z
    p_value = stats.norm.sf(z_score)
return {
"total_tokens_tested": total_tested,
"green_list_count": green_count,
"observed_green_fraction": round(observed_fraction, 4),
"expected_green_fraction": gamma,
"z_score": round(float(z_score), 4),
"p_value": float(p_value),
"watermark_detected": p_value < 1e-5,
"confidence": "high" if p_value < 1e-10 else "medium" if p_value < 1e-5 else "low",
    }

Distortion-Free Watermarking
An alternative approach embeds watermarks through the sampling process itself rather than by modifying logits. The technique by Aaronson & Kirchner (described in Aaronson's 2023 blog post) uses a pseudorandom function seeded by previous tokens to generate uniform random numbers, then applies inverse transform sampling. The watermark is detected by checking whether the generated tokens correlate with the pseudorandom sequence in a way that would be astronomically unlikely by chance.
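A minimal sketch of this family of schemes, using the exponential-minimum (Gumbel-style) variant commonly used to describe Aaronson's approach. The PRF construction and the detection statistic below are illustrative assumptions, not the production scheme:

```python
import hashlib
import math
import numpy as np

def prf_uniforms(prev_token_id: int, vocab_size: int, key: bytes) -> np.ndarray:
    """Pseudorandom uniforms in [0, 1), one per vocabulary token, seeded by context."""
    seed = int.from_bytes(
        hashlib.sha256(prev_token_id.to_bytes(4, "big") + key).digest()[:8], "big"
    )
    rng = np.random.Generator(np.random.PCG64(seed))
    return rng.random(vocab_size)

def exp_min_sample(probs: np.ndarray, prev_token_id: int, key: bytes) -> int:
    """Distortion-free sampling: pick argmax r_v ** (1 / p_v), computed in
    log space as argmin -log(r_v) / p_v for numerical stability."""
    r = prf_uniforms(prev_token_id, probs.shape[0], key)
    with np.errstate(divide="ignore"):
        scores = -np.log(r) / np.maximum(probs, 1e-12)
    return int(np.argmin(scores))

def detect_exp_min(token_ids, vocab_size: int, key: bytes) -> float:
    """Detection statistic: sum of -log(1 - r) over observed tokens.
    Under the null each term is Exp(1), so the sum has mean ~n; watermarked
    text selects tokens whose r is near 1, inflating the sum."""
    total = 0.0
    for prev, cur in zip(token_ids, token_ids[1:]):
        r = prf_uniforms(prev, vocab_size, key)[cur]
        total += -math.log(max(1.0 - r, 1e-12))
    return total
```

Note that with a fixed context window this sampler is deterministic given the previous token, a known property of the scheme; real deployments vary the seeding context.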
Multi-Bit Watermarking
While the basic green-list scheme embeds a single-bit watermark (present or absent), more advanced schemes embed multi-bit payloads that can encode information such as:
- Provider identification (which API generated this text)
- User or session identification
- Timestamp of generation
- Model version information
Multi-bit schemes partition the vocabulary into more than two groups, with each group corresponding to a different bit value. The forensic value of multi-bit watermarks is substantially higher because they enable direct attribution.
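A toy multi-bit encoder/extractor can illustrate the idea. The bit-assignment rule (position index modulo payload length) and the key derivation are assumptions made for this sketch, not a published scheme:

```python
import hashlib
import numpy as np

def bit_green_list(prev_token: int, bit_index: int, bit_value: int,
                   vocab_size: int, key: bytes, gamma: float = 0.5) -> set[int]:
    """Green list keyed by (context token, payload bit position, bit value)."""
    material = prev_token.to_bytes(4, "big") + bytes([bit_index, bit_value]) + key
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    rng = np.random.Generator(np.random.PCG64(seed))
    return set(rng.permutation(vocab_size)[: int(gamma * vocab_size)].tolist())

def extract_payload(token_ids, payload_len: int, vocab_size: int,
                    key: bytes, gamma: float = 0.5) -> list[int]:
    """Recover each payload bit by majority vote: positions assigned to a bit
    index show excess membership in the green list of the true bit value."""
    votes = [[0, 0] for _ in range(payload_len)]
    for i in range(1, len(token_ids)):
        bit_index = i % payload_len
        for bit_value in (0, 1):
            g = bit_green_list(token_ids[i - 1], bit_index, bit_value,
                               vocab_size, key, gamma)
            if token_ids[i] in g:
                votes[bit_index][bit_value] += 1
    return [0 if v[0] >= v[1] else 1 for v in votes]
```

Each extracted bit carries its own statistical confidence, so longer texts are needed per payload bit than for simple presence/absence detection.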
Forensic Detection Methodology
Step 1: Text Preprocessing
Before running watermark detection, the text must be tokenized using the same tokenizer used by the suspected source model. Using the wrong tokenizer will destroy the watermark signal.
from transformers import AutoTokenizer
def prepare_text_for_detection(
text: str,
model_candidates: list[str],
) -> dict[str, list[int]]:
"""
Tokenize text with multiple candidate model tokenizers.
Since the watermark is embedded at the 符元 level, we must
測試 with each candidate model's 分詞器. The correct
分詞器 will produce a stronger watermark signal.
"""
results = {}
for model_name in model_candidates:
分詞器 = AutoTokenizer.from_pretrained(model_name)
token_ids = 分詞器.encode(text, add_special_tokens=False)
results[model_name] = token_ids
return results
def multi_model_watermark_scan(
text: str,
model_candidates: list[str],
candidate_keys: dict[str, bytes],
gamma: float = 0.5,
) -> list[dict]:
"""
Scan text for watermarks from multiple candidate models/providers.
Tests each combination of 分詞器 and watermarking key to
determine which (if any) produced the text.
"""
tokenizations = prepare_text_for_detection(text, model_candidates)
results = []
for model_name, token_ids in tokenizations.items():
if model_name not in candidate_keys:
continue
分詞器 = AutoTokenizer.from_pretrained(model_name)
vocab_size = 分詞器.vocab_size
偵測 = detect_watermark(
token_ids=token_ids,
vocab_size=vocab_size,
secret_key=candidate_keys[model_name],
gamma=gamma,
)
偵測["model"] = model_name
results.append(偵測)
results.sort(key=lambda r: r.get("z_score", 0), reverse=True)
return resultsStep 2: Statistical Significance 評估
The z-test provides a p-value that quantifies the probability of observing the green-list token fraction under the null hypothesis of no watermarking. In forensic contexts, the significance threshold should be set conservatively:
| Confidence Level | p-value Threshold | z-score (approx.) | Forensic Use |
|---|---|---|---|
| Preliminary | < 0.01 | > 2.33 | Sufficient for triage, not for attribution |
| Standard | < 1e-5 | > 4.26 | Sufficient for internal investigation findings |
| High confidence | < 1e-10 | > 6.36 | Sufficient for formal forensic reports |
| Forensic grade | < 1e-20 | > 9.26 | Suitable for legal proceedings |
The required text length to achieve a given confidence level depends on the watermarking strength (delta parameter) and the green list fraction (gamma). As a rough guide:
- 200 tokens: Can detect strong watermarks (z > 4) in favorable conditions
- 500 tokens: Reliable detection for most watermarking configurations
- 1000+ tokens: Near-certain detection with high forensic confidence
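These length figures follow directly from inverting the z-test formula above: T = z² · γ(1 - γ) / (p - γ)², where p is the green fraction the watermark actually produces. A small helper illustrates the calculation; the expected green fraction is an input the investigator must estimate, for example from calibration runs against the suspected model.

```python
import math

def min_tokens_for_z(target_z: float, expected_green_frac: float,
                     gamma: float = 0.5) -> int:
    """Smallest token count T such that
    z = (p - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma))
    reaches target_z, assuming watermarked text yields green fraction p."""
    if expected_green_frac <= gamma:
        raise ValueError("expected green fraction must exceed gamma")
    numer = target_z ** 2 * gamma * (1 - gamma)
    return math.ceil(numer / (expected_green_frac - gamma) ** 2)

# A strong watermark pushing the green fraction to 0.7 (gamma = 0.5):
# min_tokens_for_z(4.26, 0.7)  -> 114 tokens for the "standard" threshold
# min_tokens_for_z(9.26, 0.7)  -> 536 tokens for the "forensic grade" threshold
```

Weaker watermarks need far more text: halving the excess green fraction quadruples the required token count.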
Step 3: Windowed Analysis
If the text under investigation is a mix of human-written and AI-generated content, a sliding window analysis can identify which portions are watermarked.
def windowed_watermark_detection(
token_ids: list[int],
vocab_size: int,
secret_key: bytes,
window_size: int = 100,
step_size: int = 25,
gamma: float = 0.5,
) -> list[dict]:
"""
Apply watermark 偵測 in a sliding window across the text.
This identifies which portions of a document are likely AI-generated
versus human-written, based on local watermark signal strength.
"""
results = []
for start in range(0, len(token_ids) - window_size, step_size):
window = token_ids[start:start + window_size]
偵測 = detect_watermark(window, vocab_size, secret_key, gamma)
偵測["token_start"] = start
偵測["token_end"] = start + window_size
results.append(偵測)
return resultsWatermark Removal 攻擊 and Forensic Countermeasures
Common Removal Techniques
Attackers aware of watermarking may attempt to remove the watermark while preserving the text's content. Common removal techniques include:
- Paraphrasing: Using another LLM to rewrite the text. This replaces most tokens, disrupting the green-list signal.
- Token substitution: Replacing individual words with synonyms using a thesaurus or word embedding similarity.
- Homoglyph substitution: Replacing characters with visually identical Unicode characters from different scripts.
- Translation round-tripping: Translating to another language and back, which completely retokenizes the text.
- Insertion/deletion: Adding or removing words to shift the token alignment.
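As a countermeasure to homoglyph substitution specifically, investigators can normalize the text before tokenizing. NFKC normalization folds Unicode compatibility characters, but cross-script confusables need an explicit mapping; the table below is a hand-picked illustrative subset, not the full Unicode confusables data a production workflow would use.

```python
import unicodedata

# Illustrative cross-script confusables (Cyrillic/Greek look-alikes for Latin
# letters); real workflows should draw on the Unicode confusables.txt data.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
    "\u03bf": "o",  # Greek omicron
}

def normalize_homoglyphs(text: str) -> str:
    """Undo common homoglyph substitutions before tokenization: NFKC folds
    compatibility characters, then the map restores Latin look-alikes."""
    folded = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)
```

Running detection on both the raw and the normalized text, and comparing z-scores, can itself be evidence: a large gap suggests deliberate homoglyph obfuscation.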
Forensic Detection of Removal Attempts
Even when a watermark has been partially removed, forensic evidence of the original watermark may persist. The investigator should look for surviving watermark fragments in the portions of the text that were not modified:
def analyze_removal_artifacts(
token_ids: list[int],
vocab_size: int,
secret_key: bytes,
gamma: float = 0.5,
) -> dict:
"""
Analyze text for signs of watermark removal attempts.
Even after paraphrasing or editing, statistical artifacts of
the original watermark may remain in portions of the text
that were not modified.
"""
    # Full-text detection
    full_result = detect_watermark(token_ids, vocab_size, secret_key, gamma)
    # Windowed detection to find surviving watermark fragments
window_results = windowed_watermark_detection(
token_ids, vocab_size, secret_key,
window_size=50, step_size=10, gamma=gamma,
)
# Count windows with significant watermark signal
significant_windows = [
w for w in window_results if w.get("z_score", 0) > 2.0
]
# Analyze the pattern of surviving vs. removed watermark regions
z_scores = [w.get("z_score", 0) for w in window_results]
z_variance = float(np.var(z_scores)) if z_scores else 0.0
return {
"full_text_detection": full_result,
"total_windows": len(window_results),
"significant_windows": len(significant_windows),
"significant_fraction": (
len(significant_windows) / len(window_results)
if window_results else 0.0
),
"z_score_variance": z_variance,
"removal_likely": (
full_result.get("z_score", 0) < 2.0
and len(significant_windows) > len(window_results) * 0.2
),
"interpretation": _interpret_removal_analysis(
full_result, significant_windows, window_results, z_variance
),
}
def _interpret_removal_analysis(
full_result: dict,
significant_windows: list,
all_windows: list,
z_variance: float,
) -> str:
full_z = full_result.get("z_score", 0)
sig_frac = len(significant_windows) / max(len(all_windows), 1)
if full_z > 4.0:
return "Strong watermark detected -- no significant removal"
if full_z > 2.0 and sig_frac > 0.5:
return "Partial watermark -- possible light editing or paraphrasing of some sections"
if full_z < 2.0 and sig_frac > 0.2:
return "Watermark largely removed but fragments survive -- likely systematic removal attempt"
if z_variance > 4.0:
return "High variance in local watermark strength -- mixed human/AI or selective editing"
    return "No watermark signal detected"

Provider-Specific Watermarking Schemes
Different AI providers have implemented or announced different watermarking approaches. Forensic investigators should be aware of the landscape:
| Provider | Watermarking Status | Scheme Type | Detection Availability |
|---|---|---|---|
| OpenAI | Announced, delayed deployment | Unknown (not publicly detailed) | Not publicly available |
| Google DeepMind | SynthID deployed | Logit-based statistical | Integrated in Google tools |
| Meta | Research published | Green-list variant | Open-source reference implementation |
| Anthropic | No public watermarking announced | N/A | N/A |
| Microsoft | Research published | Multiple schemes studied | Research code available |
Google's SynthID system is the most widely deployed production watermarking scheme as of early 2026. It applies to both text and image outputs from Google's AI models and uses a tournament-based detection scheme that is more robust to text modifications than the basic green-list approach.
Practical Considerations
Text Length Requirements
Watermark detection is a statistical test whose power increases with sample size. Short texts (under 100 tokens) are generally too short for reliable detection. The investigator should request the longest available sample of suspected AI-generated text and avoid truncation.
Language and Domain Effects
Watermark strength varies by language and domain. Text in domains with highly constrained vocabularies (legal, medical, code) gives the model fewer opportunities to choose green-list tokens without degrading quality, resulting in weaker watermark signals. Forensic thresholds should be adjusted accordingly.
Tokenizer Sensitivity
The watermark signal exists at the token level, not the character or word level. Any text processing that changes tokenization boundaries (such as reformatting, changing whitespace, or correcting punctuation) can weaken the watermark signal even without intentional removal.
References
- Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A Watermark for Large Language Models. Proceedings of the 40th International Conference on Machine Learning (ICML). https://arxiv.org/abs/2301.10226
- Aaronson, S. (2023). My AI Safety Lecture at UT Austin. Scott Aaronson's Blog (Shtetl-Optimized). https://scottaaronson.blog/?p=6823
- Zhao, X., Ananth, P., Li, L., & Wang, Y.-X. (2024). Provable Robust Watermarking for AI-Generated Text. Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2306.17439