LLM Output Watermark Detection
Techniques for detecting, extracting, and analyzing watermarks embedded in LLM-generated text for provenance tracking and forensic attribution.
Overview
LLM output watermarking is the practice of embedding statistically detectable patterns in text generated by large language models. These patterns are imperceptible to human readers but can be identified by algorithms that know the watermarking scheme. From a forensic perspective, watermark detection enables investigators to determine whether text was AI-generated, which model or provider produced it, and in some cases, which user session generated it.
The primary watermarking approach for LLMs, introduced by Kirchenbauer et al. (2023), works by partitioning the vocabulary into "green" and "red" lists at each token position, using the previous token as a hash seed. During generation, the model's sampling is biased toward green-list tokens. The resulting text contains a statistically improbable excess of green-list tokens that can be detected with a simple hypothesis test.
This article covers the forensic application of watermark detection: how investigators can analyze text to determine its provenance, assess the strength of watermark evidence, and account for watermark removal attempts.
How LLM Watermarking Works
Green-List/Red-List Scheme
The Kirchenbauer et al. scheme operates during the token sampling phase of text generation. At each generation step:
- A cryptographic hash function takes the previous token (or a window of previous tokens) and a secret key as input
- The hash output deterministically partitions the full vocabulary into a "green list" (favored tokens) and a "red list" (disfavored tokens)
- A bias term delta is added to the logits of all green-list tokens before sampling
- The model is more likely to select green-list tokens, but can still select red-list tokens when they are strongly preferred by the model's distribution
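As a sketch of the generation-side biasing step, assuming direct access to the model's final-layer logits (the function name and the delta default are illustrative, not any provider's implementation):

```python
import numpy as np

def bias_logits(logits: np.ndarray, green_list: set[int], delta: float = 2.0) -> np.ndarray:
    """
    Add the watermarking bias delta to every green-list logit.

    Larger delta produces a stronger, easier-to-detect watermark
    at the cost of more distortion to the model's distribution.
    """
    biased = logits.copy()
    for token_id in green_list:
        biased[token_id] += delta
    return biased
```

Sampling then proceeds normally (softmax over the biased logits), so red-list tokens with very high original logits can still be emitted.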
"""
LLM watermark detection implementation for forensic analysis.
Based on the green-list scheme from Kirchenbauer et al. (2023).
This module implements the detection side -- given text and the
watermarking key, determine whether the text was watermarked.
"""
import hashlib
from collections.abc import Sequence
import numpy as np
from scipy import stats
def compute_green_list(
previous_token_id: int,
vocab_size: int,
secret_key: bytes,
gamma: float = 0.5,
) -> set[int]:
"""
Compute the green list for a given context token.
Uses HMAC-SHA256 to deterministically partition the vocabulary
into green and red lists based on the previous token and secret key.
Args:
previous_token_id: The token ID preceding the current position.
vocab_size: Total vocabulary size.
secret_key: The watermarking secret key.
gamma: Fraction of vocabulary in the green list (default 0.5).
Returns:
Set of token IDs in the green list.
"""
# Create deterministic seed from previous token and secret key
seed_material = previous_token_id.to_bytes(4, 'big') + secret_key
hash_bytes = hashlib.sha256(seed_material).digest()
seed = int.from_bytes(hash_bytes[:8], 'big')
rng = np.random.Generator(np.random.PCG64(seed))
permutation = rng.permutation(vocab_size)
green_list_size = int(gamma * vocab_size)
return set(permutation[:green_list_size].tolist())
def detect_watermark(
token_ids: Sequence[int],
vocab_size: int,
secret_key: bytes,
gamma: float = 0.5,
) -> dict:
"""
Test whether a sequence of token IDs contains a watermark.
Performs a one-proportion z-test against the null hypothesis
that tokens are drawn without green-list bias.
Under the null (no watermark), the expected fraction of green-list
tokens is gamma. Under the alternative (watermarked), the fraction
should be significantly higher.
Args:
token_ids: Sequence of token IDs to test.
vocab_size: Total vocabulary size.
secret_key: The watermarking secret key.
gamma: Green list fraction used during watermarking.
Returns:
Dict with detection results including z-score and p-value.
"""
if len(token_ids) < 2:
return {"error": "Need at least 2 tokens for detection"}
green_count = 0
total_tested = 0
for i in range(1, len(token_ids)):
prev_token = token_ids[i - 1]
current_token = token_ids[i]
green_list = compute_green_list(prev_token, vocab_size, secret_key, gamma)
if current_token in green_list:
green_count += 1
total_tested += 1
observed_fraction = green_count / total_tested
expected_fraction = gamma
# One-proportion z-test
# H0: p = gamma (no watermark)
# H1: p > gamma (watermark present)
se = np.sqrt(gamma * (1 - gamma) / total_tested)
z_score = (observed_fraction - expected_fraction) / se
# Use the survival function: 1 - cdf underflows to 0.0 for z above
# roughly 8, but sf stays accurate at forensic-grade significance levels
p_value = stats.norm.sf(z_score)
return {
"total_tokens_tested": total_tested,
"green_list_count": green_count,
"observed_green_fraction": round(observed_fraction, 4),
"expected_green_fraction": gamma,
"z_score": round(float(z_score), 4),
"p_value": float(p_value),
"watermark_detected": p_value < 1e-5,
"confidence": "high" if p_value < 1e-10 else "medium" if p_value < 1e-5 else "low",
}
Distortion-Free Watermarking
An alternative approach embeds watermarks through the sampling process itself rather than by modifying logits. The technique by Aaronson & Kirchner (described in Aaronson's 2023 blog post) uses a pseudorandom function seeded by previous tokens to generate uniform random numbers, then applies inverse transform sampling. The watermark is detected by checking whether the generated tokens correlate with the pseudorandom sequence in a way that would be astronomically unlikely by chance.
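There is no published reference implementation of this scheme; the sketch below illustrates only the detection side under simplifying assumptions (a single previous token as context, and a hypothetical keyed PRF `prf_uniform`). Under the null hypothesis each term `-log(1 - r)` is exponentially distributed with mean 1, so the sum over T tokens has mean T and variance T, giving a simple z-statistic:

```python
import hashlib
import math

def prf_uniform(prev_token_id: int, token_id: int, secret_key: bytes) -> float:
    """Keyed pseudorandom uniform in [0, 1) for a (context, candidate) pair."""
    material = (
        prev_token_id.to_bytes(4, "big")
        + token_id.to_bytes(4, "big")
        + secret_key
    )
    digest = hashlib.sha256(material).digest()
    # Keep 53 bits so the result is an exactly representable float < 1.0
    return (int.from_bytes(digest[:8], "big") >> 11) / (1 << 53)

def sampling_watermark_score(token_ids, secret_key: bytes) -> dict:
    """
    Score text against a sampling-based (Aaronson-style) watermark.

    Watermarked generation favors tokens whose PRF value r is close
    to 1, inflating the sum of -log(1 - r) above its null mean.
    """
    t = len(token_ids) - 1
    score = 0.0
    for i in range(1, len(token_ids)):
        r = prf_uniform(token_ids[i - 1], token_ids[i], secret_key)
        score += -math.log(1.0 - r)
    z = (score - t) / math.sqrt(t)
    return {"score": score, "tokens_tested": t, "z_score": z}
```

Because the bias lives in the sampling randomness rather than the logits, the output distribution is unchanged in expectation, which is why this family is called distortion-free.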
Multi-Bit Watermarking
While the basic green-list scheme embeds a single-bit watermark (present or absent), more advanced schemes embed multi-bit payloads that can encode information such as:
- Provider identification (which API generated this text)
- User or session identification
- Timestamp of generation
- Model version information
Multi-bit schemes partition the vocabulary into more than two groups, with each group corresponding to a different bit value. The forensic value of multi-bit watermarks is substantially higher because they enable direct attribution.
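Deployed multi-bit schemes differ in detail; as an illustration only (not any provider's actual design), the decoder below assumes a hypothetical encoder that cycles through payload positions and biases each token toward the vocabulary half matching the current payload bit:

```python
import hashlib

def bit_group(prev_token_id: int, token_id: int, secret_key: bytes) -> int:
    """Assign a token to bit-group 0 or 1, keyed by the previous token."""
    material = (
        prev_token_id.to_bytes(4, "big")
        + token_id.to_bytes(4, "big")
        + secret_key
    )
    return hashlib.sha256(material).digest()[0] & 1

def decode_payload(token_ids, secret_key: bytes, payload_bits: int) -> list[int]:
    """
    Recover a cyclically embedded payload by majority vote.

    Position i votes for payload bit (i mod payload_bits); biased
    generation makes the true bit win each vote with high probability
    once enough tokens contribute.
    """
    votes = [[0, 0] for _ in range(payload_bits)]
    for i in range(1, len(token_ids)):
        bit = bit_group(token_ids[i - 1], token_ids[i], secret_key)
        votes[(i - 1) % payload_bits][bit] += 1
    return [0 if zero >= one else 1 for zero, one in votes]
```

Note the trade-off: each payload bit receives only T / payload_bits votes, so longer payloads need proportionally longer texts for reliable decoding.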
Forensic Detection Methodology
Step 1: Text Preprocessing
Before running watermark detection, the text must be tokenized using the same tokenizer used by the suspected source model. Using the wrong tokenizer will destroy the watermark signal.
from transformers import AutoTokenizer
def prepare_text_for_detection(
text: str,
model_candidates: list[str],
) -> dict[str, list[int]]:
"""
Tokenize text with multiple candidate model tokenizers.
Since the watermark is embedded at the token level, we must
test with each candidate model's tokenizer. The correct
tokenizer will produce a stronger watermark signal.
"""
results = {}
for model_name in model_candidates:
tokenizer = AutoTokenizer.from_pretrained(model_name)
token_ids = tokenizer.encode(text, add_special_tokens=False)
results[model_name] = token_ids
return results
def multi_model_watermark_scan(
text: str,
model_candidates: list[str],
candidate_keys: dict[str, bytes],
gamma: float = 0.5,
) -> list[dict]:
"""
Scan text for watermarks from multiple candidate models/providers.
Tests each combination of tokenizer and watermarking key to
determine which (if any) produced the text.
"""
tokenizations = prepare_text_for_detection(text, model_candidates)
results = []
for model_name, token_ids in tokenizations.items():
if model_name not in candidate_keys:
continue
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Must match the vocabulary size used at generation time; note that
# len(tokenizer) also counts added special tokens, vocab_size may not
vocab_size = tokenizer.vocab_size
detection = detect_watermark(
token_ids=token_ids,
vocab_size=vocab_size,
secret_key=candidate_keys[model_name],
gamma=gamma,
)
detection["model"] = model_name
results.append(detection)
results.sort(key=lambda r: r.get("z_score", 0), reverse=True)
return results
Step 2: Statistical Significance Assessment
The z-test provides a p-value that quantifies the probability of observing the green-list token fraction under the null hypothesis of no watermarking. In forensic contexts, the significance threshold should be set conservatively:
| Confidence Level | p-value Threshold | z-score (approx.) | Forensic Use |
|---|---|---|---|
| Preliminary | < 0.01 | > 2.33 | Sufficient for triage, not for attribution |
| Standard | < 1e-5 | > 4.26 | Sufficient for internal investigation findings |
| High confidence | < 1e-10 | > 6.36 | Sufficient for formal forensic reports |
| Forensic grade | < 1e-20 | > 9.26 | Suitable for legal proceedings |
The required text length to achieve a given confidence level depends on the watermarking strength (delta parameter) and the green list fraction (gamma). As a rough guide:
- 200 tokens: Can detect strong watermarks (z > 4) in favorable conditions
- 500 tokens: Reliable detection for most watermarking configurations
- 1000+ tokens: Near-certain detection with high forensic confidence
Step 3: Windowed Analysis
If the text under investigation is a mix of human-written and AI-generated content, a sliding window analysis can identify which portions are watermarked.
def windowed_watermark_detection(
token_ids: list[int],
vocab_size: int,
secret_key: bytes,
window_size: int = 100,
step_size: int = 25,
gamma: float = 0.5,
) -> list[dict]:
"""
Apply watermark detection in a sliding window across the text.
This identifies which portions of a document are likely AI-generated
versus human-written, based on local watermark signal strength.
"""
results = []
# Include the final full window (range end is exclusive)
for start in range(0, len(token_ids) - window_size + 1, step_size):
window = token_ids[start:start + window_size]
detection = detect_watermark(window, vocab_size, secret_key, gamma)
detection["token_start"] = start
detection["token_end"] = start + window_size
results.append(detection)
return results
Watermark Removal Attacks and Forensic Countermeasures
Common Removal Techniques
Attackers aware of watermarking may attempt to remove the watermark while preserving the text's content. Common removal techniques include:
- Paraphrasing: Using another LLM to rewrite the text. This replaces most tokens, disrupting the green-list signal.
- Token substitution: Replacing individual words with synonyms using a thesaurus or word embedding similarity.
- Homoglyph substitution: Replacing characters with visually identical Unicode characters from different scripts.
- Translation round-tripping: Translating to another language and back, which completely retokenizes the text.
- Insertion/deletion: Adding or removing words to shift the token alignment.
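Homoglyph substitution corrupts tokenization rather than the underlying statistics, so normalizing the text before tokenizing can partially reverse it. A minimal sketch: NFKC folds many compatibility characters, and the explicit map (illustrative, covering only a few common Cyrillic lookalikes) handles cross-script homoglyphs that NFKC leaves untouched:

```python
import unicodedata

# Hypothetical minimal mapping; real casework needs a far larger table
HOMOGLYPH_MAP = str.maketrans({
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
})

def normalize_homoglyphs(text: str) -> str:
    """
    Undo homoglyph substitution before tokenization.

    Applies Unicode NFKC normalization, then maps known
    cross-script lookalikes back to their ASCII equivalents.
    """
    return unicodedata.normalize("NFKC", text).translate(HOMOGLYPH_MAP)
```

Running detection on both the raw and normalized text and comparing z-scores can itself be evidence: a large gain after normalization suggests a deliberate homoglyph attack.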
Forensic Detection of Removal Attempts
Even when a watermark has been partially removed, forensic evidence of the original watermark may persist in spans the attacker left unmodified. The following analysis compares full-text signal strength against windowed results to flag likely removal attempts:
def analyze_removal_artifacts(
token_ids: list[int],
vocab_size: int,
secret_key: bytes,
gamma: float = 0.5,
) -> dict:
"""
Analyze text for signs of watermark removal attempts.
Even after paraphrasing or editing, statistical artifacts of
the original watermark may remain in portions of the text
that were not modified.
"""
# Full-text detection
full_result = detect_watermark(token_ids, vocab_size, secret_key, gamma)
# Windowed detection to find surviving watermark fragments
window_results = windowed_watermark_detection(
token_ids, vocab_size, secret_key,
window_size=50, step_size=10, gamma=gamma,
)
# Count windows with significant watermark signal
significant_windows = [
w for w in window_results if w.get("z_score", 0) > 2.0
]
# Analyze the pattern of surviving vs. removed watermark regions
z_scores = [w.get("z_score", 0) for w in window_results]
z_variance = float(np.var(z_scores)) if z_scores else 0.0
return {
"full_text_detection": full_result,
"total_windows": len(window_results),
"significant_windows": len(significant_windows),
"significant_fraction": (
len(significant_windows) / len(window_results)
if window_results else 0.0
),
"z_score_variance": z_variance,
"removal_likely": (
full_result.get("z_score", 0) < 2.0
and len(significant_windows) > len(window_results) * 0.2
),
"interpretation": _interpret_removal_analysis(
full_result, significant_windows, window_results, z_variance
),
}
def _interpret_removal_analysis(
full_result: dict,
significant_windows: list,
all_windows: list,
z_variance: float,
) -> str:
full_z = full_result.get("z_score", 0)
sig_frac = len(significant_windows) / max(len(all_windows), 1)
if full_z > 4.0:
return "Strong watermark detected -- no significant removal"
if full_z > 2.0 and sig_frac > 0.5:
return "Partial watermark -- possible light editing or paraphrasing of some sections"
if full_z < 2.0 and sig_frac > 0.2:
return "Watermark largely removed but fragments survive -- likely systematic removal attempt"
if z_variance > 4.0:
return "High variance in local watermark strength -- mixed human/AI or selective editing"
return "No watermark signal detected"
Provider-Specific Watermarking Schemes
Different AI providers have implemented or announced different watermarking approaches. Forensic investigators should be aware of the landscape:
| Provider | Watermarking Status | Scheme Type | Detection Availability |
|---|---|---|---|
| OpenAI | Announced, delayed deployment | Unknown (not publicly detailed) | Not publicly available |
| Google DeepMind | SynthID deployed | Logit-based statistical | Integrated in Google tools |
| Meta | Research published | Green-list variant | Open-source reference implementation |
| Anthropic | No public watermarking announced | N/A | N/A |
| Microsoft | Research published | Multiple schemes studied | Research code available |
Google's SynthID system is the most widely deployed production watermarking scheme as of early 2026. It applies to both text and image outputs from Google's AI models and uses a tournament sampling scheme whose watermark is more robust to text modifications than the basic green-list approach.
Practical Considerations
Text Length Requirements
Watermark detection is a statistical test whose power increases with sample size. Short texts (under 100 tokens) are generally too short for reliable detection. The investigator should request the longest available sample of suspected AI-generated text and avoid truncation.
Language and Domain Effects
Watermark strength varies by language and domain. Text in domains with highly constrained vocabularies (legal, medical, code) has fewer opportunities for the model to choose green-list tokens without degrading quality, resulting in weaker watermark signals. Forensic thresholds should be adjusted accordingly.
Tokenizer Sensitivity
The watermark signal exists at the token level, not the character or word level. Any text processing that changes tokenization boundaries (such as reformatting, changing whitespace, or correcting punctuation) can weaken the watermark signal even without intentional removal.
References
- Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A Watermark for Large Language Models. Proceedings of the 40th International Conference on Machine Learning (ICML). https://arxiv.org/abs/2301.10226
- Aaronson, S. (2023). My AI Safety Lecture at UT Austin. Scott Aaronson's Blog (Shtetl-Optimized). https://scottaaronson.blog/?p=6823
- Zhao, X., Ananth, P., Li, L., & Wang, Y.-X. (2024). Provable Robust Watermarking for AI-Generated Text. Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2306.17439