Watermarking & AI-Generated Text Detection
Statistical watermarking schemes for LLM outputs, AI-generated text detectors, their cryptographic foundations, and systematic techniques for evading or removing watermarks.
Watermarking and AI-generated text detection represent the defender's attempt to maintain provenance over model outputs. For red teamers, understanding these mechanisms is essential -- both to assess their robustness and to evaluate whether downstream security assumptions built on watermark presence are sound.
Watermarking Scheme Taxonomy
| Scheme Type | Mechanism | Robustness | Quality Impact | Detection Complexity |
|---|---|---|---|---|
| Token-level (Kirchenbauer et al.) | Green/red list partitioning per token position | Moderate -- vulnerable to paraphrasing | Low at moderate bias (delta ~1-2) | O(n) with secret key |
| Distribution-shift | Subtle bias across full output distribution | Higher -- survives light editing | Very low | Requires statistical test |
| Semantic watermarking | Embeds signal in meaning rather than specific tokens | Highest -- survives paraphrasing | Moderate | Requires embedding model |
| Multi-bit | Encodes payload (user ID, timestamp) in watermark | Varies | Moderate -- more bits = more distortion | Requires key + decoder |
| Unbiased (Christ et al.) | Perfectly preserves output distribution | High theoretical guarantees | None (provably unbiased) | Requires full key |
How Token-Level Watermarking Works
The most widely deployed scheme (based on Kirchenbauer et al., 2023) operates during inference:
```python
import hashlib
import numpy as np

class WatermarkLogitsProcessor:
    """Simplified watermark injection during LLM inference."""

    def __init__(self, secret_key: bytes, gamma: float = 0.5, delta: float = 2.0):
        self.secret_key = secret_key
        self.gamma = gamma  # fraction of vocab in green list
        self.delta = delta  # logit bias added to green-list tokens

    def get_green_list(self, prev_token_id: int, vocab_size: int) -> set:
        """Deterministically partition vocab into green/red lists."""
        seed = hashlib.sha256(
            self.secret_key + prev_token_id.to_bytes(4, "big")
        ).digest()
        rng = np.random.RandomState(int.from_bytes(seed[:4], "big"))
        green_count = int(vocab_size * self.gamma)
        return set(rng.choice(vocab_size, green_count, replace=False))

    def __call__(self, prev_token_id: int, logits: np.ndarray) -> np.ndarray:
        """Add delta bias to green-list token logits."""
        green_list = self.get_green_list(prev_token_id, len(logits))
        modified = logits.copy()
        modified[list(green_list)] += self.delta
        return modified
```
Detection Algorithm
```python
import numpy as np
from scipy import stats

def detect_watermark(
    text_token_ids: list[int],
    secret_key: bytes,
    vocab_size: int,
    gamma: float = 0.5,
) -> dict:
    """Test whether text contains a statistical watermark."""
    processor = WatermarkLogitsProcessor(secret_key, gamma=gamma)
    green_count = 0
    total = len(text_token_ids) - 1  # skip first token (no predecessor)
    for i in range(1, len(text_token_ids)):
        prev_id = text_token_ids[i - 1]
        green_list = processor.get_green_list(prev_id, vocab_size)
        if text_token_ids[i] in green_list:
            green_count += 1
    # Under the null hypothesis (no watermark), green fraction ~ gamma
    z_score = (green_count - gamma * total) / np.sqrt(total * gamma * (1 - gamma))
    p_value = stats.norm.sf(z_score)  # upper-tail probability
    return {
        "green_fraction": green_count / total,
        "z_score": z_score,
        "p_value": p_value,
        "watermarked": z_score > 4.0,  # conservative threshold
    }
```
AI-Generated Text Detection (Non-Watermark)
Beyond watermarking, statistical detectors attempt to identify AI-generated text without cooperation from the generating model.
Detector Categories
| Detector Type | Approach | Strengths | Weaknesses |
|---|---|---|---|
| Perplexity-based | AI text has lower perplexity under a reference model | Simple, no training needed | High false-positive rate on formulaic human text |
| Trained classifiers | ML model trained on human vs. AI text pairs | Can capture subtle distributional patterns | Brittle to domain shift, new models |
| Zero-shot (DetectGPT) | Perturbation-based: AI text occupies local curvature maxima in log-probability space | No training data needed | Computationally expensive, less reliable on short text |
| Stylometric | Analyzes writing style features (sentence length variance, vocabulary richness) | Model-agnostic | Easily defeated by style-transfer |
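The perplexity-based approach in the table reduces to a few lines once a reference model supplies per-token log-probabilities. The scoring interface and the threshold value below are illustrative assumptions, not a specific detector's API:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean log-likelihood) over the token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_as_ai(token_logprobs: list[float], threshold: float = 15.0) -> bool:
    """Low perplexity under the reference model -> flag as likely AI-generated.
    The threshold is illustrative; real deployments calibrate it per domain."""
    return perplexity(token_logprobs) < threshold
```

The table's weakness column follows directly from this design: formulaic human text (boilerplate, legal prose) also scores low perplexity, which drives the false-positive rate.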
DetectGPT Mechanism
```python
import numpy as np

def detect_gpt_score(
    text: str,
    scoring_model,
    mask_model,
    n_perturbations: int = 100,
) -> float:
    """
    DetectGPT: AI text sits at local maxima of log-probability.
    Perturbations should decrease log-prob for AI text,
    but have a roughly random effect on human text.
    """
    original_ll = scoring_model.log_likelihood(text)
    perturbation_lls = []
    for _ in range(n_perturbations):
        # Mask random spans and fill with mask_model (e.g., T5)
        perturbed = mask_and_fill(text, mask_model)
        perturbation_lls.append(scoring_model.log_likelihood(perturbed))
    mean_perturbed_ll = np.mean(perturbation_lls)
    std_perturbed_ll = np.std(perturbation_lls)
    # Normalized perturbation discrepancy
    score = (original_ll - mean_perturbed_ll) / std_perturbed_ll
    return score  # high score -> likely AI-generated
```
Red Team Evasion Techniques
Watermark Removal
Paraphrase Attack

The most effective general-purpose attack: a second (unwatermarked) model paraphrases the watermarked output, destroying the token-level statistical signal while preserving meaning.

```python
def paraphrase_attack(watermarked_text: str, paraphraser) -> str:
    """Use a separate model to remove the watermark via paraphrasing."""
    prompt = f"""Rewrite the following text to convey the same meaning
using different words and sentence structures. Preserve all
factual content but change the phrasing completely.

Text: {watermarked_text}

Rewritten:"""
    return paraphraser.generate(prompt)
```

Effectiveness: removes token-level watermarks with >95% success. Semantic watermarks are more resistant.
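Single-pass paraphrasing sometimes leaves a residual signal; running multiple rounds compounds the statistical noise. A minimal wrapper, assuming the paraphraser is exposed as any text-to-text callable (the interface is an assumption, not a specific library):

```python
def recursive_paraphrase(text: str, paraphrase_fn, rounds: int = 3) -> str:
    """Apply a paraphraser repeatedly; each round further erodes the
    token-level watermark's green-list statistics."""
    for _ in range(rounds):
        text = paraphrase_fn(text)
    return text
```

The trade-off is semantic drift: more rounds remove more signal but risk failing the embedding-similarity check used to verify meaning preservation.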
Synonym Substitution

Systematically replace tokens with synonyms at a rate sufficient to break the green-list bias without changing meaning.

```python
import random

def synonym_substitution_attack(
    text: str,
    synonym_dict: dict[str, list[str]],
    substitution_rate: float = 0.25,
) -> str:
    """Replace ~25% of content words with synonyms."""
    words = text.split()
    for i, word in enumerate(words):
        if random.random() < substitution_rate and word.lower() in synonym_dict:
            words[i] = random.choice(synonym_dict[word.lower()])
    return " ".join(words)
```

Unicode Injection

Insert zero-width characters or homoglyphs between tokens to break the predecessor-based green-list computation.
```python
import random

ZERO_WIDTH_CHARS = ["\u200b", "\u200c", "\u200d", "\ufeff"]

def unicode_injection_attack(text: str, injection_rate: float = 0.15) -> str:
    """Insert zero-width characters to disrupt the token-chain watermark."""
    chars = list(text)
    positions = random.sample(range(len(chars)), int(len(chars) * injection_rate))
    # Insert from the end so earlier positions remain valid
    for pos in sorted(positions, reverse=True):
        chars.insert(pos, random.choice(ZERO_WIDTH_CHARS))
    return "".join(chars)
```

Cross-Model Transfer

Use the watermarked text as context for a non-watermarked model to regenerate equivalent content.
```python
def cross_model_transfer(watermarked_text: str, clean_model) -> str:
    """Transfer content through a non-watermarked model."""
    prompt = f"""Based on the following information, write a comprehensive
response covering the same topics and conclusions:

{watermarked_text}"""
    return clean_model.generate(prompt)
```
Detector Evasion
| Evasion Technique | Target Detector | Mechanism |
|---|---|---|
| Recursive paraphrasing | All statistical | Multiple rounds of rewriting compound noise |
| Adversarial perturbation | Trained classifiers | Gradient-based token substitution to minimize classifier confidence |
| Human-AI co-writing | Perplexity-based | Mix human and AI text to raise perplexity variance |
| Style transfer | Stylometric | Rewrite in a specific human author's style |
| Back-translation | Token-level watermark | Translate to another language and back |
Robustness Assessment Framework
When evaluating watermarking or detection systems during a red team engagement, use this systematic approach:
Baseline Measurement
Generate 100+ watermarked samples and verify detection rate. Establish the z-score distribution for confirmed watermarked text versus clean text.
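The conventional z > 4 flagging threshold implies a concrete false-positive budget for this baseline: under the null hypothesis the z-score is approximately standard normal, so the expected false-positive rate is the upper tail beyond the threshold. A quick sketch (numerically equivalent to `scipy.stats.norm.sf`):

```python
import math

def expected_fpr(z_threshold: float = 4.0) -> float:
    """Upper-tail probability of a standard normal beyond z_threshold:
    the false-positive rate implied by flagging text at z > threshold."""
    return 0.5 * math.erfc(z_threshold / math.sqrt(2))
```

At z > 4 this is about 3.2e-5, i.e. roughly one false flag per 30,000 clean documents; comparing the measured clean-text flag rate against this figure reveals whether the detector's null model actually holds in the target domain.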
Minimal Perturbation Testing
Apply increasing levels of perturbation (5%, 10%, 25%, 50% token replacement) and measure detection degradation. Find the threshold where detection fails.
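The sweep can be prototyped without a live model by simulating the green-token statistics directly: watermarked text is modeled as tokens that are green with some elevated probability, and each substituted token reverts to the baseline rate gamma. All parameters here (500 tokens, a 0.75 green fraction) are illustrative assumptions, not measurements:

```python
import numpy as np

def z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """Same statistic as the detector: deviation of the green-token
    count from its expectation under the no-watermark null."""
    return (green_count - gamma * total) / np.sqrt(total * gamma * (1 - gamma))

def perturbation_sweep(n_tokens: int = 500, gamma: float = 0.5,
                       green_frac: float = 0.75, seed: int = 0,
                       rates=(0.0, 0.05, 0.10, 0.25, 0.50)) -> dict:
    """Simulate token replacement: each replaced token is green only
    with probability gamma, dragging the statistic back toward 0."""
    rng = np.random.default_rng(seed)
    results = {}
    for r in rates:
        is_green = rng.random(n_tokens) < green_frac  # watermarked stream
        replaced = rng.random(n_tokens) < r           # attacker edits
        is_green[replaced] = rng.random(int(replaced.sum())) < gamma
        results[r] = float(z_score(int(is_green.sum()), n_tokens, gamma))
    return results
```

One lesson this simulation makes visible: at 500 tokens even 50% substitution can leave the z-score above threshold, so the failure point depends jointly on text length and perturbation rate, not on the rate alone.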
Semantic Preservation Verification
For each successful evasion technique, verify that the semantic content is preserved using embedding similarity (cosine similarity > 0.90 on the original vs. evaded text).
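The check itself is a few lines of NumPy; `orig_emb` and `evaded_emb` stand for vectors from whatever sentence-embedding model the engagement uses (the embedding model is assumed, not specified):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantics_preserved(orig_emb: np.ndarray, evaded_emb: np.ndarray,
                        threshold: float = 0.90) -> bool:
    """Accept an evasion only if the evaded text stays semantically
    close to the original (cosine similarity above the threshold)."""
    return cosine_similarity(orig_emb, evaded_emb) >= threshold
```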
False Positive Analysis
Run the detector on human-written text from the target domain. Report false positive rates by demographic and writing style, especially for non-native English speakers.
Adaptive Attack
If the watermark scheme uses a known architecture, design a targeted attack that exploits its specific weakness (e.g., known green-list structure, fixed window size).
Fundamental Limitations
| Limitation | Implication for Red Teams |
|---|---|
| Quality-robustness trade-off | Aggressive watermarks degrade output -- look for quality differences as detection signal |
| Key management | Compromised keys allow forgery of watermarks or targeted removal |
| Short text unreliability | Watermarks require ~200+ tokens for reliable detection |
| Multi-model pipelines | Text passing through multiple models loses watermark provenance |
| Fine-tuning removal | Fine-tuning a watermarked model can remove the watermarking behavior |
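The short-text limitation follows directly from the z-test the detector runs: with green-list fraction gamma and an observed green rate p in watermarked text, the token count needed to reach a detection threshold z has a closed form. The example green fraction of 0.65 below is a plausible value at moderate delta, not a measured one:

```python
import math

def min_tokens_for_detection(p_green: float, gamma: float = 0.5,
                             z_threshold: float = 4.0) -> int:
    """Solve (p - gamma) * sqrt(n) / sqrt(gamma * (1 - gamma)) >= z for n:
    n >= z**2 * gamma * (1 - gamma) / (p - gamma)**2."""
    n = z_threshold**2 * gamma * (1 - gamma) / (p_green - gamma) ** 2
    return math.ceil(n)
```

With gamma = 0.5 and a green fraction of 0.65 this gives 178 tokens, consistent with the ~200-token rule of thumb in the table; as the observed bias shrinks toward gamma, the requirement grows quadratically, which is why partially paraphrased text needs far more tokens to flag.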
For related defensive techniques, see Input/Output Filtering Systems and Guardrails Architecture.
Related Topics
- Advanced Defense Techniques - Other cutting-edge defense research including instruction hierarchy and representation engineering
- Input/Output Filtering Systems - Complementary output scanning approaches
- Guardrails & Safety Layer Architecture - Where watermarking fits in the defense pipeline
- Inference: Sampling, Temperature & Generation - Decoding mechanics that watermarking modifies
References
- "A Watermark for Large Language Models" - Kirchenbauer et al. (2023) - The foundational paper introducing token-level green/red list watermarking for LLMs
- "Unbiased Watermark for Large Language Models" - Christ et al. (2024) - Research on provably unbiased watermarking that preserves output distribution quality
- "DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature" - Mitchell et al. (2023) - The perturbation-based AI text detection method using log-probability analysis
- "On the Reliability of Watermarks for Large Language Models" - Sadasivan et al. (2023) - Critical analysis of watermark robustness against paraphrase and editing attacks
- "Can AI-Generated Text be Reliably Detected?" - Sadasivan et al. (2023) - Research on fundamental limitations of AI text detection including false positive rates