# AI Watermarking and Attacks
Current AI watermarking schemes for model outputs and training data, their security properties, and known attacks that remove, forge, or evade watermarks.
AI watermarking aims to solve two critical problems: detecting whether content was generated by AI and attributing that content to a specific model or provider. As AI-generated content becomes harder to distinguish from human-created content, watermarking has emerged as a primary technical approach for maintaining provenance and trust. However, every watermarking scheme must contend with fundamental trade-offs among detectability, robustness, and output quality, and adversaries actively exploit those trade-offs.
## Text Watermarking Schemes
### Green-Red Token Watermarking (Kirchenbauer et al.)
The most influential text watermarking approach partitions the vocabulary into "green" and "red" lists based on a hash of the preceding tokens, then biases generation toward green tokens:
```python
import hashlib

import torch


class GreenRedWatermark:
    """Kirchenbauer et al. green-red list watermarking."""

    def __init__(self, vocab_size, green_fraction=0.5,
                 watermark_strength=2.0, secret_key="key"):
        self.vocab_size = vocab_size
        self.green_fraction = green_fraction
        self.strength = watermark_strength  # delta parameter
        self.key = secret_key

    def get_green_tokens(self, prefix_token_id):
        """Compute the green list from a keyed hash of the preceding token."""
        seed = hashlib.sha256(
            f"{self.key}:{prefix_token_id}".encode()
        ).digest()
        rng = torch.Generator().manual_seed(
            int.from_bytes(seed[:4], "big")
        )
        # Random permutation of the vocabulary
        perm = torch.randperm(self.vocab_size, generator=rng)
        # The first green_fraction of the permutation are green tokens
        green_count = int(self.vocab_size * self.green_fraction)
        return set(perm[:green_count].tolist())

    def apply_watermark(self, logits, prefix_token_id):
        """Bias logits toward green tokens."""
        green_tokens = self.get_green_tokens(prefix_token_id)
        watermarked_logits = logits.clone()
        watermarked_logits[list(green_tokens)] += self.strength
        return watermarked_logits

    def detect_watermark(self, token_ids, threshold=4.0):
        """Detect the watermark by counting green-token frequency."""
        green_count = 0
        total = 0
        for i in range(1, len(token_ids)):
            green_tokens = self.get_green_tokens(token_ids[i - 1])
            if token_ids[i] in green_tokens:
                green_count += 1
            total += 1
        if total == 0:
            return {"detected": False, "z_score": 0.0}
        # z-score under the null hypothesis that unwatermarked
        # text hits the green list at rate green_fraction
        expected = total * self.green_fraction
        std = (total * self.green_fraction * (1 - self.green_fraction)) ** 0.5
        z_score = (green_count - expected) / std
        return {
            "detected": z_score > threshold,
            "z_score": z_score,
            "green_fraction_observed": green_count / total,
            "green_fraction_expected": self.green_fraction,
        }
```

### Distortion-Free Watermarking (Christ et al.)
This approach embeds watermarks without any quality degradation by using rejection sampling from the model's original distribution:
```python
import hashlib

import torch


class DistortionFreeWatermark:
    """
    Watermark that provably does not alter the output distribution.
    Relies on pseudorandomness shared between generator and detector.
    """

    def __init__(self, secret_key):
        self.key = secret_key

    def position_seed(self, step, prior_tokens):
        """Derive a reproducible seed from the key, position, and context."""
        material = f"{self.key}:{step}:{prior_tokens}".encode()
        return int.from_bytes(hashlib.sha256(material).digest()[:4], "big")

    def sample_gumbel(self, shape, seed):
        """Keyed Gumbel(0, 1) noise: -log(-log(U)) with seeded uniforms."""
        rng = torch.Generator().manual_seed(seed)
        uniform = torch.rand(shape, generator=rng).clamp_min(1e-12)
        return -torch.log(-torch.log(uniform))

    def generate_watermarked(self, model, prompt, max_tokens=100):
        """Generate watermarked text using the Gumbel-max trick."""
        tokens = []
        current_input = prompt
        for step in range(max_tokens):
            # Model's next-token distribution
            logits = model(current_input)
            probs = torch.softmax(logits[-1], dim=-1)
            # Pseudorandom Gumbel noise, keyed by position and context
            seed = self.position_seed(step, tokens)
            gumbel_noise = self.sample_gumbel(probs.shape, seed)
            # argmax(log probs + Gumbel noise) is an exact sample from
            # probs (Gumbel-max trick), but reproducible with the key
            selected = (probs.log() + gumbel_noise).argmax()
            tokens.append(selected.item())
            current_input = torch.cat([current_input, selected.unsqueeze(0)])
        return tokens

    def detect(self, tokens, model, prompt):
        """Detect the watermark by replaying the keyed noise and checking
        how often the observed tokens match the argmax choices."""
        matches = 0
        current_input = prompt
        for step, token in enumerate(tokens):
            # Re-run the model on the growing prefix, not just the prompt
            logits = model(current_input)
            probs = torch.softmax(logits[-1], dim=-1)
            seed = self.position_seed(step, tokens[:step])
            gumbel_noise = self.sample_gumbel(probs.shape, seed)
            expected_token = (probs.log() + gumbel_noise).argmax()
            if token == expected_token.item():
                matches += 1
            current_input = torch.cat([current_input, torch.tensor([token])])
        return matches / len(tokens)
```

## Watermark Attacks
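Attack papers typically report how far a transformation drops the detector's z-score. The harness below is a self-contained sketch of that evaluation: it swaps the permutation-based green list for a keyed per-bigram hash (my simplification, not Kirchenbauer et al.'s exact construction) so the whole pipeline runs without a language model, and the `simulate_watermarked` generator is likewise illustrative.

```python
import hashlib
import random


def is_green(prev_token, token, key="key", green_fraction=0.5):
    """A token is 'green' after prev_token if a keyed hash of the pair
    falls below green_fraction (a hash-based stand-in for the
    permutation-based green list)."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2 ** 32 < green_fraction


def z_score(tokens, key="key", green_fraction=0.5):
    """z-statistic of the observed green-token count against the null
    hypothesis that unwatermarked text is green at rate green_fraction."""
    total = len(tokens) - 1
    hits = sum(is_green(tokens[i - 1], tokens[i], key, green_fraction)
               for i in range(1, len(tokens)))
    expected = total * green_fraction
    std = (total * green_fraction * (1 - green_fraction)) ** 0.5
    return (hits - expected) / std


def simulate_watermarked(length=200, seed=0):
    """Simulate strongly watermarked text: at each step, pick a green
    continuation from 20 random candidate tokens when one exists."""
    rng = random.Random(seed)
    tokens = [0]
    for _ in range(length):
        candidates = [t for t in rng.sample(range(1000), 20)
                      if is_green(tokens[-1], t)]
        tokens.append(candidates[0] if candidates else rng.choice(range(1000)))
    return tokens


if __name__ == "__main__":
    tokens = simulate_watermarked()
    rng = random.Random(1)
    # Crude substitution attack: replace half the tokens at random
    attacked = [rng.randrange(1000) if rng.random() < 0.5 else t
                for t in tokens]
    print(f"z before attack: {z_score(tokens):.1f}, "
          f"after: {z_score(attacked):.1f}")
```

With these settings the simulated watermark sits around z ≈ 14, and a 50% substitution attack pulls it down toward the detection threshold, mirroring the partial robustness of soft green-red watermarks against the attacks below.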
### Paraphrase Attack
The simplest and most effective attack: rewrite the watermarked text using another model:
```python
def paraphrase_attack(watermarked_text, paraphraser_model):
    """
    Remove a watermark by paraphrasing with a non-watermarked model.
    Effectiveness depends on paraphrase quality.
    """
    prompt = f"""Rewrite the following text to convey the exact same
meaning using completely different wording and sentence structure.
Maintain the same level of detail and technical accuracy.

Original: {watermarked_text}

Rewritten version:"""
    return paraphraser_model.generate(prompt)
```

### Token Substitution Attack
Systematically replace tokens to reduce green token frequency while preserving meaning:
```python
import random


def token_substitution_attack(text, tokenizer, synonym_dict,
                              substitution_rate=0.3):
    """
    Replace a fraction of tokens with synonyms to disrupt the watermark.
    Lower substitution rates preserve quality but may leave a
    detectable watermark signal.
    """
    tokens = tokenizer.tokenize(text)
    modified = []
    for token in tokens:
        if random.random() < substitution_rate and token in synonym_dict:
            # Replace with a randomly chosen synonym
            modified.append(random.choice(synonym_dict[token]))
        else:
            modified.append(token)
    return tokenizer.detokenize(modified)
```

### Emoji and Unicode Attack
Insert zero-width characters or Unicode variations to disrupt token-level watermark detection:
```python
import random


def unicode_disruption_attack(text):
    """
    Insert zero-width characters inside words to disrupt watermark
    detection that depends on token boundaries.
    """
    zero_width_chars = [
        '\u200b',  # Zero-width space
        '\u200c',  # Zero-width non-joiner
        '\u200d',  # Zero-width joiner
        '\ufeff',  # Zero-width no-break space
    ]
    disrupted = []
    for word in text.split():
        # Randomly insert a zero-width character within the word
        if len(word) > 3 and random.random() > 0.5:
            pos = random.randint(1, len(word) - 1)
            zwc = random.choice(zero_width_chars)
            word = word[:pos] + zwc + word[pos:]
        disrupted.append(word)
    return ' '.join(disrupted)
```

### Spoofing Attack (Watermark Forgery)
If the watermarking key or algorithm is discovered, an attacker can embed false watermarks in human-written text:
```python
import random


def watermark_spoofing(human_text, watermark_key, tokenizer,
                       green_fraction=0.5):
    """
    Add a false watermark to human-written text to frame it as
    AI-generated, or to attribute it to a specific model.
    Assumes a find_synonyms(token, tokenizer) helper that returns
    candidate replacement token ids.
    """
    tokens = tokenizer.encode(human_text)
    watermarker = GreenRedWatermark(
        tokenizer.vocab_size, green_fraction,
        secret_key=watermark_key
    )
    spoofed_tokens = []
    for i, token in enumerate(tokens):
        if i == 0:
            spoofed_tokens.append(token)
            continue
        green_tokens = watermarker.get_green_tokens(spoofed_tokens[-1])
        if token in green_tokens:
            # Already green -- keep it
            spoofed_tokens.append(token)
        else:
            # Try to swap in a green synonym
            synonyms = find_synonyms(token, tokenizer)
            green_synonyms = [s for s in synonyms if s in green_tokens]
            if green_synonyms:
                spoofed_tokens.append(random.choice(green_synonyms))
            else:
                spoofed_tokens.append(token)
    return tokenizer.decode(spoofed_tokens)
```

### Recursive Watermark Removal
Use multiple models in sequence to progressively remove watermark signal:
```python
def recursive_removal(text, models, rounds=3):
    """
    Pass text through multiple non-watermarked models to
    progressively destroy the watermark signal.
    """
    current = text
    for round_num in range(rounds):
        model = models[round_num % len(models)]
        current = model.generate(
            f"Faithfully reproduce the following text with minor "
            f"stylistic improvements:\n\n{current}"
        )
    return current
```

## Image Watermarking and Attacks
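Image-watermark attacks are judged on two axes: whether the detector still fires, and how much visible damage the attack causes. Quality loss is usually reported as PSNR between the watermarked and attacked images; the pure-Python sketch below (the `psnr` helper and flat-list image representation are illustrative, not from the Stable Signature paper) shows the computation.

```python
import math


def psnr(original, attacked, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images given as
    equal-length flat lists of pixel values in [0, max_value]."""
    mse = sum((a - b) ** 2 for a, b in zip(original, attacked)) / len(original)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

An L-infinity perturbation of epsilon = 0.01 on [0, 1] pixels bounds the MSE at 1e-4, i.e. a PSNR of at least 40 dB, which is why the adversarial perturbation attack below can stay imperceptible.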
### Stable Signature (Fernandez et al.)
Embeds watermarks by fine-tuning the decoder of a latent diffusion model:
```python
import torch


class StableSignatureAttack:
    """Attacks against Stable Signature image watermarking."""

    def regeneration_attack(self, watermarked_image, clean_model):
        """
        Remove the watermark by encoding the image to latent space
        with a clean (non-watermarked) model, then decoding.
        """
        # Encode with any VAE encoder
        latent = clean_model.encoder(watermarked_image)
        # Add small noise to disrupt the watermark in latent space
        noisy_latent = latent + torch.randn_like(latent) * 0.05
        # Decode with a clean decoder (no watermark)
        return clean_model.decoder(noisy_latent)

    def adversarial_perturbation_attack(self, watermarked_image,
                                        detector, epsilon=0.01):
        """
        Add an imperceptible perturbation that causes the detector
        to miss the watermark.
        """
        image = watermarked_image.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([image], lr=0.001)
        for _ in range(100):
            optimizer.zero_grad()
            # Minimize detector confidence
            loss = detector(image)
            loss.backward()
            optimizer.step()
            # Project back into the L-infinity epsilon ball
            delta = torch.clamp(image.data - watermarked_image,
                                -epsilon, epsilon)
            image.data = watermarked_image + delta
        return image.detach()
```

## Watermark Robustness Analysis
A summary of watermark robustness across attack types:
| Watermark Type | Paraphrase | Token Sub. | Spoofing | Quality Loss |
|---|---|---|---|---|
| Green-Red (soft) | Vulnerable | Partially robust | Vulnerable if key leaked | Minimal |
| Green-Red (hard) | Vulnerable | Vulnerable | Vulnerable if key leaked | Moderate |
| Distortion-free | Vulnerable | Vulnerable | Resistant | None |
| Semantic watermark | Partially robust | Robust | Hard to spoof | Variable |
| Image (Stable Sig.) | N/A | N/A | Requires model access | Minimal |
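The table's qualitative ratings can be grounded quantitatively. For green-red detection, rearranging the z-score formula z = (p − γ)·√(T / (γ(1−γ))) gives the minimum text length T at which an observed green-token rate p clears a detection threshold; the sketch below is my own rearrangement of that formula, not taken from a specific paper.

```python
import math


def min_tokens_for_detection(observed_green_rate, green_fraction=0.5,
                             z_threshold=4.0):
    """Minimum token count T for an observed green rate p to exceed
    z_threshold, from:
        z = (p - gamma) * sqrt(T / (gamma * (1 - gamma)))
        T = gamma * (1 - gamma) * (z / (p - gamma)) ** 2
    """
    boost = observed_green_rate - green_fraction
    if boost <= 0:
        return math.inf  # no signal above the null rate
    return green_fraction * (1 - green_fraction) * (z_threshold / boost) ** 2
```

If watermarking lifts the green rate to 0.75, about 64 tokens suffice for z > 4; a post-attack rate of 0.55 needs roughly 1,600 tokens. This is why partial attacks that merely dilute the signal still defeat detection on short texts.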
## Implications for Trust and Governance
Watermarking occupies a central role in proposed AI governance frameworks, but its limitations have significant policy implications:
- False sense of security: Watermarks can be removed, creating false negatives (AI content not detected as AI)
- False accusations: Watermarks can be spoofed, creating false positives (human content flagged as AI)
- Arms race dynamics: As watermarking improves, so do attacks, creating an unstable equilibrium
- Coverage gaps: Not all AI providers watermark their outputs, and open-source models cannot be forced to implement watermarking
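The false-accusation risk is directly computable for a z-threshold detector: under the null hypothesis, long human-written text yields an approximately Gaussian z-statistic, so the per-document false positive rate is the Gaussian upper-tail probability.

```python
import math


def false_positive_rate(z_threshold):
    """Probability that unwatermarked text exceeds z_threshold by
    chance (Gaussian upper tail; a good approximation for long texts)."""
    return 0.5 * math.erfc(z_threshold / math.sqrt(2))
```

At a threshold of z = 4 this is about 3.2e-5, i.e. one in roughly 31,600 human documents flagged by chance. Applied at web scale that is still many false accusations in absolute terms, before any deliberate spoofing.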
## Related Topics
- Watermark & Fingerprint Evasion — Basic watermark evasion concepts
- Data Provenance — Tracking data through ML pipelines
- Model Extraction — Extracting models and bypassing IP protections
Review question: A regulatory agency proposes requiring all AI-generated text to carry watermarks. An adversary uses a non-watermarked open-source model to paraphrase watermarked text. What is the primary implication?
## References
- Kirchenbauer et al., "A Watermark for Large Language Models" (2023)
- Christ et al., "Undetectable Watermarks for Language Models" (2023)
- Fernandez et al., "The Stable Signature: Rooting Watermarks in Latent Diffusion Models" (2023)
- Zhao et al., "Provable Robust Watermarking for AI-Generated Text" (2023)
- Sadasivan et al., "Can AI-Generated Text be Reliably Detected?" (2023)