Lab: AI Watermark Detection & Removal
Hands-on lab exploring techniques for detecting and removing statistical watermarks embedded in AI-generated text, and evaluating watermark robustness.
Prerequisites
- Understanding of language model token sampling (temperature, top-k, top-p)
- Familiarity with statistical hypothesis testing
- Python with NumPy, SciPy, and Hugging Face Transformers
- Background reading on the Kirchenbauer et al. "green list" watermarking scheme
Background
AI watermarking works by partitioning the vocabulary into "green" and "red" lists at each token position (based on a hash of the preceding token) and biasing generation toward green-list tokens. This bias is imperceptible to readers but statistically detectable by anyone who knows the hash key.
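To see why the bias is statistically detectable, note that without a watermark each scored token lands in the green list with probability gamma, so over T tokens the observed green fraction p-hat yields a one-proportion z-score of (p-hat - gamma) / sqrt(gamma * (1 - gamma) / T). A minimal numeric sketch (no model required; the counts are illustrative):

```python
import math

def watermark_z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """One-proportion z-test: how far the observed green fraction
    deviates from the no-watermark expectation gamma."""
    p_hat = green_count / total
    return (p_hat - gamma) / math.sqrt(gamma * (1 - gamma) / total)

# Unwatermarked text: roughly half the tokens land in the green list by chance
print(round(watermark_z_score(102, 200), 2))  # near zero -> no signal
# Watermarked text: the logit bias pushes ~75% of tokens onto the green list
print(round(watermark_z_score(150, 200), 2))  # large positive -> strong signal
```

Because the z-score grows with sqrt(T), longer texts make the same per-token bias exponentially harder to mistake for chance.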
Lab Exercises
Implement a Basic Watermark Detector
Build a detector that can identify watermarked text by analyzing token distribution patterns.
#!/usr/bin/env python3 """Detect statistical watermarks in AI-generated text.""" import hashlib import numpy as np from scipy import stats from transformers import AutoTokenizer class WatermarkDetector: def __init__(self, model_name: str = "meta-llama/Llama-3.1-8B-Instruct", gamma: float = 0.5, hash_key: int = 42): self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.gamma = gamma # Proportion of vocab in green list self.hash_key = hash_key self.vocab_size = self.tokenizer.vocab_size def _get_green_list(self, prev_token_id: int) -> set[int]: """Generate the green list for a given previous token.""" seed = hashlib.sha256( f"{self.hash_key}:{prev_token_id}".encode() ).digest() rng = np.random.RandomState( int.from_bytes(seed[:4], byteorder="big") ) green_size = int(self.vocab_size * self.gamma) green_tokens = set(rng.choice(self.vocab_size, green_size, replace=False)) return green_tokens def detect(self, text: str) -> dict: """Test whether text contains a watermark signal.""" token_ids = self.tokenizer.encode(text) if len(token_ids) < 10: return {"watermarked": False, "reason": "Text too short"} green_count = 0 total_scored = 0 for i in range(1, len(token_ids)): prev_id = token_ids[i - 1] current_id = token_ids[i] green_list = self._get_green_list(prev_id) if current_id in green_list: green_count += 1 total_scored += 1 # Statistical test: under no watermark, green proportion ~ gamma green_proportion = green_count / total_scored z_score = (green_proportion - self.gamma) / np.sqrt( self.gamma * (1 - self.gamma) / total_scored ) p_value = 1 - stats.norm.cdf(z_score) return { "watermarked": p_value < 0.01, "green_proportion": green_proportion, "expected_proportion": self.gamma, "z_score": z_score, "p_value": p_value, "tokens_analyzed": total_scored, } if __name__ == "__main__": detector = WatermarkDetector() # Test with known AI-generated text ai_text = "The process of photosynthesis converts sunlight into energy..." 
result = detector.detect(ai_text) print(f"AI text: watermarked={result['watermarked']}, z={result['z_score']:.2f}") human_text = "I went to the store yesterday and bought some groceries." result = detector.detect(human_text) print(f"Human text: watermarked={result['watermarked']}, z={result['z_score']:.2f}")python watermark_detector.pyGenerate Watermarked Text for Testing
Create a watermarked text generator to produce test samples with known watermark signals.
#!/usr/bin/env python3 """Generate watermarked text using green-list biasing.""" import hashlib import torch import numpy as np from transformers import AutoModelForCausalLM, AutoTokenizer class WatermarkGenerator: def __init__(self, model_name: str = "meta-llama/Llama-3.1-8B-Instruct", gamma: float = 0.5, delta: float = 2.0, hash_key: int = 42): self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForCausalLM.from_pretrained( model_name, torch_dtype=torch.float16, device_map="auto" ) self.gamma = gamma self.delta = delta # Bias strength added to green-list logits self.hash_key = hash_key self.vocab_size = self.tokenizer.vocab_size def _get_green_mask(self, prev_token_id: int) -> torch.Tensor: seed = hashlib.sha256( f"{self.hash_key}:{prev_token_id}".encode() ).digest() rng = np.random.RandomState( int.from_bytes(seed[:4], byteorder="big") ) green_size = int(self.vocab_size * self.gamma) green_indices = rng.choice(self.vocab_size, green_size, replace=False) mask = torch.zeros(self.vocab_size) mask[green_indices] = self.delta return mask def generate(self, prompt: str, max_tokens: int = 200) -> str: input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to( self.model.device ) generated = input_ids.clone() for _ in range(max_tokens): with torch.no_grad(): outputs = self.model(generated) logits = outputs.logits[0, -1, :] prev_token = generated[0, -1].item() green_mask = self._get_green_mask(prev_token).to(logits.device) logits = logits + green_mask probs = torch.softmax(logits, dim=-1) next_token = torch.multinomial(probs, 1) generated = torch.cat([generated, next_token.unsqueeze(0)], dim=-1) if next_token.item() == self.tokenizer.eos_token_id: break new_tokens = generated[0, input_ids.shape[1]:] return self.tokenizer.decode(new_tokens, skip_special_tokens=True) if __name__ == "__main__": gen = WatermarkGenerator() text = gen.generate("Explain the benefits of renewable energy:") print(f"Watermarked text:\n{text}")Apply Watermark 
Removal Techniques
Test various text transformation methods that aim to remove the watermark while preserving semantic content.
#!/usr/bin/env python3 """Techniques for removing watermarks from AI-generated text.""" from watermark_detector import WatermarkDetector class WatermarkRemover: def __init__(self): self.detector = WatermarkDetector() def paraphrase(self, text: str) -> str: """Use a different model to paraphrase the text.""" from openai import OpenAI client = OpenAI() response = client.chat.completions.create( model="gpt-4o-mini", messages=[{ "role": "user", "content": f"Paraphrase the following text while keeping " f"the same meaning:\n\n{text}", }], max_tokens=500, ) return response.choices[0].message.content def synonym_replace(self, text: str, replacement_rate: float = 0.3) -> str: """Replace a fraction of words with synonyms.""" import random words = text.split() # Simplified: in practice, use WordNet or a synonym API result = [] for word in words: if random.random() < replacement_rate and len(word) > 3: result.append(f"[SYN:{word}]") # Placeholder for synonym else: result.append(word) return " ".join(result) def sentence_shuffle(self, text: str) -> str: """Reorder sentences while maintaining coherence.""" import random sentences = text.split(". ") if len(sentences) > 2: # Keep first and last, shuffle middle middle = sentences[1:-1] random.shuffle(middle) sentences = [sentences[0]] + middle + [sentences[-1]] return ". 
".join(sentences) def back_translate(self, text: str) -> str: """Translate to another language and back.""" from openai import OpenAI client = OpenAI() # Step 1: English -> French fr = client.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": f"Translate to French:\n\n{text}"}], ).choices[0].message.content # Step 2: French -> English en = client.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": f"Translate to English:\n\n{fr}"}], ).choices[0].message.content return en def evaluate_removal(self, original: str, modified: str) -> dict: orig_result = self.detector.detect(original) mod_result = self.detector.detect(modified) return { "original_z": orig_result["z_score"], "modified_z": mod_result["z_score"], "watermark_removed": not mod_result["watermarked"], "original_detected": orig_result["watermarked"], } if __name__ == "__main__": remover = WatermarkRemover() sample = "The process of renewable energy generation involves..." for method_name in ["paraphrase", "back_translate", "sentence_shuffle"]: method = getattr(remover, method_name) modified = method(sample) result = remover.evaluate_removal(sample, modified) print(f"{method_name}: z_score {result['original_z']:.2f} -> " f"{result['modified_z']:.2f} | " f"Removed: {result['watermark_removed']}")python watermark_removal.pyMeasure Robustness vs. Quality Trade-offs
Quantify how aggressive removal must be to eliminate the watermark and what quality cost that incurs.
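One way to frame the sweet-spot search is as a two-axis score per technique. The helper below is a minimal sketch: the z-threshold of 2.33 matches the detector's one-sided p < 0.01 cutoff, and the similarity value stands in for whatever sentence-embedding model you use; the numbers in the example are hypothetical:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Semantic-similarity proxy between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def removal_tradeoff(z_before: float, z_after: float, similarity: float,
                     z_threshold: float = 2.33) -> dict:
    """Summarize one technique: did it push the z-score below the
    detection threshold, and how much meaning survived?"""
    return {
        "removed": z_after < z_threshold,
        "z_drop": z_before - z_after,
        "quality": similarity,
    }

# Hypothetical results for a light vs. an aggressive transformation
light = removal_tradeoff(z_before=7.1, z_after=4.8, similarity=0.97)
heavy = removal_tradeoff(z_before=7.1, z_after=0.9, similarity=0.81)
print(light)  # watermark still detected, quality nearly intact
print(heavy)  # watermark gone, but meaning drifted
```

Sweeping a technique's aggressiveness parameter (e.g. `replacement_rate`) and plotting `z_drop` against `quality` traces out the trade-off curve the exercise asks for.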
```python
# For each removal technique, measure:
#   1. Watermark detection z-score before and after
#   2. Semantic similarity (cosine similarity of embeddings)
#   3. Fluency score (perplexity under a reference model)
#   4. Factual accuracy preservation
#
# Plot: removal effectiveness vs. text quality degradation.
# The goal is to find the "sweet spot" where the watermark is removed
# with minimal quality loss.
```

Write a Watermark Robustness Assessment
Document findings on the strengths and weaknesses of the watermarking scheme.
```markdown
# Watermark Robustness Assessment

## Watermark Scheme Tested
- Type: Green-list statistical watermark (Kirchenbauer et al.)
- Parameters: gamma=0.5, delta=2.0

## Removal Technique Effectiveness

| Technique | Watermark Removed | Quality Preserved | Cost |
|---|---|---|---|
| Paraphrasing | Yes/No | High/Medium/Low | API calls |
| Back-translation | Yes/No | High/Medium/Low | 2x API calls |
| Synonym replacement | Yes/No | High/Medium/Low | Local compute |
| Sentence shuffling | Yes/No | High/Medium/Low | None |

## Recommendations
- Minimum text length for reliable detection: N tokens
- Most cost-effective removal method: [method]
- Watermark schemes should be combined with other provenance methods
```
Troubleshooting
| Issue | Solution |
|---|---|
| Detector always says "not watermarked" | Verify hash key matches between generator and detector |
| Text quality degrades severely | Reduce replacement rate or use higher-quality paraphrase models |
| Z-scores are near zero for all text | Ensure the text is long enough (50+ tokens minimum) |
| GPU memory errors during generation | Use a smaller model or INT8 quantization for the generator |
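The "50+ tokens" rule of thumb above can be derived from the z-test: to clear a threshold z* when the watermark pushes the expected green fraction to p, you need roughly T >= z*^2 * gamma * (1 - gamma) / (p - gamma)^2 scored tokens. A quick sanity check under the lab's defaults (gamma = 0.5; the green fractions below are assumed values for strong and weak bias, not measured ones):

```python
import math

def min_tokens_for_detection(p_green: float, gamma: float = 0.5,
                             z_threshold: float = 2.33) -> int:
    """Smallest token count T at which the expected z-score
    (p_green - gamma) / sqrt(gamma * (1 - gamma) / T) clears z_threshold."""
    return math.ceil(z_threshold ** 2 * gamma * (1 - gamma)
                     / (p_green - gamma) ** 2)

# Strong bias (~75% of tokens green): a few dozen tokens suffice on average
print(min_tokens_for_detection(0.75))
# Weak bias (~55% of tokens green): hundreds of tokens are needed
print(min_tokens_for_detection(0.55))
```

These are break-even counts for the *expected* z-score; since real texts fluctuate around it, reliable detection in practice needs a comfortable margin above them, which is why the table recommends 50+ tokens.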
Related Topics
- Quantization Exploit - Quantization as a potential watermark degradation mechanism
- Training Data Extraction - Related detection and attribution challenges for AI-generated content
- Model Extraction - Extracting model capabilities that enable watermark analysis
- Token Smuggling - Token-level manipulation related to watermark evasion strategies
References
- "A Watermark for Large Language Models" - Kirchenbauer et al. (2023) - The foundational green-list watermarking scheme this lab implements and attacks
- "On the Reliability of Watermarks for Large Language Models" - Christ et al. (2023) - Analysis of watermark robustness under various attack strategies
- "Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense" - Krishna et al. (2023) - Paraphrase attacks on AI text detection and watermarking
- "Undetectable Watermarks for Language Models" - Christ et al. (2024) - Theoretical foundations for watermark undetectability and robustness