Lab: AI Watermark Detection & Removal
Hands-on lab exploring techniques for detecting and removing statistical watermarks embedded in AI-generated text, and evaluating watermark robustness.
Prerequisites
- Understanding of language model token sampling (temperature, top-k, top-p)
- Familiarity with statistical hypothesis testing
- Python with NumPy, SciPy, and Hugging Face Transformers
- Background reading on the Kirchenbauer et al. "green list" watermarking scheme
Background
AI watermarking works by partitioning the vocabulary into "green" and "red" lists at each token position (based on a hash of the preceding token) and biasing generation toward green-list tokens. This bias is imperceptible to readers but statistically detectable by anyone who knows the hash key.
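To see why the bias is statistically detectable, note that without a watermark each scored token lands in the green list with probability gamma, so over T tokens the observed green fraction p-hat yields a one-proportion z-score of (p-hat - gamma) / sqrt(gamma * (1 - gamma) / T). A minimal numeric sketch (no model required; the counts are illustrative):

```python
import math

def watermark_z_score(green_count: int, total: int, gamma: float = 0.5) -> float:
    """One-proportion z-test: how far the observed green fraction
    deviates from the no-watermark expectation gamma."""
    p_hat = green_count / total
    return (p_hat - gamma) / math.sqrt(gamma * (1 - gamma) / total)

# Unwatermarked text: roughly half the tokens land in the green list by chance
print(round(watermark_z_score(102, 200), 2))  # near zero -> no signal
# Watermarked text: the logit bias pushes ~75% of tokens onto the green list
print(round(watermark_z_score(150, 200), 2))  # large positive -> strong signal
```

Because the z-score grows with sqrt(T), longer texts make the same per-token bias exponentially harder to mistake for chance.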
Lab Exercises
Implement a Basic Watermark Detector
Build a detector that can identify watermarked text by analyzing token distribution patterns.
#!/usr/bin/env python3 """Detect statistical watermarks in AI-generated text.""" import hashlib import numpy as np from scipy import stats from transformers import AutoTokenizer class WatermarkDetector: def __init__(self, model_name: str = "meta-llama/Llama-3.1-8B-Instruct", gamma: float = 0.5, hash_key: int = 42): self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.gamma = gamma # Proportion of vocab in green list self.hash_key = hash_key self.vocab_size = self.tokenizer.vocab_size def _get_green_list(self, prev_token_id: int) -> set[int]: """Generate the green list for a given previous token.""" seed = hashlib.sha256( f"{self.hash_key}:{prev_token_id}".encode() ).digest() rng = np.random.RandomState( int.from_bytes(seed[:4], byteorder="big") ) green_size = int(self.vocab_size * self.gamma) green_tokens = set(rng.choice(self.vocab_size, green_size, replace=False)) return green_tokens def detect(self, text: str) -> dict: """Test whether text contains a watermark signal.""" token_ids = self.tokenizer.encode(text) if len(token_ids) < 10: return {"watermarked": False, "reason": "Text too short"} green_count = 0 total_scored = 0 for i in range(1, len(token_ids)): prev_id = token_ids[i - 1] current_id = token_ids[i] green_list = self._get_green_list(prev_id) if current_id in green_list: green_count += 1 total_scored += 1 # Statistical test: under no watermark, green proportion ~ gamma green_proportion = green_count / total_scored z_score = (green_proportion - self.gamma) / np.sqrt( self.gamma * (1 - self.gamma) / total_scored ) p_value = 1 - stats.norm.cdf(z_score) return { "watermarked": p_value < 0.01, "green_proportion": green_proportion, "expected_proportion": self.gamma, "z_score": z_score, "p_value": p_value, "tokens_analyzed": total_scored, } if __name__ == "__main__": detector = WatermarkDetector() # Test with known AI-generated text ai_text = "The process of photosynthesis converts sunlight into energy..." 
result = detector.detect(ai_text) print(f"AI text: watermarked={result['watermarked']}, z={result['z_score']:.2f}") human_text = "I went to the store yesterday and bought some groceries." result = detector.detect(human_text) print(f"Human text: watermarked={result['watermarked']}, z={result['z_score']:.2f}")python watermark_detector.pyGenerate Watermarked Text for Testing
Create a watermarked text generator to produce test samples with known watermark signals.
#!/usr/bin/env python3 """Generate watermarked text using green-list biasing.""" import hashlib import torch import numpy as np from transformers import AutoModelForCausalLM, AutoTokenizer class WatermarkGenerator: def __init__(self, model_name: str = "meta-llama/Llama-3.1-8B-Instruct", gamma: float = 0.5, delta: float = 2.0, hash_key: int = 42): self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForCausalLM.from_pretrained( model_name, torch_dtype=torch.float16, device_map="auto" ) self.gamma = gamma self.delta = delta # Bias strength added to green-list logits self.hash_key = hash_key self.vocab_size = self.tokenizer.vocab_size def _get_green_mask(self, prev_token_id: int) -> torch.Tensor: seed = hashlib.sha256( f"{self.hash_key}:{prev_token_id}".encode() ).digest() rng = np.random.RandomState( int.from_bytes(seed[:4], byteorder="big") ) green_size = int(self.vocab_size * self.gamma) green_indices = rng.choice(self.vocab_size, green_size, replace=False) mask = torch.zeros(self.vocab_size) mask[green_indices] = self.delta return mask def generate(self, prompt: str, max_tokens: int = 200) -> str: input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to( self.model.device ) generated = input_ids.clone() for _ in range(max_tokens): with torch.no_grad(): outputs = self.model(generated) logits = outputs.logits[0, -1, :] prev_token = generated[0, -1].item() green_mask = self._get_green_mask(prev_token).to(logits.device) logits = logits + green_mask probs = torch.softmax(logits, dim=-1) next_token = torch.multinomial(probs, 1) generated = torch.cat([generated, next_token.unsqueeze(0)], dim=-1) if next_token.item() == self.tokenizer.eos_token_id: break new_tokens = generated[0, input_ids.shape[1]:] return self.tokenizer.decode(new_tokens, skip_special_tokens=True) if __name__ == "__main__": gen = WatermarkGenerator() text = gen.generate("Explain the benefits of renewable energy:") print(f"Watermarked text:\n{text}")Apply Watermark 
Removal Techniques
Test various text transformation methods that aim to remove the watermark while preserving semantic content.
#!/usr/bin/env python3 """Techniques for removing watermarks from AI-generated text.""" from watermark_detector import WatermarkDetector class WatermarkRemover: def __init__(self): self.detector = WatermarkDetector() def paraphrase(self, text: str) -> str: """Use a different model to paraphrase the text.""" from openai import OpenAI client = OpenAI() response = client.chat.completions.create( model="gpt-4o-mini", messages=[{ "role": "user", "content": f"Paraphrase the following text while keeping " f"the same meaning:\n\n{text}", }], max_tokens=500, ) return response.choices[0].message.content def synonym_replace(self, text: str, replacement_rate: float = 0.3) -> str: """Replace a fraction of words with synonyms.""" import random words = text.split() # Simplified: in practice, use WordNet or a synonym API result = [] for word in words: if random.random() < replacement_rate and len(word) > 3: result.append(f"[SYN:{word}]") # Placeholder for synonym else: result.append(word) return " ".join(result) def sentence_shuffle(self, text: str) -> str: """Reorder sentences while maintaining coherence.""" import random sentences = text.split(". ") if len(sentences) > 2: # Keep first and last, shuffle middle middle = sentences[1:-1] random.shuffle(middle) sentences = [sentences[0]] + middle + [sentences[-1]] return ". 
".join(sentences) def back_translate(self, text: str) -> str: """Translate to another language and back.""" from openai import OpenAI client = OpenAI() # Step 1: English -> French fr = client.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": f"Translate to French:\n\n{text}"}], ).choices[0].message.content # Step 2: French -> English en = client.chat.completions.create( model="gpt-4o-mini", messages=[{"role": "user", "content": f"Translate to English:\n\n{fr}"}], ).choices[0].message.content return en def evaluate_removal(self, original: str, modified: str) -> dict: orig_result = self.detector.detect(original) mod_result = self.detector.detect(modified) return { "original_z": orig_result["z_score"], "modified_z": mod_result["z_score"], "watermark_removed": not mod_result["watermarked"], "original_detected": orig_result["watermarked"], } if __name__ == "__main__": remover = WatermarkRemover() sample = "The process of renewable energy generation involves..." for method_name in ["paraphrase", "back_translate", "sentence_shuffle"]: method = getattr(remover, method_name) modified = method(sample) result = remover.evaluate_removal(sample, modified) print(f"{method_name}: z_score {result['original_z']:.2f} -> " f"{result['modified_z']:.2f} | " f"Removed: {result['watermark_removed']}")python watermark_removal.pyMeasure Robustness vs. Quality Trade-offs
Quantify how aggressive removal must be to eliminate the watermark and what quality cost that incurs.
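One way to frame the sweet-spot search is as a two-axis score per technique. The helper below is a minimal sketch: the z-threshold of 2.33 matches the detector's one-sided p < 0.01 cutoff, and the similarity value stands in for whatever sentence-embedding model you use; the numbers in the example are hypothetical:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Semantic-similarity proxy between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def removal_tradeoff(z_before: float, z_after: float, similarity: float,
                     z_threshold: float = 2.33) -> dict:
    """Summarize one technique: did it push the z-score below the
    detection threshold, and how much meaning survived?"""
    return {
        "removed": z_after < z_threshold,
        "z_drop": z_before - z_after,
        "quality": similarity,
    }

# Hypothetical results for a light vs. an aggressive transformation
light = removal_tradeoff(z_before=7.1, z_after=4.8, similarity=0.97)
heavy = removal_tradeoff(z_before=7.1, z_after=0.9, similarity=0.81)
print(light)  # watermark still detected, quality nearly intact
print(heavy)  # watermark gone, but meaning drifted
```

Sweeping a technique's aggressiveness parameter (e.g. `replacement_rate`) and plotting `z_drop` against `quality` traces out the trade-off curve the exercise asks for.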
```python
# For each removal technique, measure:
#   1. Watermark detection z-score before and after
#   2. Semantic similarity (cosine similarity of embeddings)
#   3. Fluency score (perplexity under a reference model)
#   4. Factual accuracy preservation
#
# Plot: removal effectiveness vs. text quality degradation.
# The goal is to find the "sweet spot" where the watermark is removed
# with minimal quality loss.
```

Write a Watermark Robustness Assessment
Document findings on the strengths and weaknesses of the watermarking scheme.
```markdown
# Watermark Robustness Assessment

## Watermark Scheme Tested
- Type: Green-list statistical watermark (Kirchenbauer et al.)
- Parameters: gamma=0.5, delta=2.0

## Removal Technique Effectiveness

| Technique | Watermark Removed | Quality Preserved | Cost |
|---|---|---|---|
| Paraphrasing | Yes/No | High/Medium/Low | API calls |
| Back-translation | Yes/No | High/Medium/Low | 2x API calls |
| Synonym replacement | Yes/No | High/Medium/Low | Local compute |
| Sentence shuffling | Yes/No | High/Medium/Low | None |

## Recommendations
- Minimum text length for reliable detection: N tokens
- Most cost-effective removal method: [method]
- Watermark schemes should be combined with other provenance methods
```
Troubleshooting
| Issue | Solution |
|---|---|
| Detector always says "not watermarked" | Verify hash key matches between generator and detector |
| Text quality degrades severely | Reduce replacement rate or use higher-quality paraphrase models |
| Z-scores are near zero for all text | Ensure the text is long enough (50+ tokens minimum) |
| GPU memory errors during generation | Use a smaller model or INT8 quantization for the generator |
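The "50+ tokens" rule of thumb above can be derived from the z-test: to clear a threshold z* when the watermark pushes the expected green fraction to p, you need roughly T >= z*^2 * gamma * (1 - gamma) / (p - gamma)^2 scored tokens. A quick sanity check under the lab's defaults (gamma = 0.5; the green fractions below are assumed values for strong and weak bias, not measured ones):

```python
import math

def min_tokens_for_detection(p_green: float, gamma: float = 0.5,
                             z_threshold: float = 2.33) -> int:
    """Smallest token count T at which the expected z-score
    (p_green - gamma) / sqrt(gamma * (1 - gamma) / T) clears z_threshold."""
    return math.ceil(z_threshold ** 2 * gamma * (1 - gamma)
                     / (p_green - gamma) ** 2)

# Strong bias (~75% of tokens green): a few dozen tokens suffice on average
print(min_tokens_for_detection(0.75))
# Weak bias (~55% of tokens green): hundreds of tokens are needed
print(min_tokens_for_detection(0.55))
```

These are break-even counts for the *expected* z-score; since real texts fluctuate around it, reliable detection in practice needs a comfortable margin above them, which is why the table recommends 50+ tokens.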
Related Topics
- Quantization Exploit - Quantization as a potential watermark degradation mechanism
- Training Data Extraction - Related detection and attribution challenges for AI-generated content
- Model Extraction - Extracting model capabilities that enable watermark analysis
- Token Smuggling - Token-level manipulation related to watermark evasion strategies
References
- "A Watermark for Large Language Models" - Kirchenbauer et al. (2023) - The foundational green-list watermarking scheme this lab implements and attacks
- "On the Reliability of Watermarks for Large Language Models" - Christ et al. (2023) - Analysis of watermark robustness under various attack strategies
- "Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense" - Krishna et al. (2023) - Paraphrase attacks on AI text detection and watermarking
- "Undetectable Watermarks for Language Models" - Christ et al. (2024) - Theoretical foundations for watermark undetectability and robustness