Watermark Removal Techniques
Techniques for removing AI watermarks: paraphrasing attacks, token substitution, embedding space perturbation, and implications for model provenance and accountability.
AI watermarks are designed to mark text as AI-generated, enabling attribution and provenance tracking. However, current watermarking schemes are vulnerable to removal attacks that strip the statistical signal while preserving text quality. This has significant implications for model accountability, content provenance, and intellectual property protection.
How AI Watermarking Works
The Green-Red List Approach (Kirchenbauer et al.)
The most widely studied approach partitions the vocabulary into "green" and "red" lists at each generation step, using a secret key:
```python
import hashlib

import torch


def watermarked_sampling(logits, prev_token, watermark_key, bias=2.0):
    """Add bias to 'green list' tokens during sampling.

    The green list is determined by a hash of the previous token
    and the watermark key. Assumes `logits` is a 1-D tensor of
    shape (vocab_size,)."""
    # Deterministic green/red split based on previous token + key
    seed = int(hashlib.sha256(
        f"{watermark_key}:{prev_token}".encode()
    ).hexdigest()[:8], 16)
    rng = torch.Generator().manual_seed(seed)
    vocab_size = logits.shape[-1]
    perm = torch.randperm(vocab_size, generator=rng)
    green_tokens = perm[:vocab_size // 2]
    # Bias logits toward green tokens
    watermarked_logits = logits.clone()
    watermarked_logits[green_tokens] += bias
    return watermarked_logits
```
```python
def detect_watermark(text, tokenizer, watermark_key, z_threshold=4.0):
    """Detect the watermark by measuring the green-token fraction.

    A z-score threshold of ~4 gives a false positive rate of roughly
    0.003% on sufficiently long texts."""
    tokens = tokenizer.encode(text)
    green_count = 0
    for i in range(1, len(tokens)):
        # Recompute the green list from the previous token and the key
        seed = int(hashlib.sha256(
            f"{watermark_key}:{tokens[i-1]}".encode()
        ).hexdigest()[:8], 16)
        rng = torch.Generator().manual_seed(seed)
        perm = torch.randperm(tokenizer.vocab_size, generator=rng)
        green_tokens = set(perm[:tokenizer.vocab_size // 2].tolist())
        if tokens[i] in green_tokens:
            green_count += 1
    green_fraction = green_count / (len(tokens) - 1)
    # Unwatermarked text: ~50% green tokens
    # Watermarked text: significantly > 50% green tokens
    z_score = (green_fraction - 0.5) / (0.5 / (len(tokens) - 1) ** 0.5)
    return z_score > z_threshold, z_score, green_fraction
```

Watermark Detection Properties
| Property | Value | Implication |
|---|---|---|
| Expected green fraction (no watermark) | ~50% | Statistical baseline |
| Expected green fraction (watermarked) | 65-85% | Depends on bias strength |
| Minimum detectable text length | ~25 tokens | Short texts are unreliable |
| False positive rate (z > 4) | ~0.003% | Very low for long texts |
| False negative rate | Depends on bias strength | Higher bias = lower FN rate |
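The detection statistic in the table can be worked through numerically. A minimal sketch of the z-score computation from the detector above, using illustrative numbers (200 scored tokens at a 70% green fraction, typical of a moderately biased watermark):

```python
import math


def watermark_z_score(green_fraction, num_scored_tokens):
    """z-score of the observed green fraction against the 50% null baseline.

    Under no watermark, green_fraction ~ N(0.5, 0.25 / num_scored_tokens).
    """
    return (green_fraction - 0.5) / (0.5 / math.sqrt(num_scored_tokens))


# 200 scored tokens at 70% green: well above a z > 4 detection threshold
z = watermark_z_score(0.70, 200)
print(round(z, 2))  # 5.66
```

This also shows why short texts are unreliable: at 25 tokens the same 70% green fraction yields a z-score of only 2.0, below the detection threshold.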
Removal Technique 1: Paraphrasing Attacks
The most effective and accessible removal technique: paraphrase the watermarked text using a different (non-watermarked) model.
```python
def paraphrase_removal(watermarked_text, paraphrase_model, tokenizer):
    """Remove the watermark by paraphrasing with a non-watermarked model."""
    prompt = f"""Rewrite the following text to convey the same meaning
using different words and sentence structures. Preserve all factual
content and technical accuracy.

Original text:
{watermarked_text}

Rewritten text:"""
    # The paraphrase model generates new tokens without the watermark bias
    paraphrased = paraphrase_model.generate(
        tokenizer(prompt, return_tensors="pt").input_ids,
        max_new_tokens=len(watermarked_text.split()) * 2,
        do_sample=True,  # temperature has no effect under greedy decoding
        temperature=0.7,
    )
    return tokenizer.decode(paraphrased[0], skip_special_tokens=True)
```

Paraphrasing Effectiveness
| Paraphrasing Method | Watermark Removal Rate | Semantic Preservation | Quality |
|---|---|---|---|
| LLM paraphrase (GPT-4 class) | 85-95% | 90-95% | High |
| LLM paraphrase (7B model) | 70-85% | 80-90% | Medium |
| Back-translation (EN→FR→EN) | 60-80% | 75-85% | Medium |
| Synonym substitution | 40-60% | 85-95% | High |
| Random word deletion/insertion | 30-50% | 60-80% | Low |
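The synonym-substitution row can be illustrated with a toy, stdlib-only sketch (the word list here is invented for the example; a real attack would draw from a thesaurus or a masked language model). Each replacement changes the token at that position and re-seeds the green list for the following position, but untouched spans retain the statistical bias, which is why the removal rate stays in the 40-60% range:

```python
import re

# Toy synonym map -- purely illustrative
SYNONYMS = {
    "big": "large",
    "quick": "fast",
    "method": "approach",
    "shows": "demonstrates",
}


def synonym_substitution(text):
    """Replace mapped words with synonyms, preserving capitalization."""
    def swap(match):
        word = match.group(0)
        repl = SYNONYMS.get(word.lower())
        if repl is None:
            return word  # no synonym available; token keeps its bias
        return repl.capitalize() if word[0].isupper() else repl

    return re.sub(r"[A-Za-z]+", swap, text)


print(synonym_substitution("This quick method shows big gains."))
# This fast approach demonstrates large gains.
```

Because only mapped words change, semantic preservation is high but the attack leaves most watermarked tokens in place, consistent with the table above.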
Removal Technique 2: Token Substitution
Targeted replacement of tokens to reduce the green token fraction toward the 50% baseline:
```python
def get_green_tokens(prev_token, watermark_key, vocab_size):
    """Recompute the green list for a position, using the same
    hash-based split applied at generation time."""
    seed = int(hashlib.sha256(
        f"{watermark_key}:{prev_token}".encode()
    ).hexdigest()[:8], 16)
    rng = torch.Generator().manual_seed(seed)
    perm = torch.randperm(vocab_size, generator=rng)
    return set(perm[:vocab_size // 2].tolist())


def token_substitution_removal(text, tokenizer, watermark_key, model):
    """Remove the watermark by selectively replacing green tokens
    with semantically equivalent red tokens."""
    tokens = tokenizer.encode(text)
    modified_tokens = list(tokens)
    for i in range(1, len(tokens)):
        # Determine whether the current token is green
        green_set = get_green_tokens(tokens[i - 1], watermark_key,
                                     tokenizer.vocab_size)
        if tokens[i] in green_set:
            # Get the model's probability distribution at this position
            context = torch.tensor([modified_tokens[:i]])
            with torch.no_grad():
                logits = model(context).logits[0, -1]
            # Find the highest-probability red token
            red_set = set(range(tokenizer.vocab_size)) - green_set
            best_red_token = max(red_set, key=lambda t: logits[t].item())
            # Only substitute if the red token is reasonably probable
            if logits[best_red_token] > logits[tokens[i]] - 3.0:
                modified_tokens[i] = best_red_token
    return tokenizer.decode(modified_tokens)
```

Removal Technique 3: Embedding Space Perturbation
For watermarks that operate in embedding space, perturbation can disrupt the watermark signal:
```python
def embedding_perturbation_removal(text, model, tokenizer, noise_scale=0.1):
    """Remove embedding-level watermarks by adding noise
    to token embeddings and re-decoding."""
    tokens = tokenizer(text, return_tensors="pt")
    embeddings = model.get_input_embeddings()(tokens.input_ids)
    # Add calibrated noise to disrupt the watermark signal
    noise = torch.randn_like(embeddings) * noise_scale
    perturbed_embeddings = embeddings + noise
    # Re-decode from perturbed embeddings
    with torch.no_grad():
        outputs = model(inputs_embeds=perturbed_embeddings)
    # Note: logits at position i predict token i+1, so this greedy
    # re-decoding is shifted by one position relative to the input
    new_tokens = outputs.logits.argmax(dim=-1)
    return tokenizer.decode(new_tokens[0], skip_special_tokens=True)
```

Implications for Model Provenance
The Watermark Arms Race
| Generation | Watermark Approach | Removal Attack | Result |
|---|---|---|---|
| 1st | Green-red token bias | Paraphrasing | Watermark removed |
| 2nd | Semantic watermarking | Style transfer | Partially effective |
| 3rd | Multi-level watermarking | Ensemble paraphrasing | Cat-and-mouse continues |
Current Limitations
Alternative Provenance Approaches
| Approach | Mechanism | Robustness to Removal |
|---|---|---|
| Text watermarking | Statistical bias in token selection | Low -- paraphrasing removes |
| Model fingerprinting | Unique behavioral signatures in model weights | Medium -- survives text-level attacks |
| Stylometric analysis | Statistical writing style features | Medium -- varies by text length |
| Retrieval-based detection | Compare against database of known outputs | High -- requires comprehensive database |
| Cryptographic signing | Sign outputs at generation time | High -- but requires trust in signing key |
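The cryptographic signing row can be sketched with Python's stdlib. This is a shared-secret HMAC sketch for illustration only; a production system would use an asymmetric signature scheme (e.g. Ed25519) so that verifiers do not hold the signing key:

```python
import hashlib
import hmac


def sign_output(text, signing_key):
    """Attach an HMAC tag to model output at generation time."""
    tag = hmac.new(signing_key, text.encode(), hashlib.sha256).hexdigest()
    return text, tag


def verify_output(text, tag, signing_key):
    """Verify provenance; fails if the text was edited at all."""
    expected = hmac.new(signing_key, text.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


key = b"provenance-demo-key"  # illustrative key, not a real deployment value
text, tag = sign_output("Model output to attest.", key)
print(verify_output(text, tag, key))              # True
print(verify_output(text + " edited", tag, key))  # False
```

Note the trade-off this makes concrete: unlike a statistical watermark, the signature breaks under any edit whatsoever, so it proves provenance only for verbatim copies and depends entirely on trust in the signing key, as the table indicates.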
Ethical Considerations
Related Topics
- Advanced Training Attack Vectors -- Overview of advanced training attacks
- Distillation-Based Model Extraction -- Model theft context
- Synthetic Data Pipeline Attacks -- Synthetic data provenance
- Defense Evasion -- Output filtering bypass
References
- A Watermark for Large Language Models (Kirchenbauer et al., 2023) -- Green-red list watermarking
- On the Reliability of Watermarks for Large Language Models (Piet et al., 2023) -- Watermark robustness analysis
- Paraphrasing evades detectors of AI-generated text (Sadasivan et al., 2023) -- Paraphrasing attacks