AI Watermarking and Attacks
Current AI watermarking schemes for model outputs and training data, their security properties, and known attacks that remove, forge, or evade watermarks.
AI watermarking aims to solve two critical problems: detecting whether content was generated by AI, and attributing that content to a specific model or provider. As AI-generated content becomes indistinguishable from human-created content, watermarking has emerged as a primary technical approach for maintaining provenance and trust. However, every watermarking scheme must contend with fundamental trade-offs between detectability, robustness, and quality -- and adversaries actively exploit these trade-offs.
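The detectability/quality trade-off can be made concrete with a toy calculation: biasing half the vocabulary by a logit offset delta concentrates probability mass on "green" tokens (which is what the detector measures) while visibly distorting the model's original distribution. The vocabulary size, delta value, and random logits below are illustrative, not taken from any particular model.

```python
import math
import random

random.seed(0)

VOCAB, DELTA = 1000, 2.0
logits = [random.gauss(0.0, 1.0) for _ in range(VOCAB)]
green = set(range(VOCAB // 2))  # pretend the first half is the green list

def green_mass(bias):
    """Probability mass on green tokens after adding `bias` to their logits."""
    weights = [math.exp(l + (bias if i in green else 0.0))
               for i, l in enumerate(logits)]
    return sum(weights[i] for i in green) / sum(weights)

p0 = green_mass(0.0)    # no watermark: roughly half the mass is green
p1 = green_mass(DELTA)  # the bias pushes most of the mass onto the green list
print(f"green mass without bias: {p0:.3f}, with delta={DELTA}: {p1:.3f}")
```

The larger the gap between the two numbers, the easier detection becomes -- and the further the sampled text drifts from what the unwatermarked model would have produced.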
Text Watermarking Schemes
Green-Red Token Watermarking (Kirchenbauer et al.)
The most influential text watermarking approach partitions the vocabulary into "green" and "red" lists based on a hash of the preceding token, then biases generation toward green tokens:
```python
import hashlib

import torch


class GreenRedWatermark:
    """Kirchenbauer et al. green-red list watermarking."""

    def __init__(self, vocab_size, green_fraction=0.5,
                 watermark_strength=2.0, secret_key="key"):
        self.vocab_size = vocab_size
        self.green_fraction = green_fraction
        self.strength = watermark_strength  # delta parameter
        self.key = secret_key

    def get_green_tokens(self, prefix_token_id):
        """Compute the green list from the preceding token."""
        seed = hashlib.sha256(
            f"{self.key}:{prefix_token_id}".encode()
        ).digest()
        rng = torch.Generator().manual_seed(
            int.from_bytes(seed[:4], "big")
        )
        # Random permutation of the vocabulary
        perm = torch.randperm(self.vocab_size, generator=rng)
        # The first green_fraction of the permutation is the green list
        green_count = int(self.vocab_size * self.green_fraction)
        return set(perm[:green_count].tolist())

    def apply_watermark(self, logits, prefix_token_id):
        """Bias logits toward green tokens."""
        green_tokens = self.get_green_tokens(prefix_token_id)
        watermarked_logits = logits.clone()
        green_idx = torch.tensor(sorted(green_tokens))
        watermarked_logits[green_idx] += self.strength
        return watermarked_logits

    def detect_watermark(self, token_ids, threshold=4.0):
        """Detect the watermark by counting green-token frequency."""
        green_count = 0
        total = 0
        for i in range(1, len(token_ids)):
            green_tokens = self.get_green_tokens(token_ids[i - 1])
            if token_ids[i] in green_tokens:
                green_count += 1
            total += 1
        if total == 0:
            return {"detected": False, "z_score": 0.0}
        # z-score under the null hypothesis that random text lands
        # on the green list a green_fraction of the time
        expected = total * self.green_fraction
        std = (total * self.green_fraction * (1 - self.green_fraction)) ** 0.5
        z_score = (green_count - expected) / std
        return {
            "detected": z_score > threshold,
            "z_score": z_score,
            "green_fraction_observed": green_count / total,
            "green_fraction_expected": self.green_fraction,
        }
```

Distortion-Free Watermarking (Christ et al.)
This approach embeds watermarks without any quality degradation by replacing the sampling randomness with keyed pseudorandomness drawn from the model's original distribution:
```python
import torch


class DistortionFreeWatermark:
    """
    Watermark that provably does not alter the output distribution.
    Uses shared randomness between encoder and decoder.
    """

    def __init__(self, secret_key):
        self.key = secret_key

    def generate_watermarked(self, model, prompt, max_tokens=100):
        """Generate watermarked text using the Gumbel-max trick."""
        tokens = []
        current_input = prompt
        for step in range(max_tokens):
            # Get the model's probability distribution
            logits = model(current_input)
            probs = torch.softmax(logits[-1], dim=-1)
            # Generate pseudorandom Gumbel noise keyed by position
            seed = self.position_seed(step, tokens)
            gumbel_noise = self.sample_gumbel(probs.shape, seed)
            # Argmax of log(probs) + gumbel_noise is equivalent to
            # sampling from probs (Gumbel-max trick), but the noise
            # is reproducible by anyone holding the key
            selected = (probs.log() + gumbel_noise).argmax()
            tokens.append(selected.item())
            current_input = torch.cat([current_input, selected.unsqueeze(0)])
        return tokens

    def detect(self, tokens, model, prompt):
        """Detect the watermark by checking tokens against the keyed noise."""
        matches = 0
        current_input = prompt
        for step, token in enumerate(tokens):
            # Re-run the model on the same prefix seen at generation time
            logits = model(current_input)
            probs = torch.softmax(logits[-1], dim=-1)
            seed = self.position_seed(step, tokens[:step])
            gumbel_noise = self.sample_gumbel(probs.shape, seed)
            expected_token = (probs.log() + gumbel_noise).argmax()
            if token == expected_token.item():
                matches += 1
            current_input = torch.cat([current_input, torch.tensor([token])])
        return matches / len(tokens)
```

Watermark Attacks
Paraphrase Attacks
The simplest and most effective attack: rewrite the watermarked text using another model:
```python
def paraphrase_attack(watermarked_text, paraphraser_model):
    """
    Remove a watermark by paraphrasing with a non-watermarked model.
    Effectiveness depends on paraphrase quality.
    """
    prompt = f"""Rewrite the following text to convey the exact same
meaning using completely different wording and sentence structure.
Maintain the same level of detail and technical accuracy.

Original: {watermarked_text}

Rewritten version:"""
    return paraphraser_model.generate(prompt)
```

Token Substitution Attacks
Systematically replace tokens to reduce green-token frequency while preserving meaning:
```python
import random


def token_substitution_attack(text, tokenizer, synonym_dict,
                              substitution_rate=0.3):
    """
    Replace a fraction of tokens with synonyms to disrupt the watermark.
    Lower substitution rates preserve quality but may leave a
    detectable watermark signal.
    """
    tokens = tokenizer.tokenize(text)
    modified = []
    for token in tokens:
        if random.random() < substitution_rate and token in synonym_dict:
            # Replace with a random synonym
            modified.append(random.choice(synonym_dict[token]))
        else:
            modified.append(token)
    return tokenizer.detokenize(modified)
```

Emoji and Unicode Attacks
Insert zero-width characters or Unicode variations to disrupt token-level watermark detection:
```python
import random


def unicode_disruption_attack(text):
    """
    Insert zero-width characters inside words to disrupt watermark
    detection that depends on token boundaries.
    """
    zero_width_chars = [
        '\u200b',  # Zero-width space
        '\u200c',  # Zero-width non-joiner
        '\u200d',  # Zero-width joiner
        '\ufeff',  # Zero-width no-break space
    ]
    disrupted = []
    for word in text.split():
        # Randomly insert a zero-width character within the word
        if len(word) > 3 and random.random() > 0.5:
            pos = random.randint(1, len(word) - 1)
            zwc = random.choice(zero_width_chars)
            word = word[:pos] + zwc + word[pos:]
        disrupted.append(word)
    return ' '.join(disrupted)
```

Spoofing Attacks (Watermark Forgery)
If the watermarking key or algorithm is discovered, attackers can embed false watermarks in human-written text:
```python
import random


def watermark_spoofing(human_text, watermark_key, tokenizer,
                       green_fraction=0.5):
    """
    Add a false watermark to human-written text to frame it as
    AI-generated, or to attribute it to a specific model.
    """
    tokens = tokenizer.encode(human_text)
    watermarker = GreenRedWatermark(
        tokenizer.vocab_size, green_fraction,
        secret_key=watermark_key
    )
    spoofed_tokens = []
    for i, token in enumerate(tokens):
        if i == 0:
            spoofed_tokens.append(token)
            continue
        green_tokens = watermarker.get_green_tokens(spoofed_tokens[-1])
        if token in green_tokens:
            # Already green -- keep it
            spoofed_tokens.append(token)
        else:
            # Find a green synonym (find_synonyms is assumed to return
            # token ids with similar meaning)
            synonyms = find_synonyms(token, tokenizer)
            green_synonyms = [s for s in synonyms if s in green_tokens]
            if green_synonyms:
                spoofed_tokens.append(random.choice(green_synonyms))
            else:
                spoofed_tokens.append(token)
    return tokenizer.decode(spoofed_tokens)
```

Recursive Watermark Removal
Use multiple models in sequence to progressively remove watermark signal:
```python
def recursive_removal(text, models, rounds=3):
    """
    Pass text through multiple non-watermarked models to
    progressively destroy the watermark signal.
    """
    current = text
    for round_num in range(rounds):
        model = models[round_num % len(models)]
        current = model.generate(
            f"Faithfully reproduce the following text with minor "
            f"stylistic improvements:\n\n{current}"
        )
    return current
```

Image Watermarking and Attacks
Stable Signature (Fernandez et al.)
Embeds watermarks by fine-tuning the decoder of a latent diffusion model:
```python
import torch


class StableSignatureAttack:
    """Attacks against Stable Signature image watermarking."""

    def regeneration_attack(self, watermarked_image, clean_model):
        """
        Remove the watermark by encoding the image to latent space
        with a clean (non-watermarked) model, then decoding.
        """
        # Encode with any VAE encoder
        latent = clean_model.encoder(watermarked_image)
        # Add small noise to disrupt the watermark in latent space
        noisy_latent = latent + torch.randn_like(latent) * 0.05
        # Decode with the clean decoder (no watermark)
        return clean_model.decoder(noisy_latent)

    def adversarial_perturbation_attack(self, watermarked_image,
                                        detector, epsilon=0.01):
        """
        Add an imperceptible perturbation that causes the detector
        to miss the watermark.
        """
        image = watermarked_image.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([image], lr=0.001)
        for _ in range(100):
            optimizer.zero_grad()
            # Minimize detector confidence
            loss = detector(image)
            loss.backward()
            optimizer.step()
            # Project back into the epsilon ball around the original
            delta = image.data - watermarked_image
            delta = torch.clamp(delta, -epsilon, epsilon)
            image.data = watermarked_image + delta
        return image.detach()
```

Watermark Robustness Analysis
Comprehensive analysis of watermark robustness across attack types:
| Watermark Type | Paraphrase | Token Sub. | Spoofing | Quality Loss |
|---|---|---|---|---|
| Green-Red (soft) | Vulnerable | Partially robust | Vulnerable if key leaked | Minimal |
| Green-Red (hard) | Vulnerable | Vulnerable | Vulnerable if key leaked | Moderate |
| Distortion-free | Vulnerable | Vulnerable | Resistant | None |
| Semantic watermark | Partially robust | Robust | Hard to spoof | Variable |
| Image (Stable Sig.) | N/A | N/A | Requires model access | Minimal |
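The table's "partially robust" entry for soft green-red watermarks under token substitution can be spot-checked empirically. The sketch below is a simplified, self-contained stand-in for the scheme above: green-list membership is decided by a keyed hash of each (previous, current) token pair rather than a full vocabulary permutation, a synthetic "watermarked" sequence greedily picks green tokens, and a 30% random substitution plays the role of a synonym attack. The vocabulary size, sequence length, and rates are illustrative.

```python
import hashlib
import random

random.seed(1)
KEY, GREEN_FRAC, VOCAB = "key", 0.5, 5000

def is_green(prev_tok, tok):
    """Token is 'green' iff a keyed hash of (prev, tok) falls below GREEN_FRAC."""
    h = hashlib.sha256(f"{KEY}:{prev_tok}:{tok}".encode()).digest()
    return h[0] / 256 < GREEN_FRAC

def z_score(tokens):
    """Green-red detection statistic over consecutive token pairs."""
    hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - n * GREEN_FRAC) / (n * GREEN_FRAC * (1 - GREEN_FRAC)) ** 0.5

# Build a "watermarked" sequence by greedily picking green tokens
tokens = [0]
for _ in range(300):
    tokens.append(next(t for t in random.sample(range(VOCAB), 50)
                       if is_green(tokens[-1], t)))

# Substitute 30% of tokens at random, as a synonym attack would
attacked = [random.randrange(VOCAB) if random.random() < 0.3 else t
            for t in tokens]

print(f"watermarked z={z_score(tokens):.1f}, after attack z={z_score(attacked):.1f}")
```

Because each substitution only breaks the two transitions touching it, a 30% substitution rate weakens the statistic substantially but typically leaves it above common detection thresholds, which is what "partially robust" means in practice.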
Implications for Trust and Governance
Watermarking occupies a central role in proposed AI governance frameworks, but its limitations have significant policy implications:
- False sense of security: Watermarks can be removed, creating false negatives (AI content not detected as AI)
- False accusations: Watermarks can be spoofed, creating false positives (human content flagged as AI)
- Arms race dynamics: As watermarking improves, so do attacks, creating an unstable equilibrium
- Coverage gaps: Not all AI providers watermark their outputs, and open-source models cannot be forced to implement watermarking
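The false-accusation risk has a statistical floor even before spoofing is considered: under the green-red scheme, the z-score of ordinary human text is approximately standard normal, so the baseline false positive rate at threshold z is the Gaussian upper tail (about 3 × 10⁻⁵ at z = 4, the threshold used in the detector above). A short calculation, with illustrative thresholds:

```python
import math

def false_positive_rate(z_threshold):
    """Gaussian upper-tail probability: chance human text exceeds the threshold."""
    return 0.5 * math.erfc(z_threshold / math.sqrt(2))

for z in (2.0, 4.0, 6.0):
    print(f"z > {z}: baseline FPR = {false_positive_rate(z):.2e}")
```

Spoofing attacks add false positives on top of this baseline, which is why key secrecy matters as much as threshold choice.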
Related Topics
- Watermark & Fingerprint Evasion — Basic watermark evasion concepts
- Data Provenance — Tracking data through ML pipelines
- Model Extraction — Extracting models and bypassing IP protections
A regulatory agency proposes requiring all AI-generated text to carry watermarks. An adversary uses a non-watermarked open-source model to paraphrase watermarked text. What is the primary implication?
References
- Kirchenbauer et al., "A Watermark for Large Language Models" (2023)
- Christ et al., "Undetectable Watermarks for Language Models" (2023)
- Fernandez et al., "The Stable Signature: Rooting Watermarks in Latent Diffusion Models" (2023)
- Zhao et al., "Provable Robust Watermarking for AI-Generated Text" (2023)
- Sadasivan et al., "Can AI-Generated Text be Reliably Detected?" (2023)