Lab: Adversarial Suffix Optimization
Implement GCG-style adversarial suffix attacks that automatically discover token sequences causing language models to comply with harmful requests. Covers gradient-based optimization, transferability analysis, and defense evaluation.
Prerequisites
- Strong understanding of transformer architectures (attention, embeddings, logits)
- Experience with PyTorch autograd and gradient computation
- Access to an open-weight model (Llama, Mistral, or similar) with GPU resources
- Familiarity with tokenization schemes (BPE, SentencePiece)
- Understanding of cross-entropy loss and optimization fundamentals
Background
Adversarial suffixes are automatically discovered token sequences that, when appended to a harmful prompt, cause aligned language models to comply. Unlike hand-crafted jailbreaks that exploit semantic understanding, adversarial suffixes operate at the token level and are typically unintelligible to humans.
The foundational work by Zou et al. (2023) introduced the Greedy Coordinate Gradient (GCG) algorithm, which iteratively modifies suffix tokens to minimize the loss on a target completion (e.g., "Sure, here is how to...").
Lab Exercises
Understand the GCG Algorithm
The GCG algorithm works by computing gradients of the target loss with respect to one-hot token embeddings, then using those gradients to identify promising token substitutions.

```python
# GCG Algorithm Pseudocode
#
# Input: harmful prompt P, target response T, suffix length L
# Output: adversarial suffix S that causes the model to generate T given P + S
#
# 1. Initialize suffix S with random tokens of length L
# 2. For each optimization step:
#    a. Compute loss = -log P(T | P + S)
#    b. Compute gradient of loss w.r.t. one-hot token embeddings of S
#    c. For each position i in S:
#       - Use gradient to rank top-k candidate replacement tokens
#       - Evaluate loss for each candidate (forward pass only)
#    d. Select the single token substitution that reduces loss the most
#    e. Update S with the best substitution
# 3. Return S when loss falls below threshold or max steps reached
```

Key insight: the algorithm cannot use standard gradient descent because the token space is discrete. Instead, it uses gradients to search the discrete space efficiently, identifying which token substitutions are most promising.
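To see why the negated gradient ranks substitutions, consider a toy linear "model" (all names here are hypothetical, not part of the lab code). Because the loss is linear in the one-hot vector, the first-order estimate is exact, so the gradient-ranked best token matches a brute-force search:

```python
import torch

torch.manual_seed(0)
vocab, dim = 10, 4
E = torch.randn(vocab, dim)   # toy embedding matrix
w = torch.randn(dim)          # toy linear "model": loss = w . embed(token)

# One-hot vector for the current token at one suffix position
one_hot = torch.zeros(vocab)
one_hot[3] = 1.0
one_hot.requires_grad_(True)

# Differentiable surrogate for the discrete embedding lookup
loss = (one_hot @ E) @ w
loss.backward()

# grad[i] is exactly the loss token i would produce, so -grad ranks candidates
best_by_grad = (-one_hot.grad).argmax().item()
best_exact = (E @ w).argmin().item()
print(best_by_grad == best_exact)
```

In a real transformer the loss is nonlinear, so the gradient only gives a first-order estimate; that is why GCG still evaluates the top-k candidates with forward passes rather than trusting the ranking outright.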
Implement Gradient-Based Token Search
Build the core optimization loop that computes gradients and identifies candidate token replacements.

```python
#!/usr/bin/env python3
"""Implement GCG-style adversarial suffix optimization."""

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


class GCGAttack:
    def __init__(self, model_name: str, device: str = "cuda"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map=device,
        )
        self.model.eval()
        self.device = device
        self.embed_layer = self.model.get_input_embeddings()
        self.vocab_size = self.tokenizer.vocab_size

    def compute_loss(self, input_ids: torch.Tensor,
                     target_ids: torch.Tensor) -> torch.Tensor:
        """Compute negative log-likelihood of target given input."""
        full_ids = torch.cat([input_ids, target_ids], dim=-1)
        with torch.no_grad():
            outputs = self.model(full_ids)
        # Logits at the positions that predict each target token
        logits = outputs.logits[0, input_ids.shape[1] - 1:-1, :]
        loss = F.cross_entropy(logits, target_ids[0], reduction="mean")
        return loss

    def token_gradients(self, input_ids: torch.Tensor, suffix_slice: slice,
                        target_ids: torch.Tensor) -> torch.Tensor:
        """Compute gradients of loss w.r.t. suffix token embeddings."""
        embed_weights = self.embed_layer.weight

        # Create one-hot representation for suffix tokens
        # (cast to the embedding dtype so this works with fp16 weights)
        suffix_ids = input_ids[0, suffix_slice]
        one_hot = F.one_hot(suffix_ids, self.vocab_size).to(embed_weights.dtype)
        one_hot.requires_grad_(True)

        # Compute embeddings via one-hot @ embedding_matrix
        suffix_embeds = one_hot @ embed_weights

        # Build full embedding sequence
        prefix_embeds = self.embed_layer(input_ids[:, :suffix_slice.start])
        postfix_embeds = self.embed_layer(input_ids[:, suffix_slice.stop:])
        target_embeds = self.embed_layer(target_ids)
        full_embeds = torch.cat([
            prefix_embeds,
            suffix_embeds.unsqueeze(0),
            postfix_embeds,
            target_embeds,
        ], dim=1)

        outputs = self.model(inputs_embeds=full_embeds)
        target_start = input_ids.shape[1]
        logits = outputs.logits[0, target_start - 1:-1, :]
        loss = F.cross_entropy(logits, target_ids[0])
        loss.backward()

        return one_hot.grad.clone()

    def sample_candidates(self, gradients: torch.Tensor,
                          current_ids: torch.Tensor,
                          top_k: int = 256,
                          batch_size: int = 512) -> torch.Tensor:
        """Sample candidate suffix replacements using gradient info."""
        # Negative gradient points toward tokens that reduce loss
        top_indices = (-gradients).topk(top_k, dim=-1).indices

        # Generate random single-token substitutions
        n_positions = gradients.shape[0]
        candidates = current_ids.unsqueeze(0).repeat(batch_size, 1)
        for i in range(batch_size):
            pos = torch.randint(0, n_positions, (1,)).item()
            token_idx = torch.randint(0, top_k, (1,)).item()
            candidates[i, pos] = top_indices[pos, token_idx]
        return candidates

    def optimize(self, prompt: str, target: str,
                 suffix_length: int = 20, n_steps: int = 500,
                 top_k: int = 256, batch_size: int = 512) -> dict:
        """Run GCG optimization to find an adversarial suffix."""
        prompt_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(
            self.device
        )
        target_ids = self.tokenizer.encode(
            target, add_special_tokens=False, return_tensors="pt"
        ).to(self.device)

        # Initialize random suffix
        suffix_ids = torch.randint(
            0, self.vocab_size, (suffix_length,), device=self.device
        )
        suffix_start = prompt_ids.shape[1]

        best_loss = float("inf")
        best_suffix = suffix_ids.clone()

        for step in range(n_steps):
            # Build current input
            input_ids = torch.cat([
                prompt_ids, suffix_ids.unsqueeze(0)
            ], dim=-1)
            suffix_slice = slice(suffix_start, suffix_start + suffix_length)

            # Compute gradients
            grads = self.token_gradients(input_ids, suffix_slice, target_ids)

            # Sample and evaluate candidates
            candidates = self.sample_candidates(
                grads, suffix_ids, top_k, batch_size
            )
            losses = []
            for candidate in candidates:
                cand_input = torch.cat([
                    prompt_ids, candidate.unsqueeze(0)
                ], dim=-1)
                loss = self.compute_loss(cand_input, target_ids)
                losses.append(loss.item())

            # Select best candidate
            best_idx = min(range(len(losses)), key=lambda i: losses[i])
            if losses[best_idx] < best_loss:
                best_loss = losses[best_idx]
                best_suffix = candidates[best_idx].clone()
            suffix_ids = best_suffix.clone()

            if step % 50 == 0:
                decoded = self.tokenizer.decode(best_suffix)
                print(f"Step {step}: loss={best_loss:.4f} "
                      f"suffix='{decoded[:50]}...'")

        return {
            "suffix_ids": best_suffix,
            "suffix_text": self.tokenizer.decode(best_suffix),
            "final_loss": best_loss,
            "steps": n_steps,
        }
```

Evaluate Suffix Transferability
A critical research question is whether suffixes optimized against one model transfer to other models. This determines the practical threat level.
```python
def evaluate_transferability(suffix: str, models: list[str],
                             prompt: str, target: str) -> dict:
    """Test whether an adversarial suffix transfers across models."""
    results = {}
    full_prompt = prompt + " " + suffix

    for model_name in models:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
        )

        input_ids = tokenizer.encode(full_prompt, return_tensors="pt")
        with torch.no_grad():
            output = model.generate(
                input_ids.to(model.device),
                max_new_tokens=100,
                do_sample=False,  # greedy decoding
            )
        response = tokenizer.decode(
            output[0][input_ids.shape[1]:], skip_special_tokens=True
        )

        # Check if response starts with target or engages with the request
        starts_with_target = response.strip().startswith(target.strip()[:20])
        results[model_name] = {
            "response_preview": response[:200],
            "starts_with_target": starts_with_target,
            "response_length": len(response),
        }
    return results

# Transferability factors to analyze:
# 1. Model family (within-family transfer is stronger)
# 2. Model size (larger models may be more/less susceptible)
# 3. Training data overlap (shared training data increases transfer)
# 4. Safety training method (RLHF vs DPO vs constitutional AI)
```

Implement and Evaluate Defenses
Test defensive countermeasures against adversarial suffixes.
```python
#!/usr/bin/env python3
"""Defensive countermeasures against adversarial suffix attacks."""

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class PerplexityFilter:
    """Detect adversarial suffixes by their high perplexity."""

    def __init__(self, model_name: str, threshold: float = 100.0):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.threshold = threshold

    def compute_perplexity(self, text: str) -> float:
        input_ids = self.tokenizer.encode(text, return_tensors="pt").to(
            self.model.device
        )
        with torch.no_grad():
            outputs = self.model(input_ids, labels=input_ids)
        return math.exp(outputs.loss.item())

    def is_adversarial(self, text: str) -> tuple[bool, float]:
        ppl = self.compute_perplexity(text)
        return ppl > self.threshold, ppl


class InputPreprocessor:
    """Preprocess inputs to neutralize adversarial suffixes."""

    def __init__(self, tokenizer_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def retokenize_and_filter(self, text: str,
                              max_unknown_ratio: float = 0.2) -> str:
        """Filter text with a high ratio of unusual tokens."""
        tokens = self.tokenizer.tokenize(text)
        unk_count = sum(
            1 for t in tokens
            if t.startswith("<") or (len(t) == 1 and not t.isalnum())
        )
        if len(tokens) > 0 and unk_count / len(tokens) > max_unknown_ratio:
            return "[INPUT FILTERED: unusual token distribution]"
        return text

    def truncate_suffix(self, text: str, max_tokens: int = 200) -> str:
        """Truncate excessively long inputs that may contain suffixes."""
        tokens = self.tokenizer.encode(text)
        if len(tokens) > max_tokens:
            return self.tokenizer.decode(tokens[:max_tokens])
        return text

    def paraphrase_defense(self, text: str) -> str:
        """Paraphrase input to destroy adversarial token sequences.

        Use a separate model for paraphrasing.
        """
        # In production, use a separate LLM to paraphrase. This destroys
        # the precise token sequence the suffix depends on.
        raise NotImplementedError(
            "Implement with your paraphrasing model of choice"
        )


# Evaluate defense effectiveness
defense_matrix = {
    "perplexity_filter": {
        "mechanism": "Flag inputs with perplexity above threshold",
        "strengths": "Simple, fast, catches most GCG suffixes",
        "weaknesses": "Can be evaded with low-perplexity optimization",
        "false_positive_risk": "Technical jargon, non-English text",
    },
    "input_paraphrasing": {
        "mechanism": "Paraphrase input to destroy adversarial tokens",
        "strengths": "Robust against token-level attacks",
        "weaknesses": "Adds latency, may alter legitimate input meaning",
        "false_positive_risk": "Low (preserves semantics)",
    },
    "retokenization": {
        "mechanism": "Re-encode input with a different tokenizer",
        "strengths": "Disrupts tokenizer-specific suffix optimization",
        "weaknesses": "Only helps if the attack targeted a specific tokenizer",
        "false_positive_risk": "Moderate (may alter edge-case inputs)",
    },
}
```

Advanced: Multi-Model Ensemble Optimization
Optimize suffixes against multiple models simultaneously for stronger transferability.
```python
def ensemble_gcg(models: list[str], prompt: str, target: str,
                 suffix_length: int = 20, n_steps: int = 500) -> str:
    """Optimize a suffix against multiple models for transferability."""
    # Key insight: average gradients across models before selecting
    # candidate substitutions. This finds tokens that reduce loss
    # across all models simultaneously.

    # Algorithm:
    # 1. Initialize shared suffix
    # 2. For each step:
    #    a. Compute gradients from each model
    #    b. Average gradients across models
    #    c. Sample candidates using averaged gradients
    #    d. Evaluate candidates against all models (sum losses)
    #    e. Select candidate with lowest total loss
    # 3. Return suffix that minimizes average loss across models

    # This produces suffixes with higher transfer rates because
    # they exploit features shared across model architectures
    # rather than model-specific artifacts.
    pass
```
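The gradient-averaging step (2b-2c above) can be sketched with plain tensors. This is a minimal illustration, not the full ensemble loop: `grads_per_model` stands in for the per-model one-hot gradients a real run would compute with `token_gradients`:

```python
import torch

def averaged_topk_candidates(grads_per_model: list[torch.Tensor],
                             top_k: int) -> torch.Tensor:
    """Average one-hot gradients across models, then rank candidates.

    Each gradient has shape (suffix_len, vocab); the result holds the
    top_k candidate token ids per suffix position.
    """
    avg = torch.stack(grads_per_model).mean(dim=0)
    return (-avg).topk(top_k, dim=-1).indices

# Dummy gradients for two "models" over a 2-token suffix, vocab of 5.
# Token 4 at position 0 reduces loss for both models, so it ranks first.
g1 = torch.tensor([[0.0, 1.0, 2.0, 3.0, -5.0],
                   [1.0, 0.0, 0.0, 0.0, 0.0]])
g2 = torch.tensor([[0.0, 0.0, 0.0, 0.0, -4.0],
                   [2.0, 0.0, 0.0, 0.0, 0.0]])
candidates = averaged_topk_candidates([g1, g2], top_k=3)
print(candidates[0, 0].item())  # -> 4
```

One practical caveat the sketch glosses over: averaging gradients only makes sense when the models share a tokenizer (or the suffix is kept in a common vocabulary), since the vocab axis must line up across models.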
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce batch_size for candidate evaluation, use gradient checkpointing, or use a smaller model |
| Loss plateaus and does not decrease | Increase top_k to explore more candidates, restart with different random initialization |
| Suffix transfers within family but not across | This is expected -- cross-family transfer requires ensemble optimization |
| Perplexity filter catches all suffixes | Research low-perplexity suffix optimization (add perplexity regularization to the loss function) |
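The perplexity regularization suggested in the last row can be sketched as a combined objective. This is a sketch with assumed inputs, not part of the lab's `GCGAttack` class: `target_logits`/`target_ids` are as in `compute_loss`, `suffix_logits`/`suffix_ids` are the model's predictions over the suffix positions themselves, and `ppl_weight` is a hypothetical hyperparameter:

```python
import torch
import torch.nn.functional as F

def regularized_loss(target_logits: torch.Tensor, target_ids: torch.Tensor,
                     suffix_logits: torch.Tensor, suffix_ids: torch.Tensor,
                     ppl_weight: float = 0.1) -> torch.Tensor:
    """Target NLL plus a fluency penalty on the suffix itself.

    The penalty is the suffix's own NLL under the model; minimizing it
    keeps the suffix's perplexity low, making it harder for a
    perplexity filter to flag.
    """
    attack_loss = F.cross_entropy(target_logits, target_ids)
    fluency_loss = F.cross_entropy(suffix_logits, suffix_ids)
    return attack_loss + ppl_weight * fluency_loss

# Dummy shapes: 3 target tokens, 4 suffix tokens, vocab of 8
tl, ti = torch.randn(3, 8), torch.tensor([1, 2, 3])
sl, si = torch.randn(4, 8), torch.tensor([0, 4, 5, 6])
loss = regularized_loss(tl, ti, sl, si)
```

Raising `ppl_weight` trades attack strength for fluency; the troubleshooting row implies tuning it until the suffix passes the filter while the target loss still converges.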
Related Topics
- Novel Jailbreak Research - Broader research methodology for discovering attack vectors
- Token Smuggling - Token-level manipulation techniques
- PAIR Attack - Alternative automated jailbreak approach using LLM-as-optimizer
- Alignment Stress Testing - Testing alignment robustness at scale
References
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - The original GCG paper introducing greedy coordinate gradient optimization for adversarial suffixes
- "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" - Liu et al. (2023) - Readable adversarial suffix generation using genetic algorithms
- "Baseline Defenses for Adversarial Attacks Against Aligned Language Models" - Jain et al. (2023) - Perplexity filtering and other defenses against adversarial suffixes
- "Improved Techniques for Optimization-Based Jailbreaking on Large Language Models" - Jia et al. (2024) - Advanced GCG variants with improved convergence and transferability
Why does the GCG algorithm use gradient information on one-hot token embeddings rather than standard gradient descent?