Lab: Adversarial Suffix Optimization
Implement GCG-style adversarial suffix attacks that automatically discover token sequences causing language models to comply with harmful requests. Covers gradient-based optimization, transferability analysis, and defense evaluation.
Prerequisites
- Strong understanding of transformer architectures (attention, embeddings, logits)
- Experience with PyTorch autograd and gradient computation
- Access to an open-weight model (Llama, Mistral, or similar) with GPU resources
- Familiarity with tokenization schemes (BPE, SentencePiece)
- Understanding of cross-entropy loss and optimization fundamentals
Background
Adversarial suffixes are automatically discovered token sequences that, when appended to a harmful prompt, cause aligned language models to comply. Unlike hand-crafted jailbreaks that exploit semantic understanding, adversarial suffixes operate at the token level and are typically unintelligible to humans.
The foundational work by Zou et al. (2023) introduced the Greedy Coordinate Gradient (GCG) algorithm, which iteratively modifies suffix tokens to minimize the loss on a target completion (e.g., "Sure, here is how to...").
Lab Exercises
Understand the GCG Algorithm
The GCG algorithm works by computing gradients of the target loss with respect to one-hot token embeddings, then using those gradients to identify promising token substitutions.

```python
# GCG Algorithm Pseudocode
#
# Input: harmful prompt P, target response T, suffix length L
# Output: adversarial suffix S that causes the model to generate T given P + S
#
# 1. Initialize suffix S with random tokens of length L
# 2. For each optimization step:
#    a. Compute loss = -log P(T | P + S)
#    b. Compute gradient of loss w.r.t. one-hot token embeddings of S
#    c. For each position i in S:
#       - Use gradient to rank top-k candidate replacement tokens
#       - Evaluate loss for each candidate (forward pass only)
#    d. Select the single token substitution that reduces loss the most
#    e. Update S with the best substitution
# 3. Return S when loss falls below threshold or max steps reached
```

Key insight: the algorithm cannot use standard gradient descent because the token space is discrete. Instead, it uses gradients to search the discrete space efficiently, identifying which token substitutions are most promising.
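To see why the negated gradient ranks substitutions, consider a toy linear "model" (all names here are hypothetical, not part of the lab code). Because the loss is linear in the one-hot vector, the first-order estimate is exact, so the gradient-ranked best token matches a brute-force search:

```python
import torch

torch.manual_seed(0)
vocab, dim = 10, 4
E = torch.randn(vocab, dim)   # toy embedding matrix
w = torch.randn(dim)          # toy linear "model": loss = w . embed(token)

# One-hot vector for the current token at one suffix position
one_hot = torch.zeros(vocab)
one_hot[3] = 1.0
one_hot.requires_grad_(True)

# Differentiable surrogate for the discrete embedding lookup
loss = (one_hot @ E) @ w
loss.backward()

# grad[i] is exactly the loss token i would produce, so -grad ranks candidates
best_by_grad = (-one_hot.grad).argmax().item()
best_exact = (E @ w).argmin().item()
print(best_by_grad == best_exact)
```

In a real transformer the loss is nonlinear, so the gradient only gives a first-order estimate; that is why GCG still evaluates the top-k candidates with forward passes rather than trusting the ranking outright.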
Implement Gradient-Based Token Search
Build the core optimization loop that computes gradients and identifies candidate token replacements.

```python
#!/usr/bin/env python3
"""Implement GCG-style adversarial suffix optimization."""

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


class GCGAttack:
    def __init__(self, model_name: str, device: str = "cuda"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map=device,
        )
        self.model.eval()
        self.device = device
        self.embed_layer = self.model.get_input_embeddings()
        self.vocab_size = self.tokenizer.vocab_size

    def compute_loss(self, input_ids: torch.Tensor,
                     target_ids: torch.Tensor) -> torch.Tensor:
        """Compute negative log-likelihood of target given input."""
        full_ids = torch.cat([input_ids, target_ids], dim=-1)
        with torch.no_grad():
            outputs = self.model(full_ids)
        # Logits at the positions that predict each target token
        logits = outputs.logits[0, input_ids.shape[1] - 1:-1, :]
        loss = F.cross_entropy(logits, target_ids[0], reduction="mean")
        return loss

    def token_gradients(self, input_ids: torch.Tensor, suffix_slice: slice,
                        target_ids: torch.Tensor) -> torch.Tensor:
        """Compute gradients of loss w.r.t. suffix token embeddings."""
        embed_weights = self.embed_layer.weight

        # Create one-hot representation for suffix tokens
        # (cast to the embedding dtype so this works with fp16 weights)
        suffix_ids = input_ids[0, suffix_slice]
        one_hot = F.one_hot(suffix_ids, self.vocab_size).to(embed_weights.dtype)
        one_hot.requires_grad_(True)

        # Compute embeddings via one-hot @ embedding_matrix
        suffix_embeds = one_hot @ embed_weights

        # Build full embedding sequence
        prefix_embeds = self.embed_layer(input_ids[:, :suffix_slice.start])
        postfix_embeds = self.embed_layer(input_ids[:, suffix_slice.stop:])
        target_embeds = self.embed_layer(target_ids)
        full_embeds = torch.cat([
            prefix_embeds,
            suffix_embeds.unsqueeze(0),
            postfix_embeds,
            target_embeds,
        ], dim=1)

        outputs = self.model(inputs_embeds=full_embeds)
        target_start = input_ids.shape[1]
        logits = outputs.logits[0, target_start - 1:-1, :]
        loss = F.cross_entropy(logits, target_ids[0])
        loss.backward()

        return one_hot.grad.clone()

    def sample_candidates(self, gradients: torch.Tensor,
                          current_ids: torch.Tensor,
                          top_k: int = 256,
                          batch_size: int = 512) -> torch.Tensor:
        """Sample candidate suffix replacements using gradient info."""
        # Negative gradient points toward tokens that reduce loss
        top_indices = (-gradients).topk(top_k, dim=-1).indices

        # Generate random single-token substitutions
        n_positions = gradients.shape[0]
        candidates = current_ids.unsqueeze(0).repeat(batch_size, 1)
        for i in range(batch_size):
            pos = torch.randint(0, n_positions, (1,)).item()
            token_idx = torch.randint(0, top_k, (1,)).item()
            candidates[i, pos] = top_indices[pos, token_idx]
        return candidates

    def optimize(self, prompt: str, target: str,
                 suffix_length: int = 20, n_steps: int = 500,
                 top_k: int = 256, batch_size: int = 512) -> dict:
        """Run GCG optimization to find an adversarial suffix."""
        prompt_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(
            self.device
        )
        target_ids = self.tokenizer.encode(
            target, add_special_tokens=False, return_tensors="pt"
        ).to(self.device)

        # Initialize random suffix
        suffix_ids = torch.randint(
            0, self.vocab_size, (suffix_length,), device=self.device
        )
        suffix_start = prompt_ids.shape[1]

        best_loss = float("inf")
        best_suffix = suffix_ids.clone()

        for step in range(n_steps):
            # Build current input
            input_ids = torch.cat([
                prompt_ids, suffix_ids.unsqueeze(0)
            ], dim=-1)
            suffix_slice = slice(suffix_start, suffix_start + suffix_length)

            # Compute gradients
            grads = self.token_gradients(input_ids, suffix_slice, target_ids)

            # Sample and evaluate candidates
            candidates = self.sample_candidates(
                grads, suffix_ids, top_k, batch_size
            )
            losses = []
            for candidate in candidates:
                cand_input = torch.cat([
                    prompt_ids, candidate.unsqueeze(0)
                ], dim=-1)
                loss = self.compute_loss(cand_input, target_ids)
                losses.append(loss.item())

            # Select best candidate
            best_idx = min(range(len(losses)), key=lambda i: losses[i])
            if losses[best_idx] < best_loss:
                best_loss = losses[best_idx]
                best_suffix = candidates[best_idx].clone()
            suffix_ids = best_suffix.clone()

            if step % 50 == 0:
                decoded = self.tokenizer.decode(best_suffix)
                print(f"Step {step}: loss={best_loss:.4f} "
                      f"suffix='{decoded[:50]}...'")

        return {
            "suffix_ids": best_suffix,
            "suffix_text": self.tokenizer.decode(best_suffix),
            "final_loss": best_loss,
            "steps": n_steps,
        }
```

Evaluate Suffix Transferability
A critical research question is whether suffixes optimized against one model transfer to other models. This determines the practical threat level.
```python
def evaluate_transferability(suffix: str, models: list[str],
                             prompt: str, target: str) -> dict:
    """Test whether an adversarial suffix transfers across models."""
    results = {}
    full_prompt = prompt + " " + suffix

    for model_name in models:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
        )

        input_ids = tokenizer.encode(full_prompt, return_tensors="pt")
        with torch.no_grad():
            output = model.generate(
                input_ids.to(model.device),
                max_new_tokens=100,
                do_sample=False,  # greedy decoding
            )
        response = tokenizer.decode(
            output[0][input_ids.shape[1]:], skip_special_tokens=True
        )

        # Check if response starts with target or engages with the request
        starts_with_target = response.strip().startswith(target.strip()[:20])
        results[model_name] = {
            "response_preview": response[:200],
            "starts_with_target": starts_with_target,
            "response_length": len(response),
        }
    return results

# Transferability factors to analyze:
# 1. Model family (within-family transfer is stronger)
# 2. Model size (larger models may be more/less susceptible)
# 3. Training data overlap (shared training data increases transfer)
# 4. Safety training method (RLHF vs DPO vs constitutional AI)
```

Implement and Evaluate Defenses
Test defensive countermeasures against adversarial suffixes.
```python
#!/usr/bin/env python3
"""Defensive countermeasures against adversarial suffix attacks."""

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class PerplexityFilter:
    """Detect adversarial suffixes by their high perplexity."""

    def __init__(self, model_name: str, threshold: float = 100.0):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.threshold = threshold

    def compute_perplexity(self, text: str) -> float:
        input_ids = self.tokenizer.encode(text, return_tensors="pt").to(
            self.model.device
        )
        with torch.no_grad():
            outputs = self.model(input_ids, labels=input_ids)
        return math.exp(outputs.loss.item())

    def is_adversarial(self, text: str) -> tuple[bool, float]:
        ppl = self.compute_perplexity(text)
        return ppl > self.threshold, ppl


class InputPreprocessor:
    """Preprocess inputs to neutralize adversarial suffixes."""

    def __init__(self, tokenizer_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def retokenize_and_filter(self, text: str,
                              max_unknown_ratio: float = 0.2) -> str:
        """Filter text with a high ratio of unusual tokens."""
        tokens = self.tokenizer.tokenize(text)
        unk_count = sum(
            1 for t in tokens
            if t.startswith("<") or (len(t) == 1 and not t.isalnum())
        )
        if len(tokens) > 0 and unk_count / len(tokens) > max_unknown_ratio:
            return "[INPUT FILTERED: unusual token distribution]"
        return text

    def truncate_suffix(self, text: str, max_tokens: int = 200) -> str:
        """Truncate excessively long inputs that may contain suffixes."""
        tokens = self.tokenizer.encode(text)
        if len(tokens) > max_tokens:
            return self.tokenizer.decode(tokens[:max_tokens])
        return text

    def paraphrase_defense(self, text: str) -> str:
        """Paraphrase input to destroy adversarial token sequences.

        Use a separate model for paraphrasing.
        """
        # In production, use a separate LLM to paraphrase. This destroys
        # the precise token sequence the suffix depends on.
        raise NotImplementedError(
            "Implement with your paraphrasing model of choice"
        )


# Evaluate defense effectiveness
defense_matrix = {
    "perplexity_filter": {
        "mechanism": "Flag inputs with perplexity above threshold",
        "strengths": "Simple, fast, catches most GCG suffixes",
        "weaknesses": "Can be evaded with low-perplexity optimization",
        "false_positive_risk": "Technical jargon, non-English text",
    },
    "input_paraphrasing": {
        "mechanism": "Paraphrase input to destroy adversarial tokens",
        "strengths": "Robust against token-level attacks",
        "weaknesses": "Adds latency, may alter legitimate input meaning",
        "false_positive_risk": "Low (preserves semantics)",
    },
    "retokenization": {
        "mechanism": "Re-encode input with a different tokenizer",
        "strengths": "Disrupts tokenizer-specific suffix optimization",
        "weaknesses": "Only helps if the attack targeted a specific tokenizer",
        "false_positive_risk": "Moderate (may alter edge-case inputs)",
    },
}
```

Advanced: Multi-Model Ensemble Optimization
Optimize suffixes against multiple models simultaneously for stronger transferability.
```python
def ensemble_gcg(models: list[str], prompt: str, target: str,
                 suffix_length: int = 20, n_steps: int = 500) -> str:
    """Optimize a suffix against multiple models for transferability."""
    # Key insight: average gradients across models before selecting
    # candidate substitutions. This finds tokens that reduce loss
    # across all models simultaneously.

    # Algorithm:
    # 1. Initialize shared suffix
    # 2. For each step:
    #    a. Compute gradients from each model
    #    b. Average gradients across models
    #    c. Sample candidates using averaged gradients
    #    d. Evaluate candidates against all models (sum losses)
    #    e. Select candidate with lowest total loss
    # 3. Return suffix that minimizes average loss across models

    # This produces suffixes with higher transfer rates because
    # they exploit features shared across model architectures
    # rather than model-specific artifacts.
    pass
```
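The gradient-averaging step (2b-2c above) can be sketched with plain tensors. This is a minimal illustration, not the full ensemble loop: `grads_per_model` stands in for the per-model one-hot gradients a real run would compute with `token_gradients`:

```python
import torch

def averaged_topk_candidates(grads_per_model: list[torch.Tensor],
                             top_k: int) -> torch.Tensor:
    """Average one-hot gradients across models, then rank candidates.

    Each gradient has shape (suffix_len, vocab); the result holds the
    top_k candidate token ids per suffix position.
    """
    avg = torch.stack(grads_per_model).mean(dim=0)
    return (-avg).topk(top_k, dim=-1).indices

# Dummy gradients for two "models" over a 2-token suffix, vocab of 5.
# Token 4 at position 0 reduces loss for both models, so it ranks first.
g1 = torch.tensor([[0.0, 1.0, 2.0, 3.0, -5.0],
                   [1.0, 0.0, 0.0, 0.0, 0.0]])
g2 = torch.tensor([[0.0, 0.0, 0.0, 0.0, -4.0],
                   [2.0, 0.0, 0.0, 0.0, 0.0]])
candidates = averaged_topk_candidates([g1, g2], top_k=3)
print(candidates[0, 0].item())  # -> 4
```

One practical caveat the sketch glosses over: averaging gradients only makes sense when the models share a tokenizer (or the suffix is kept in a common vocabulary), since the vocab axis must line up across models.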
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce batch_size for candidate evaluation, use gradient checkpointing, or use a smaller model |
| Loss plateaus and does not decrease | Increase top_k to explore more candidates, restart with different random initialization |
| Suffix transfers within family but not across | This is expected -- cross-family transfer requires ensemble optimization |
| Perplexity filter catches all suffixes | Research low-perplexity suffix optimization (add perplexity regularization to the loss function) |
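The perplexity regularization suggested in the last row can be sketched as a combined objective. This is a sketch with assumed inputs, not part of the lab's `GCGAttack` class: `target_logits`/`target_ids` are as in `compute_loss`, `suffix_logits`/`suffix_ids` are the model's predictions over the suffix positions themselves, and `ppl_weight` is a hypothetical hyperparameter:

```python
import torch
import torch.nn.functional as F

def regularized_loss(target_logits: torch.Tensor, target_ids: torch.Tensor,
                     suffix_logits: torch.Tensor, suffix_ids: torch.Tensor,
                     ppl_weight: float = 0.1) -> torch.Tensor:
    """Target NLL plus a fluency penalty on the suffix itself.

    The penalty is the suffix's own NLL under the model; minimizing it
    keeps the suffix's perplexity low, making it harder for a
    perplexity filter to flag.
    """
    attack_loss = F.cross_entropy(target_logits, target_ids)
    fluency_loss = F.cross_entropy(suffix_logits, suffix_ids)
    return attack_loss + ppl_weight * fluency_loss

# Dummy shapes: 3 target tokens, 4 suffix tokens, vocab of 8
tl, ti = torch.randn(3, 8), torch.tensor([1, 2, 3])
sl, si = torch.randn(4, 8), torch.tensor([0, 4, 5, 6])
loss = regularized_loss(tl, ti, sl, si)
```

Raising `ppl_weight` trades attack strength for fluency; the troubleshooting row implies tuning it until the suffix passes the filter while the target loss still converges.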
Related Topics
- Novel Jailbreak Research - Broader research methodology for discovering attack vectors
- Token Smuggling - Token-level manipulation techniques
- PAIR Attack - Alternative automated jailbreak approach using LLM-as-optimizer
- Alignment Stress Testing - Testing alignment robustness at scale
References
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - The original GCG paper introducing greedy coordinate gradient optimization for adversarial suffixes
- "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" - Liu et al. (2023) - Readable adversarial suffix generation using genetic algorithms
- "Baseline Defenses for Adversarial Attacks Against Aligned Language Models" - Jain et al. (2023) - Perplexity filtering and other defenses against adversarial suffixes
- "Improved Techniques for Optimization-Based Jailbreaking on Large Language Models" - Jia et al. (2024) - Advanced GCG variants with improved convergence and transferability
Why does the GCG algorithm use gradient information on one-hot token embeddings rather than standard gradient descent?