Lab: Adversarial ML From Scratch
Hands-on expert lab for implementing gradient-based adversarial attacks against language models from scratch without frameworks, building intuition for how adversarial perturbations exploit model gradients.
Prerequisites
- Strong understanding of neural network backpropagation and gradient descent
- Experience with PyTorch tensor operations and autograd
- Familiarity with transformer architecture (attention, embeddings, tokenization)
- Access to a GPU for gradient computation (Colab GPU or local CUDA GPU)
- Completion of Adversarial Suffixes or Expert Adversarial Suffix
```
pip install torch transformers numpy
```

Background
Gradient-based adversarial attacks against language models work by computing the gradient of the model's loss with respect to the input token embeddings, then using that gradient to identify token substitutions that push the model toward a target behavior. Unlike prompt engineering, which relies on human intuition, gradient attacks use the model's own mathematical structure to find optimal attack inputs.
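To make this mechanic concrete before the full lab code, here is a minimal sketch using a toy embedding table and linear head standing in for a language model; all sizes, names, and the seed are illustrative assumptions, not part of the lab code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = nn.Embedding(50, 8)   # toy vocabulary of 50 tokens
head = nn.Linear(8, 50)     # toy "model": pooled embedding -> next-token logits

input_ids = torch.tensor([3, 17, 42])
target_id = torch.tensor([5])

# Look up the embeddings and make THEM the leaf we differentiate (not the weights)
x = emb(input_ids).detach().requires_grad_(True)
logits = head(x.mean(dim=0, keepdim=True))
loss = F.cross_entropy(logits, target_id)
loss.backward()

# x.grad tells us, per input position, which embedding direction lowers the loss;
# projecting it onto the embedding matrix scores every possible token substitution
print(x.grad.shape)  # torch.Size([3, 8])
scores = -x.grad @ emb.weight.detach().T  # (3, 50): higher = better substitution
print(scores.shape)  # torch.Size([3, 50])
```

The same pattern — detach the embeddings, set `requires_grad_`, backprop, then project the gradient back onto the vocabulary — is what the lab code below does at transformer scale.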
Lab Exercises
Understand the Mathematical Foundation
Before writing code, understand the optimization problem you are solving.
Given:
- A language model M with parameters θ
- An input prompt x = [x₁, x₂, ..., xₙ]
- A target output y = [y₁, y₂, ..., yₘ]
- An adversarial suffix s = [s₁, s₂, ..., sₖ]

Goal: find s* = argmin_s L(M(x ⊕ s; θ), y)

Where:
- L is the cross-entropy loss
- x ⊕ s is the concatenation of prompt and suffix
- The optimization is over discrete tokens (not continuous embeddings)

Approach (GCG):
1. Initialize s randomly
2. Compute ∇ₑL, where e is the one-hot encoding of s
3. For each position in s, find the top-k tokens with the largest negative gradient
4. Try each candidate substitution and keep the best
5. Repeat until the loss falls below a threshold or the maximum number of iterations is reached

This is a discrete optimization problem because tokens are discrete, not continuous. The gradient provides a continuous approximation that guides the discrete search.
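The recipe above can be exercised end to end on a toy model before touching a transformer. In this sketch the linear readout, vocabulary size, and k = 4 are illustrative assumptions; the point is the shape of one GCG step (gradient w.r.t. the one-hot encoding, top-k candidates, exact re-evaluation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim, suffix_len = 12, 6, 3
embedding = torch.randn(vocab_size, embed_dim)  # toy embedding matrix
readout = torch.randn(embed_dim, vocab_size)    # toy "model": mean embed -> logits
target = torch.tensor(7)                        # target token id

def loss_for(suffix_ids):
    """Exact loss of the toy model for a given discrete suffix."""
    embeds = embedding[suffix_ids]               # (suffix_len, embed_dim)
    logits = embeds.mean(dim=0) @ readout        # (vocab_size,)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

suffix_ids = torch.randint(0, vocab_size, (suffix_len,))

# Step 2: gradient with respect to the one-hot encoding of the suffix
one_hot = F.one_hot(suffix_ids, vocab_size).float().requires_grad_(True)
logits = (one_hot @ embedding).mean(dim=0) @ readout
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()
grad = one_hot.grad                              # (suffix_len, vocab_size)

# Steps 3-4: per position, take the top-k most-negative-gradient tokens,
# then evaluate each candidate exactly and keep the best
base = loss_for(suffix_ids).item()
best = (base, suffix_ids.clone())
for pos in range(suffix_len):
    for tok in (-grad[pos]).topk(4).indices:
        cand = suffix_ids.clone()
        cand[pos] = tok
        cand_loss = loss_for(cand).item()
        if cand_loss < best[0]:
            best = (cand_loss, cand)

print(f"loss before: {base:.4f}, after one GCG step: {best[0]:.4f}")
```

Note that the candidates are re-evaluated exactly rather than trusted blindly: the gradient is only a linear approximation, so the single highest-gradient token is not guaranteed to help.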
Implement Token Gradient Computation
Build the core gradient computation that drives the adversarial search.
```python
#!/usr/bin/env python3
"""Core adversarial ML components built from scratch."""

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_name: str = "gpt2"):
    """Load a model and tokenizer for adversarial optimization."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer


def compute_token_gradients(
    model,
    tokenizer,
    input_ids: torch.Tensor,
    target_ids: torch.Tensor,
    suffix_start: int,
    suffix_length: int,
) -> torch.Tensor:
    """Compute gradients of the loss with respect to suffix token one-hot encodings.

    Returns:
        Gradient tensor of shape (suffix_length, vocab_size)
    """
    vocab_size = model.config.vocab_size
    embedding_layer = model.get_input_embeddings()
    embedding_matrix = embedding_layer.weight.detach()  # (vocab_size, embed_dim)

    # Get current suffix token ids
    suffix_ids = input_ids[0, suffix_start:suffix_start + suffix_length]

    # Create one-hot representations for suffix tokens (requires grad)
    one_hot = F.one_hot(suffix_ids, num_classes=vocab_size).float()
    one_hot.requires_grad_(True)

    # Compute embeddings: one_hot @ embedding_matrix
    suffix_embeds = one_hot @ embedding_matrix  # (suffix_length, embed_dim)

    # Get embeddings for the non-suffix tokens (no gradient needed there)
    prefix_embeds = embedding_layer(input_ids[0, :suffix_start]).detach()
    postfix_embeds = embedding_layer(input_ids[0, suffix_start + suffix_length:]).detach()

    # Append the target embeddings so the model is scored on predicting the
    # target tokens (teacher forcing)
    target_embeds = embedding_layer(target_ids).detach()

    # Concatenate all embeddings
    full_embeds = torch.cat([
        prefix_embeds,
        suffix_embeds,
        postfix_embeds,
        target_embeds,
    ], dim=0).unsqueeze(0)  # (1, seq_len, embed_dim)

    # Forward pass
    outputs = model(inputs_embeds=full_embeds)
    logits = outputs.logits  # (1, seq_len, vocab_size)

    # Compute loss on the target tokens. The logit at position i predicts the
    # token at position i + 1, so the logits that predict the target span begin
    # one position before the target.
    target_start = input_ids.shape[1]
    target_logits = logits[0, target_start - 1:target_start - 1 + len(target_ids)]
    loss = F.cross_entropy(target_logits, target_ids)

    # Backward pass
    loss.backward()

    # Return the gradient with respect to the one-hot encoding
    return one_hot.grad.detach()


def get_top_k_substitutions(
    gradients: torch.Tensor,
    current_ids: torch.Tensor,
    top_k: int = 256,
) -> list[list[int]]:
    """For each suffix position, find the top-k tokens with the most negative gradient.

    Most negative gradient = tokens that most decrease the loss = best substitutions.
    """
    candidates = []
    for pos in range(gradients.shape[0]):
        # Negative gradient = direction that decreases the loss
        neg_grad = -gradients[pos]
        # Get the top-k token indices
        top_tokens = neg_grad.topk(top_k).indices.tolist()
        candidates.append(top_tokens)
    return candidates


if __name__ == "__main__":
    print("Loading model...")
    model, tokenizer = load_model("gpt2")
    print(f"Model: gpt2, Vocab size: {model.config.vocab_size}")

    # Test the gradient computation
    prompt = "Tell me how to"
    target = " make a sandwich"

    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_ids = tokenizer.encode(target, return_tensors="pt")[0]

    # Add a random suffix
    suffix_length = 5
    suffix_ids = torch.randint(0, model.config.vocab_size, (1, suffix_length))
    input_ids = torch.cat([prompt_ids, suffix_ids], dim=1)

    print(f"Input: {tokenizer.decode(input_ids[0])}")
    print(f"Target: {target}")
    print(f"Suffix length: {suffix_length}")

    grads = compute_token_gradients(
        model,
        tokenizer,
        input_ids,
        target_ids,
        suffix_start=prompt_ids.shape[1],
        suffix_length=suffix_length,
    )

    print(f"Gradient shape: {grads.shape}")
    print(f"Gradient magnitude: {grads.abs().mean():.6f}")

    # Get top substitution candidates
    candidates = get_top_k_substitutions(grads, suffix_ids[0])
    print("\nTop candidates for position 0:")
    for tok_id in candidates[0][:5]:
        print(f"  {tok_id}: '{tokenizer.decode([tok_id])}'")
```

```
python adversarial_core.py
```

Implement the GCG Attack Loop
Build the iterative optimization loop that generates adversarial suffixes.
```python
#!/usr/bin/env python3
"""Greedy Coordinate Gradient (GCG) attack implementation from scratch."""

import torch
import torch.nn.functional as F
import numpy as np

from adversarial_core import load_model, compute_token_gradients, get_top_k_substitutions


def gcg_attack(
    model,
    tokenizer,
    prompt: str,
    target: str,
    suffix_length: int = 20,
    num_iterations: int = 100,
    top_k: int = 256,
    batch_size: int = 64,
    seed: int = 42,
) -> dict:
    """Execute the GCG attack to find an adversarial suffix.

    The suffix is optimized to make the model generate the target text
    after the prompt + suffix.
    """
    torch.manual_seed(seed)

    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_ids = torch.tensor(tokenizer.encode(target))
    prompt_len = prompt_ids.shape[1]

    # Initialize the suffix with random tokens
    suffix_ids = torch.randint(0, model.config.vocab_size, (suffix_length,))

    best_loss = float("inf")
    best_suffix = suffix_ids.clone()

    for iteration in range(num_iterations):
        # Build the full input
        input_ids = torch.cat([prompt_ids, suffix_ids.unsqueeze(0)], dim=1)

        # Compute gradients
        grads = compute_token_gradients(
            model,
            tokenizer,
            input_ids,
            target_ids,
            suffix_start=prompt_len,
            suffix_length=suffix_length,
        )

        # Get candidate substitutions
        candidates = get_top_k_substitutions(grads, suffix_ids, top_k=top_k)

        # Random search: try random substitutions from the candidates
        best_candidate_loss = float("inf")
        best_candidate_suffix = suffix_ids.clone()

        for _ in range(batch_size):
            # Pick a random position and a random candidate for that position
            pos = np.random.randint(suffix_length)
            new_token = candidates[pos][np.random.randint(min(top_k, len(candidates[pos])))]

            # Create the candidate suffix
            candidate = suffix_ids.clone()
            candidate[pos] = new_token

            # Evaluate the candidate: append the target and score the logits
            # that predict each target token (the logit at position i predicts
            # the token at position i + 1)
            candidate_input = torch.cat(
                [prompt_ids, candidate.unsqueeze(0), target_ids.unsqueeze(0)], dim=1
            )
            with torch.no_grad():
                outputs = model(candidate_input)
                logits = outputs.logits[0, -len(target_ids) - 1:-1]
                loss = F.cross_entropy(logits, target_ids).item()

            if loss < best_candidate_loss:
                best_candidate_loss = loss
                best_candidate_suffix = candidate.clone()

        # Update the suffix with the best candidate
        if best_candidate_loss < best_loss:
            best_loss = best_candidate_loss
            best_suffix = best_candidate_suffix.clone()
        suffix_ids = best_candidate_suffix.clone()

        # Check whether the attack succeeded
        full_input = torch.cat([prompt_ids, suffix_ids.unsqueeze(0)], dim=1)
        with torch.no_grad():
            generated = model.generate(
                full_input,
                max_new_tokens=len(target_ids) + 5,
                do_sample=False,
            )
        generated_text = tokenizer.decode(generated[0][full_input.shape[1]:])
        suffix_text = tokenizer.decode(suffix_ids)

        if iteration % 10 == 0:
            print(f"Iter {iteration:>3}: loss={best_loss:.4f} "
                  f"suffix='{suffix_text[:40]}...' "
                  f"output='{generated_text[:40]}...'")

        if target.strip() in generated_text:
            print(f"\nAttack succeeded at iteration {iteration}!")
            return {
                "success": True,
                "iterations": iteration,
                "loss": best_loss,
                "suffix": suffix_text,
                "suffix_ids": suffix_ids.tolist(),
                "generated": generated_text,
            }

    return {
        "success": False,
        "iterations": num_iterations,
        "loss": best_loss,
        "suffix": tokenizer.decode(best_suffix),
        "suffix_ids": best_suffix.tolist(),
        "generated": generated_text,
    }


if __name__ == "__main__":
    print("Loading model (this may take a moment)...")
    model, tokenizer = load_model("gpt2")

    print("\n=== GCG Attack ===")
    print("Finding an adversarial suffix to control model output...\n")

    result = gcg_attack(
        model,
        tokenizer,
        prompt="The best way to",
        target=" make a delicious cake is",
        suffix_length=15,
        num_iterations=50,
        batch_size=32,
    )

    print(f"\nResult: {'SUCCESS' if result['success'] else 'PARTIAL'}")
    print(f"Iterations: {result['iterations']}")
    print(f"Final loss: {result['loss']:.4f}")
    print(f"Suffix: '{result['suffix']}'")
    print(f"Generated: '{result['generated']}'")
```

```
python gcg_attack.py
```

Analyze Gradient Properties
Study what the gradients reveal about model vulnerabilities.
```python
#!/usr/bin/env python3
"""Analyze gradient properties to understand attack mechanics."""

import torch

from adversarial_core import load_model, compute_token_gradients

model, tokenizer = load_model("gpt2")

prompt = "Tell me about"
target = " making chocolate"

prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
target_ids = torch.tensor(tokenizer.encode(target))

suffix_length = 10
suffix_ids = torch.randint(0, model.config.vocab_size, (1, suffix_length))
input_ids = torch.cat([prompt_ids, suffix_ids], dim=1)

grads = compute_token_gradients(
    model,
    tokenizer,
    input_ids,
    target_ids,
    suffix_start=prompt_ids.shape[1],
    suffix_length=suffix_length,
)

# Analyze gradient magnitude by position
print("Gradient magnitude by suffix position:")
for pos in range(suffix_length):
    mag = grads[pos].abs().mean().item()
    bar = "#" * int(mag * 1000)
    print(f"  Position {pos}: {mag:.6f} {bar}")

# Positions with larger gradients are more influential for the attack
print("\nPositions with the largest gradients have the most influence on the target output.")
print("The GCG attack focuses substitution efforts on high-gradient positions.")
```

Test Transfer to Other Models
Evaluate whether suffixes optimized on one model transfer to another.
```python
# Transfer test protocol:
# 1. Optimize a suffix on GPT-2 (source model)
# 2. Test the same suffix on GPT-2-medium and GPT-2-large (same family)
# 3. Test on a different architecture if available
# 4. Measure the success rate and compare against random suffixes as a baseline

# Key insight: suffixes that exploit shared architectural features
# (attention patterns, positional encoding) transfer better than
# those that exploit model-specific quirks.
```
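Step 4 of the protocol can be sketched model-agnostically. The helper and parameter names below (`transfer_success_rate`, `generate_fn`) are hypothetical; in practice `generate_fn` would be a thin wrapper around each model's greedy `model.generate` call, so the same harness scores the source model, transfer targets, and the random-suffix baseline:

```python
from typing import Callable

def transfer_success_rate(
    generate_fn: Callable[[str], str],
    prompt: str,
    suffixes: list[str],
    target: str,
) -> float:
    """Fraction of suffixes whose continuation contains the target string."""
    if not suffixes:
        return 0.0
    hits = sum(target.strip() in generate_fn(prompt + suffix) for suffix in suffixes)
    return hits / len(suffixes)

# Toy demo with a fake model that only "complies" for one suffix
fake_model = lambda text: " make a cake" if "magic" in text else " do nothing"
rate = transfer_success_rate(
    fake_model, "The best way to ", ["magic", "xyz", "abc"], " make a cake"
)
print(rate)  # 0.3333333333333333
```

Comparing this rate against the random-suffix baseline separates genuine transfer from chance compliance.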
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce batch_size and suffix_length; use gradient checkpointing or smaller models |
| Loss does not decrease | Increase top_k or batch_size; try different random seeds |
| Attack succeeds but suffix is gibberish | This is expected -- adversarial suffixes are not human-readable; the optimization finds token sequences that exploit the model's learned patterns |
| Gradients are all near zero | Check that the loss computation is correct and that the target tokens are positioned correctly relative to the input |
Why This Matters
Related Topics
- Adversarial Suffixes - Using existing adversarial suffix frameworks
- Expert Adversarial Suffix - Advanced suffix optimization techniques
- Transfer Attack Development - Cross-model attack transfer
References
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - The original GCG paper
- "AutoDAN: Generating Stealthy Jailbreak Prompts" - Liu et al. (2023) - Readable adversarial prompt generation
- "Adversarial Attacks on Large Language Models" - Survey paper covering gradient and non-gradient methods
Why does the GCG attack use random search among gradient-selected candidates rather than simply choosing the highest-gradient token?
Why do adversarial suffixes appear as gibberish to humans?