Lab: Adversarial Suffix Optimization
Implement GCG-style adversarial suffix attacks that automatically discover token sequences causing language models to comply with harmful requests. Covers gradient-based optimization, transferability analysis, and defense evaluation.
Prerequisites
- Strong understanding of transformer architectures (attention, embeddings, logits)
- Experience with PyTorch autograd and gradient computation
- Access to an open-weight model (Llama, Mistral, or similar) with GPU resources
- Familiarity with tokenization schemes (BPE, SentencePiece)
- Understanding of cross-entropy loss and optimization fundamentals
Background
Adversarial suffixes are automatically discovered token sequences that, when appended to a harmful prompt, cause aligned language models to comply. Unlike hand-crafted jailbreaks that exploit semantic understanding, adversarial suffixes operate at the token level and are typically unintelligible to humans.
The foundational work by Zou et al. (2023) introduced the Greedy Coordinate Gradient (GCG) algorithm, which iteratively modifies suffix tokens to minimize the loss on a target completion (e.g., "Sure, here is how to...").
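In symbols (with $\oplus$ denoting concatenation and $\mathcal{V}$ the vocabulary), GCG searches over suffixes of fixed length $L$ for one that minimizes the negative log-likelihood of the target completion:

```latex
\min_{S \in \mathcal{V}^{L}} \; \mathcal{L}(S)
  = -\log p\left(T \mid P \oplus S\right)
  = -\sum_{j=1}^{|T|} \log p\left(t_j \mid P \oplus S \oplus t_{1:j-1}\right)
```

The difficulty is that $S$ is a sequence of discrete tokens, so this objective cannot be minimized by gradient descent directly.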
Lab Exercises
Understand the GCG Algorithm
The GCG algorithm works by computing gradients of the target loss with respect to one-hot token embeddings, then using those gradients to identify promising token substitutions.
```
# GCG Algorithm Pseudocode
#
# Input: harmful prompt P, target response T, suffix length L
# Output: adversarial suffix S that causes model to generate T given P + S
#
# 1. Initialize suffix S with random tokens of length L
# 2. For each optimization step:
#    a. Compute loss = -log P(T | P + S)
#    b. Compute gradient of loss w.r.t. one-hot token embeddings of S
#    c. For each position i in S:
#       - Use gradient to rank top-k candidate replacement tokens
#       - Evaluate loss for each candidate (forward pass only)
#    d. Select the single token substitution that reduces loss the most
#    e. Update S with the best substitution
# 3. Return S when loss falls below threshold or max steps reached
```
Key insight: the algorithm cannot use standard gradient descent because the token space is discrete. Instead, it uses gradients to efficiently search the discrete space by identifying which token substitutions are most promising.
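To make the one-hot gradient trick concrete, here is a toy sketch with entirely made-up numbers: a 4-token vocabulary, a 2-dimensional embedding matrix, and a squared-error "loss" standing in for the model. It also shows why step 2c evaluates candidates with forward passes instead of trusting the gradient: the linearization can rank a worse token first.

```python
# Toy sketch of GCG candidate selection (hypothetical numbers, not a
# real model): the suffix token is a one-hot vector e, its embedding is
# e @ W, and d(loss)/d(e[v]) linearizes how blending in token v's
# embedding row would change the loss.

W = [  # embedding matrix: 4-token vocabulary, 2-dim embeddings
    [1.0, 0.0],   # token 0
    [0.0, 1.0],   # token 1
    [0.9, 0.1],   # token 2
    [-1.0, 0.0],  # token 3
]
target = [0.9, 0.1]   # embedding the toy loss wants the suffix to produce
current = 3           # suffix currently holds token 3

def grad_wrt_onehot(token):
    """Gradient of ||one_hot(token) @ W - target||^2 w.r.t. each e[v]."""
    diff = [W[token][i] - target[i] for i in range(2)]
    return [2 * sum(diff[i] * W[v][i] for i in range(2)) for v in range(4)]

def true_loss(token):
    return sum((W[token][i] - target[i]) ** 2 for i in range(2))

grad = grad_wrt_onehot(current)
# Rank candidates by most-negative gradient (predicted loss reduction)
ranked = sorted(range(4), key=lambda v: grad[v])
print(ranked[0])   # gradient's top pick: token 0

# The linear approximation can mislead -- so, like GCG, evaluate the
# top-k candidates with a real forward pass and keep the actual best.
best = min(ranked[:2], key=true_loss)
print(best)        # token 2 actually minimizes the loss
```

This mirrors the real algorithm's two phases: a cheap gradient-based shortlist, then exact forward-pass evaluation of the shortlisted substitutions.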
Implement Gradient-Based Token Search
Build the core optimization loop that computes gradients and identifies candidate token replacements.
```python
#!/usr/bin/env python3
"""Implement GCG-style adversarial suffix optimization."""

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


class GCGAttack:
    def __init__(self, model_name: str, device: str = "cuda"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map=device,
        )
        self.model.eval()
        # Freeze parameters: we only need gradients w.r.t. the one-hot
        # suffix representation, not the model weights
        self.model.requires_grad_(False)
        self.device = device
        self.embed_layer = self.model.get_input_embeddings()
        # Size the one-hot over the embedding matrix rows so shapes match
        # even when the embedding is padded beyond tokenizer.vocab_size
        self.vocab_size = self.embed_layer.weight.shape[0]

    def compute_loss(self, input_ids: torch.Tensor,
                     target_ids: torch.Tensor) -> torch.Tensor:
        """Compute negative log-likelihood of target given input."""
        full_ids = torch.cat([input_ids, target_ids], dim=-1)
        with torch.no_grad():
            outputs = self.model(full_ids)
        # Logits at the positions that predict each target token
        logits = outputs.logits[0, input_ids.shape[1] - 1:-1, :]
        return F.cross_entropy(logits, target_ids[0], reduction="mean")

    def token_gradients(self, input_ids: torch.Tensor, suffix_slice: slice,
                        target_ids: torch.Tensor) -> torch.Tensor:
        """Compute gradients of loss w.r.t. suffix token embeddings."""
        embed_weights = self.embed_layer.weight
        # Create a differentiable one-hot representation for suffix tokens,
        # matching the embedding dtype so fp16 models work
        suffix_ids = input_ids[0, suffix_slice]
        one_hot = F.one_hot(suffix_ids, self.vocab_size).to(embed_weights.dtype)
        one_hot.requires_grad_(True)
        # Compute embeddings via one-hot @ embedding_matrix
        suffix_embeds = one_hot @ embed_weights
        # Build full embedding sequence
        prefix_embeds = self.embed_layer(input_ids[:, :suffix_slice.start])
        postfix_embeds = self.embed_layer(input_ids[:, suffix_slice.stop:])
        target_embeds = self.embed_layer(target_ids)
        full_embeds = torch.cat([
            prefix_embeds,
            suffix_embeds.unsqueeze(0),
            postfix_embeds,
            target_embeds,
        ], dim=1)
        outputs = self.model(inputs_embeds=full_embeds)
        target_start = input_ids.shape[1]
        logits = outputs.logits[0, target_start - 1:-1, :]
        loss = F.cross_entropy(logits, target_ids[0])
        loss.backward()
        return one_hot.grad.clone()

    def sample_candidates(self, gradients: torch.Tensor,
                          current_ids: torch.Tensor,
                          top_k: int = 256,
                          batch_size: int = 512) -> torch.Tensor:
        """Sample candidate suffix replacements using gradient info."""
        # Negative gradient points toward tokens that reduce loss
        top_indices = (-gradients).topk(top_k, dim=-1).indices
        # Generate random single-token substitutions
        n_positions = gradients.shape[0]
        candidates = current_ids.unsqueeze(0).repeat(batch_size, 1)
        for i in range(batch_size):
            pos = torch.randint(0, n_positions, (1,)).item()
            token_idx = torch.randint(0, top_k, (1,)).item()
            candidates[i, pos] = top_indices[pos, token_idx]
        return candidates

    def optimize(self, prompt: str, target: str,
                 suffix_length: int = 20, n_steps: int = 500,
                 top_k: int = 256, batch_size: int = 512) -> dict:
        """Run GCG optimization to find an adversarial suffix."""
        prompt_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(
            self.device
        )
        target_ids = self.tokenizer.encode(
            target, add_special_tokens=False, return_tensors="pt"
        ).to(self.device)

        # Initialize random suffix
        suffix_ids = torch.randint(
            0, self.vocab_size, (suffix_length,), device=self.device
        )
        suffix_start = prompt_ids.shape[1]
        best_loss = float("inf")
        best_suffix = suffix_ids.clone()

        for step in range(n_steps):
            # Build current input
            input_ids = torch.cat([
                prompt_ids, suffix_ids.unsqueeze(0)
            ], dim=-1)
            suffix_slice = slice(suffix_start, suffix_start + suffix_length)

            # Compute gradients
            grads = self.token_gradients(input_ids, suffix_slice, target_ids)

            # Sample and evaluate candidates
            # (one forward pass each; real implementations batch these)
            candidates = self.sample_candidates(
                grads, suffix_ids, top_k, batch_size
            )
            losses = []
            for candidate in candidates:
                cand_input = torch.cat([
                    prompt_ids, candidate.unsqueeze(0)
                ], dim=-1)
                loss = self.compute_loss(cand_input, target_ids)
                losses.append(loss.item())

            # Select best candidate
            best_idx = min(range(len(losses)), key=lambda i: losses[i])
            if losses[best_idx] < best_loss:
                best_loss = losses[best_idx]
                best_suffix = candidates[best_idx].clone()
            suffix_ids = best_suffix.clone()

            if step % 50 == 0:
                decoded = self.tokenizer.decode(best_suffix)
                print(f"Step {step}: loss={best_loss:.4f} "
                      f"suffix='{decoded[:50]}...'")

        return {
            "suffix_ids": best_suffix,
            "suffix_text": self.tokenizer.decode(best_suffix),
            "final_loss": best_loss,
            "steps": n_steps,
        }
```
Evaluate Suffix Transferability
A critical research question is whether suffixes optimized against one model transfer to other models. Transferability largely determines the practical threat level: an attacker can optimize against an open-weight model and then deploy the suffix against a closed model.
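Once per-model outcomes are collected (a dict mapping model id to a record with a `starts_with_target` flag, as the harness below produces), a small helper can summarize transfer overall and by model family. This is a hedged sketch: the family heuristic (first hyphen-separated token of the model name) is an illustrative assumption, not a standard.

```python
def summarize_transfer(results: dict) -> dict:
    """Aggregate per-model transfer outcomes overall and by model family.

    `results` maps model ids (e.g. "meta-llama/Llama-2-7b-chat-hf") to
    records containing a boolean "starts_with_target". The family
    heuristic here is a rough illustration only.
    """
    def family(model_id: str) -> str:
        name = model_id.split("/")[-1]
        return name.split("-")[0].lower()

    overall = sum(r["starts_with_target"] for r in results.values())
    by_family = {}
    for model_id, rec in results.items():
        by_family.setdefault(family(model_id), []).append(
            rec["starts_with_target"]
        )
    return {
        "overall_rate": overall / len(results),
        "family_rates": {
            fam: sum(hits) / len(hits) for fam, hits in by_family.items()
        },
    }

# Example with made-up outcomes:
demo = {
    "meta-llama/Llama-2-7b-chat-hf": {"starts_with_target": True},
    "meta-llama/Llama-2-13b-chat-hf": {"starts_with_target": True},
    "mistralai/Mistral-7B-Instruct-v0.2": {"starts_with_target": False},
}
summary = summarize_transfer(demo)
print(summary["overall_rate"])           # 2 of 3 models complied
print(summary["family_rates"]["llama"])  # within-family rate: 1.0
```

Splitting rates by family makes the within-family vs. cross-family gap (factor 1 below) directly measurable.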
```python
def evaluate_transferability(suffix: str, models: list[str],
                             prompt: str, target: str) -> dict:
    """Test whether an adversarial suffix transfers across models."""
    results = {}
    full_prompt = prompt + " " + suffix
    for model_name in models:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        input_ids = tokenizer.encode(full_prompt, return_tensors="pt")
        with torch.no_grad():
            output = model.generate(
                input_ids.to(model.device),
                max_new_tokens=100,
                do_sample=False,  # greedy decoding for reproducibility
            )
        response = tokenizer.decode(
            output[0][input_ids.shape[1]:], skip_special_tokens=True
        )
        # Check if response starts with target or engages with request
        starts_with_target = response.strip().startswith(target.strip()[:20])
        results[model_name] = {
            "response_preview": response[:200],
            "starts_with_target": starts_with_target,
            "response_length": len(response),
        }
    return results


# Transferability factors to analyze:
# 1. Model family (within-family transfer is stronger)
# 2. Model size (larger models may be more/less susceptible)
# 3. Training data overlap (shared training data increases transfer)
# 4. Safety training method (RLHF vs DPO vs constitutional AI)
```
Implement and Evaluate Defenses
Test defensive countermeasures against adversarial suffixes.
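The perplexity filter below hinges on one quantity: perplexity = exp(mean negative log-likelihood). A toy character-level sketch (a Laplace-smoothed unigram model standing in for the LLM; corpus and strings are illustrative) shows why GCG-style gibberish scores high:

```python
import math
from collections import Counter

# Toy character-level language model: Laplace-smoothed unigram counts
# from a tiny corpus (an illustrative stand-in for an LLM)
corpus = "the quick brown fox jumps over the lazy dog and runs away"
counts = Counter(corpus)
vocab = set(counts) | set("}~|{%3")  # include some never-seen symbols
total = len(corpus)

def char_perplexity(text: str) -> float:
    """exp(mean negative log-likelihood) -- the quantity a filter thresholds."""
    nll = 0.0
    for ch in text:
        p = (counts.get(ch, 0) + 1) / (total + len(vocab))
        nll -= math.log(p)
    return math.exp(nll / len(text))

natural = char_perplexity("the lazy dog runs")
gibberish = char_perplexity("}~|{%3}~|{%3")
print(natural < gibberish)  # True: unseen symbols inflate perplexity
```

The same logic, with a real LLM supplying the token probabilities, is what `PerplexityFilter.compute_perplexity` implements below.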
```python
#!/usr/bin/env python3
"""Defensive countermeasures against adversarial suffix attacks."""

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class PerplexityFilter:
    """Detect adversarial suffixes by their high perplexity."""

    def __init__(self, model_name: str, threshold: float = 100.0):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.threshold = threshold

    def compute_perplexity(self, text: str) -> float:
        input_ids = self.tokenizer.encode(text, return_tensors="pt").to(
            self.model.device
        )
        with torch.no_grad():
            outputs = self.model(input_ids, labels=input_ids)
        return math.exp(outputs.loss.item())

    def is_adversarial(self, text: str) -> tuple[bool, float]:
        ppl = self.compute_perplexity(text)
        return ppl > self.threshold, ppl


class InputPreprocessor:
    """Preprocess inputs to neutralize adversarial suffixes."""

    def __init__(self, tokenizer_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def retokenize_and_filter(self, text: str,
                              max_unknown_ratio: float = 0.2) -> str:
        """Filter text with a high ratio of unusual tokens."""
        tokens = self.tokenizer.tokenize(text)
        unk_count = sum(
            1 for t in tokens
            if t.startswith("<") or (len(t) == 1 and not t.isalnum())
        )
        if len(tokens) > 0 and unk_count / len(tokens) > max_unknown_ratio:
            return "[INPUT FILTERED: unusual token distribution]"
        return text

    def truncate_suffix(self, text: str, max_tokens: int = 200) -> str:
        """Truncate excessively long inputs that may contain suffixes."""
        tokens = self.tokenizer.encode(text)
        if len(tokens) > max_tokens:
            return self.tokenizer.decode(tokens[:max_tokens])
        return text

    def paraphrase_defense(self, text: str) -> str:
        """Paraphrase input to destroy adversarial token sequences.

        In production, use a separate LLM to paraphrase; rewording
        destroys the precise token sequence the suffix depends on.
        """
        raise NotImplementedError(
            "Implement with your paraphrasing model of choice"
        )


# Evaluate defense effectiveness
defense_matrix = {
    "perplexity_filter": {
        "mechanism": "Flag inputs with perplexity above threshold",
        "strengths": "Simple, fast, catches most GCG suffixes",
        "weaknesses": "Can be evaded with low-perplexity optimization",
        "false_positive_risk": "Technical jargon, non-English text",
    },
    "input_paraphrasing": {
        "mechanism": "Paraphrase input to destroy adversarial tokens",
        "strengths": "Robust against token-level attacks",
        "weaknesses": "Adds latency, may alter legitimate input meaning",
        "false_positive_risk": "Low (preserves semantics)",
    },
    "retokenization": {
        "mechanism": "Re-encode input with different tokenizer",
        "strengths": "Disrupts tokenizer-specific suffix optimization",
        "weaknesses": "Only helps if attack was optimized for a specific tokenizer",
        "false_positive_risk": "Moderate (may alter edge-case inputs)",
    },
}
```
Advanced: Multi-Model Ensemble Optimization
Optimize suffixes against multiple models simultaneously for stronger transferability.
```python
def ensemble_gcg(models: list[str], prompt: str, target: str,
                 suffix_length: int = 20, n_steps: int = 500) -> str:
    """Optimize a suffix against multiple models for transferability."""
    # Key insight: average gradients across models before selecting
    # candidate substitutions. This finds tokens that reduce loss
    # across all models simultaneously.
    # Note: gradient averaging assumes the models share a tokenizer
    # and vocabulary, so the one-hot gradients are comparable.
    #
    # Algorithm:
    # 1. Initialize shared suffix
    # 2. For each step:
    #    a. Compute gradients from each model
    #    b. Average gradients across models
    #    c. Sample candidates using averaged gradients
    #    d. Evaluate candidates against all models (sum losses)
    #    e. Select candidate with lowest total loss
    # 3. Return suffix that minimizes average loss across models
    #
    # This produces suffixes with higher transfer rates because
    # they exploit features shared across model architectures
    # rather than model-specific artifacts.
    pass
```
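The gradient-averaging step (2b) can be illustrated with made-up per-model gradient rows for a single suffix position; in practice these rows would come from running `token_gradients` on each model:

```python
# Hypothetical gradients d(loss)/d(e[v]) over a 4-token vocabulary at
# one suffix position, one row per model (illustrative numbers only)
grads_model_a = [-3.0,  1.0, -0.5,  2.0]
grads_model_b = [ 2.5, -0.4, -1.0,  1.5]

# Step 2b: average across models before ranking candidates
avg = [(a + b) / 2 for a, b in zip(grads_model_a, grads_model_b)]

pick_a_only = min(range(4), key=lambda v: grads_model_a[v])
pick_shared = min(range(4), key=lambda v: avg[v])
print(pick_a_only, pick_shared)
# Token 0 looks best to model A alone but would increase model B's
# loss; averaging surfaces token 2, which helps both models -- the
# mechanism behind better cross-model transfer.
```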
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce batch_size for candidate evaluation, use gradient checkpointing, or use a smaller model |
| Loss plateaus and does not decrease | Increase top_k to explore more candidates, restart with different random initialization |
| Suffix transfers within family but not across | This is expected -- reliable cross-family transfer typically requires ensemble optimization |
| Perplexity filter catches all suffixes | Research low-perplexity suffix optimization (add perplexity regularization to the loss function) |
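For the last row, one hedged formulation of perplexity regularization (illustrating the idea, not any specific paper's exact objective) augments the GCG loss with a fluency term weighted by $\lambda$:

```latex
\mathcal{L}(S) = -\log p\left(T \mid P \oplus S\right)
  \;-\; \lambda \sum_{i=1}^{L} \log p\left(s_i \mid P \oplus s_{1:i-1}\right)
```

The second term rewards suffixes the model itself finds likely, which keeps their perplexity low enough to slip under a filter's threshold, at the cost of slower convergence on the first term.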
Related Topics
- Novel Jailbreak Research - Broader research methodology for discovering attack vectors
- Token Smuggling - Token-level manipulation techniques
- PAIR Attack - Alternative automated jailbreak approach using LLM-as-optimizer
- Alignment Stress Testing - Testing alignment robustness at scale
References
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - The original GCG paper introducing greedy coordinate gradient optimization for adversarial suffixes
- "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" - Liu et al. (2023) - Readable adversarial suffix generation using genetic algorithms
- "Baseline Defenses for Adversarial Attacks Against Aligned Language Models" - Jain et al. (2023) - Perplexity filtering and other defenses against adversarial suffixes
- "Improved Techniques for Optimization-Based Jailbreaking on Large Language Models" - Jia et al. (2024) - Advanced GCG variants with improved convergence and transferability
Why does the GCG algorithm use gradient information on one-hot token embeddings rather than standard gradient descent?