Lab: Adversarial ML From Scratch
A hands-on expert lab: implement gradient-based adversarial attacks against language models from scratch, without relying on attack frameworks, to build intuition for how adversarial perturbations exploit model gradients.
Prerequisites
- Strong understanding of neural network backpropagation and gradient descent
- Experience with PyTorch tensor operations and autograd
- Familiarity with transformer architecture (attention, embeddings, tokenization)
- Access to a GPU for gradient computation (Colab GPU or local CUDA GPU)
- Completion of Adversarial Suffixes or Expert Adversarial Suffix
```
pip install torch transformers numpy
```
Background
Gradient-based adversarial attacks against language models work by computing the gradient of the model's loss with respect to input token embeddings, then using that gradient to identify token substitutions that push the model toward a target behavior. Unlike prompt engineering, which relies on human intuition, gradient attacks use the model's own mathematical structure to find optimal attack inputs.
Lab Exercises
Understand the Mathematical Foundation
Before writing code, understand the optimization problem you are solving.
Given:
- A language model M with parameters θ
- An input prompt x = [x₁, x₂, ..., xₙ]
- A target output y = [y₁, y₂, ..., yₘ]
- An adversarial suffix s = [s₁, s₂, ..., sₖ]

Goal: Find s* = argmin_s L(M(x ⊕ s; θ), y)

Where:
- L is the cross-entropy loss
- x ⊕ s is the concatenation of prompt and suffix
- The optimization is over discrete tokens (not continuous embeddings)

Approach (GCG):
1. Initialize s randomly
2. Compute ∇_e L where e is the one-hot encoding of s
3. For each position in s, find the top-k tokens with the largest negative gradient
4. Try each candidate substitution and keep the best
5. Repeat until the loss is below a threshold or the iteration budget is exhausted

This is a discrete optimization problem because tokens are discrete, not continuous. The gradient provides a continuous approximation that guides the discrete search.
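Steps 2–4 above can be seen in isolation on a toy problem: with a frozen embedding matrix and a linear stand-in for the model, the gradient of the loss with respect to a one-hot token vector yields one score per vocabulary entry, and the most negative entries are the candidate substitutions. A minimal sketch with made-up tensors (`E`, `w`, and the target are arbitrary, not from any real model):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim = 10, 4
E = torch.randn(vocab, dim)      # frozen embedding matrix (stand-in for the LM's)
w = torch.randn(dim)             # toy "model": score(token) = embedding . w
target = torch.tensor(5.0)       # scalar output the toy model should produce

tok = torch.tensor(3)            # current discrete token at this position
one_hot = F.one_hot(tok, vocab).float().requires_grad_(True)

score = (one_hot @ E) @ w        # differentiable relaxation of the embedding lookup
loss = (score - target) ** 2
loss.backward()

# GCG step 3: the most negative gradient entries are the substitutions the
# linearization predicts will lower the loss the most.
candidates = (-one_hot.grad).topk(3).indices

# GCG step 4: score each discrete candidate exactly instead of trusting
# the linear approximation.
exact_losses = {int(j): float(((E[j] @ w) - target) ** 2) for j in candidates}
print("candidates:", candidates.tolist())
print("exact losses:", exact_losses)
```

Because the gradient is only a first-order approximation of a discrete swap, the top-scored candidates are promising but not guaranteed best, which is why GCG re-evaluates each candidate exactly rather than trusting the gradient alone.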
Implement Token Gradient Computation
Build the core gradient computation that drives the adversarial search.
```python
#!/usr/bin/env python3
"""Core adversarial ML components built from scratch."""

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_name: str = "gpt2"):
    """Load a model and tokenizer for adversarial optimization."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer


def compute_token_gradients(
    model,
    tokenizer,
    input_ids: torch.Tensor,
    target_ids: torch.Tensor,
    suffix_start: int,
    suffix_length: int,
) -> torch.Tensor:
    """Compute gradients of the loss with respect to suffix token one-hot encodings.

    Returns:
        Gradient tensor of shape (suffix_length, vocab_size)
    """
    vocab_size = model.config.vocab_size
    embedding_layer = model.get_input_embeddings()
    embedding_matrix = embedding_layer.weight.detach()  # (vocab_size, embed_dim)

    # Get current suffix token ids
    suffix_ids = input_ids[0, suffix_start:suffix_start + suffix_length]

    # Create one-hot representations for suffix tokens (the only part with grad)
    one_hot = F.one_hot(suffix_ids, num_classes=vocab_size).float()
    one_hot.requires_grad_(True)

    # Differentiable embedding lookup: one_hot @ embedding_matrix
    suffix_embeds = one_hot @ embedding_matrix  # (suffix_length, embed_dim)

    # Get embeddings for non-suffix tokens (no grad needed)
    prefix_embeds = embedding_layer(input_ids[0, :suffix_start]).detach()
    postfix_embeds = embedding_layer(input_ids[0, suffix_start + suffix_length:]).detach()

    # Append the target tokens so every target position is teacher-forced;
    # without this, only the first target token would actually be scored.
    target_embeds = embedding_layer(target_ids).detach()

    # Concatenate all embeddings
    full_embeds = torch.cat([
        prefix_embeds,
        suffix_embeds,
        postfix_embeds,
        target_embeds,
    ], dim=0).unsqueeze(0)  # (1, input_len + target_len, embed_dim)

    # Forward pass on embeddings instead of token ids
    outputs = model(inputs_embeds=full_embeds)
    logits = outputs.logits  # (1, input_len + target_len, vocab_size)

    # The logit at position i predicts token i + 1, so the logits that score
    # the target tokens start one position before the target span.
    input_len = input_ids.shape[1]
    target_logits = logits[0, input_len - 1:input_len + len(target_ids) - 1]
    loss = F.cross_entropy(target_logits, target_ids)

    # Backward pass populates one_hot.grad
    loss.backward()

    # Return gradient with respect to the one-hot encoding
    return one_hot.grad.detach()


def get_top_k_substitutions(
    gradients: torch.Tensor,
    current_ids: torch.Tensor,
    top_k: int = 256,
) -> list[list[int]]:
    """For each suffix position, find the top-k tokens with most negative gradient.

    Most negative gradient = tokens that most decrease the loss = best substitutions.
    """
    candidates = []
    for pos in range(gradients.shape[0]):
        # Negative gradient = direction that decreases loss
        neg_grad = -gradients[pos]
        # Get top-k token indices
        top_tokens = neg_grad.topk(top_k).indices.tolist()
        candidates.append(top_tokens)
    return candidates


if __name__ == "__main__":
    print("Loading model...")
    model, tokenizer = load_model("gpt2")
    print(f"Model: gpt2, Vocab size: {model.config.vocab_size}")

    # Test gradient computation
    prompt = "Tell me how to"
    target = " make a sandwich"
    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_ids = tokenizer.encode(target, return_tensors="pt")[0]

    # Add a random suffix
    suffix_length = 5
    suffix_ids = torch.randint(0, model.config.vocab_size, (1, suffix_length))
    input_ids = torch.cat([prompt_ids, suffix_ids], dim=1)

    print(f"Input: {tokenizer.decode(input_ids[0])}")
    print(f"Target: {target}")
    print(f"Suffix length: {suffix_length}")

    grads = compute_token_gradients(
        model, tokenizer, input_ids, target_ids,
        suffix_start=prompt_ids.shape[1],
        suffix_length=suffix_length,
    )
    print(f"Gradient shape: {grads.shape}")
    print(f"Gradient magnitude: {grads.abs().mean():.6f}")

    # Get top substitution candidates
    candidates = get_top_k_substitutions(grads, suffix_ids[0])
    print("\nTop candidates for position 0:")
    for tok_id in candidates[0][:5]:
        print(f"  {tok_id}: '{tokenizer.decode([tok_id])}'")
```

```
python adversarial_core.py
```
Implement the GCG Attack Loop
Build the iterative optimization loop that generates adversarial suffixes.
```python
#!/usr/bin/env python3
"""Greedy Coordinate Gradient (GCG) attack implementation from scratch."""

import numpy as np
import torch
import torch.nn.functional as F

from adversarial_core import load_model, compute_token_gradients, get_top_k_substitutions


def gcg_attack(
    model,
    tokenizer,
    prompt: str,
    target: str,
    suffix_length: int = 20,
    num_iterations: int = 100,
    top_k: int = 256,
    batch_size: int = 64,
    seed: int = 42,
) -> dict:
    """Execute the GCG attack to find an adversarial suffix.

    The suffix is optimized to make the model generate the target text
    after the prompt + suffix.
    """
    torch.manual_seed(seed)
    np.random.seed(seed)

    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_ids = torch.tensor(tokenizer.encode(target))
    prompt_len = prompt_ids.shape[1]

    # Initialize suffix with random tokens
    suffix_ids = torch.randint(0, model.config.vocab_size, (suffix_length,))
    best_loss = float("inf")
    best_suffix = suffix_ids.clone()

    for iteration in range(num_iterations):
        # Build full input
        input_ids = torch.cat([prompt_ids, suffix_ids.unsqueeze(0)], dim=1)

        # Compute gradients
        grads = compute_token_gradients(
            model, tokenizer, input_ids, target_ids,
            suffix_start=prompt_len, suffix_length=suffix_length,
        )

        # Get candidate substitutions
        candidates = get_top_k_substitutions(grads, suffix_ids, top_k=top_k)

        # Random search: try random substitutions from candidates
        best_candidate_loss = float("inf")
        best_candidate_suffix = suffix_ids.clone()

        for _ in range(batch_size):
            # Pick a random position and a random candidate for that position
            pos = np.random.randint(suffix_length)
            new_token = candidates[pos][np.random.randint(min(top_k, len(candidates[pos])))]

            # Create candidate suffix
            candidate = suffix_ids.clone()
            candidate[pos] = new_token

            # Evaluate with the target appended so every target token is
            # teacher-forced; the logit at position i predicts token i + 1.
            eval_input = torch.cat(
                [prompt_ids, candidate.unsqueeze(0), target_ids.unsqueeze(0)], dim=1
            )
            with torch.no_grad():
                logits = model(eval_input).logits[0, -len(target_ids) - 1:-1]
            loss = F.cross_entropy(logits, target_ids).item()

            if loss < best_candidate_loss:
                best_candidate_loss = loss
                best_candidate_suffix = candidate.clone()

        # Update suffix with best candidate
        if best_candidate_loss < best_loss:
            best_loss = best_candidate_loss
            best_suffix = best_candidate_suffix.clone()
        suffix_ids = best_candidate_suffix.clone()

        # Check if the attack succeeded
        full_input = torch.cat([prompt_ids, suffix_ids.unsqueeze(0)], dim=1)
        with torch.no_grad():
            generated = model.generate(
                full_input,
                max_new_tokens=len(target_ids) + 5,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
        generated_text = tokenizer.decode(generated[0][full_input.shape[1]:])
        suffix_text = tokenizer.decode(suffix_ids)

        if iteration % 10 == 0:
            print(f"Iter {iteration:>3}: loss={best_loss:.4f} "
                  f"suffix='{suffix_text[:40]}...' "
                  f"output='{generated_text[:40]}...'")

        if target.strip() in generated_text:
            print(f"\nAttack succeeded at iteration {iteration}!")
            return {
                "success": True,
                "iterations": iteration,
                "loss": best_loss,
                "suffix": suffix_text,
                "suffix_ids": suffix_ids.tolist(),
                "generated": generated_text,
            }

    return {
        "success": False,
        "iterations": num_iterations,
        "loss": best_loss,
        "suffix": tokenizer.decode(best_suffix),
        "suffix_ids": best_suffix.tolist(),
        "generated": generated_text,
    }


if __name__ == "__main__":
    print("Loading model (this may take a moment)...")
    model, tokenizer = load_model("gpt2")

    print("\n=== GCG Attack ===")
    print("Finding adversarial suffix to control model output...\n")

    result = gcg_attack(
        model, tokenizer,
        prompt="The best way to",
        target=" make a delicious cake is",
        suffix_length=15,
        num_iterations=50,
        batch_size=32,
    )

    print(f"\nResult: {'SUCCESS' if result['success'] else 'PARTIAL'}")
    print(f"Iterations: {result['iterations']}")
    print(f"Final loss: {result['loss']:.4f}")
    print(f"Suffix: '{result['suffix']}'")
    print(f"Generated: '{result['generated']}'")
```

```
python gcg_attack.py
```
Analyze Gradient Properties
Study what the gradients reveal about model vulnerabilities.
```python
#!/usr/bin/env python3
"""Analyze gradient properties to understand attack mechanics."""

import torch

from adversarial_core import load_model, compute_token_gradients

model, tokenizer = load_model("gpt2")

prompt = "Tell me about"
target = " making chocolate"
prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
target_ids = torch.tensor(tokenizer.encode(target))

suffix_length = 10
suffix_ids = torch.randint(0, model.config.vocab_size, (1, suffix_length))
input_ids = torch.cat([prompt_ids, suffix_ids], dim=1)

grads = compute_token_gradients(
    model, tokenizer, input_ids, target_ids,
    suffix_start=prompt_ids.shape[1], suffix_length=suffix_length,
)

# Analyze gradient magnitude by position
print("Gradient magnitude by suffix position:")
for pos in range(suffix_length):
    mag = grads[pos].abs().mean().item()
    bar = "#" * int(mag * 1000)
    print(f"  Position {pos}: {mag:.6f} {bar}")

# Positions with larger gradients are more influential for the attack
print("\nPositions with largest gradients have the most influence on the target output.")
print("The GCG attack focuses substitution efforts on high-gradient positions.")
```
Test Transfer to Other Models
Evaluate whether suffixes optimized on one model transfer to another.
```python
# Transfer testing protocol:
# 1. Optimize suffix on GPT-2 (source model)
# 2. Test the same suffix on GPT-2-medium, GPT-2-large (same family)
# 3. Test on a different architecture if available
# 4. Measure success rate and compare with random suffixes as baseline

# Key insight: suffixes that exploit shared architectural features
# (attention patterns, positional encoding) transfer better than
# those that exploit model-specific quirks.
```
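The protocol above can be sketched as a small harness. This is a sketch under stated assumptions: `transfer_loss` and `transfer_report` are hypothetical helper names introduced here, and reusing raw suffix token ids only makes sense when the source and destination models share a tokenizer (as the GPT-2 family does); across different tokenizers you would re-encode the decoded suffix string instead.

```python
import torch
import torch.nn.functional as F


def transfer_loss(model, tokenizer, prompt: str, suffix_ids: list[int], target: str) -> float:
    """Teacher-forced cross-entropy of `target` given prompt + suffix.

    Lower loss on the destination model suggests better transfer. Assumes the
    suffix token ids are valid for this model's tokenizer (shared vocabulary).
    """
    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_ids = torch.tensor(tokenizer.encode(target))
    full = torch.cat(
        [prompt_ids, torch.tensor([suffix_ids]), target_ids.unsqueeze(0)], dim=1
    )
    with torch.no_grad():
        logits = model(full).logits  # (1, seq_len, vocab_size)
    # The logit at position i predicts token i + 1, so these positions score the target.
    return F.cross_entropy(logits[0, -len(target_ids) - 1:-1], target_ids).item()


def transfer_report(models, tokenizer, prompt, optimized_suffix, target,
                    vocab_size, n_random=5):
    """Compare the optimized suffix against random-suffix baselines on each model."""
    report = {}
    for name, model in models.items():
        opt = transfer_loss(model, tokenizer, prompt, optimized_suffix, target)
        rand = [
            transfer_loss(
                model, tokenizer, prompt,
                torch.randint(0, vocab_size, (len(optimized_suffix),)).tolist(),
                target,
            )
            for _ in range(n_random)
        ]
        report[name] = {"optimized": opt, "random_mean": sum(rand) / len(rand)}
    return report
```

A suffix that transfers should beat the random baseline by a clear margin on the destination model, not just on the source model where it was optimized.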
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce batch_size and suffix_length; use gradient checkpointing or smaller models |
| Loss does not decrease | Increase top_k or batch_size; try different random seeds |
| Attack succeeds but suffix is gibberish | This is expected -- adversarial suffixes are not human-readable; the optimization finds token sequences that exploit the model's learned patterns |
| Gradients are all near zero | Check that the loss computation is correct and the target tokens are positioned correctly relative to the input |
Why This Matters
Related Topics
- Adversarial Suffixes - Using existing adversarial suffix frameworks
- Expert Adversarial Suffix - Advanced suffix optimization techniques
- Transfer Attack Development - Cross-model attack transfer
References
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - The original GCG paper
- "AutoDAN: Generating Stealthy Jailbreak Prompts" - Liu et al. (2023) - Readable adversarial prompt generation
- "Adversarial Attacks on Large Language Models" - Survey paper covering gradient and non-gradient methods
Why does the GCG attack use random search among gradient-selected candidates rather than simply choosing the highest-gradient token?
Why do adversarial suffixes appear as gibberish to humans?