Lab: Adversarial ML From Scratch
A hands-on expert lab: implement gradient-based adversarial attacks against language models from scratch, without relying on attack frameworks, to build intuition for how adversarial perturbations exploit model gradients.
Prerequisites
- Strong understanding of neural network backpropagation and gradient descent
- Experience with PyTorch tensor operations and autograd
- Familiarity with transformer architecture (attention, embeddings, tokenization)
- Access to a GPU for gradient computation (Colab GPU or local CUDA GPU)
- Completion of Adversarial Suffixes or Expert Adversarial Suffix
```
pip install torch transformers numpy
```
Background
Gradient-based adversarial attacks against language models work by computing the gradient of the model's loss with respect to input token embeddings, then using that gradient to identify token substitutions that push the model toward a target behavior. Unlike prompt engineering, which relies on human intuition, gradient attacks use the model's own mathematical structure to find optimal attack inputs.
Lab Exercises
Understand the Mathematical Foundation
Before writing code, understand the optimization problem you are solving.
Given:
- A language model M with parameters θ
- An input prompt x = [x₁, x₂, ..., xₙ]
- A target output y = [y₁, y₂, ..., yₘ]
- An adversarial suffix s = [s₁, s₂, ..., sₖ]

Goal: Find s* = argmin_s L(M(x ⊕ s; θ), y)

Where:
- L is the cross-entropy loss
- x ⊕ s is the concatenation of prompt and suffix
- The optimization is over discrete tokens (not continuous embeddings)

Approach (GCG):
1. Initialize s randomly
2. Compute ∇_e L where e is the one-hot encoding of s
3. For each position in s, find the top-k tokens with the largest negative gradient
4. Try each candidate substitution and keep the best
5. Repeat until the loss is below a threshold or the iteration budget is exhausted

This is a discrete optimization problem because tokens are discrete, not continuous. The gradient provides a continuous approximation that guides the discrete search.
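Steps 2–4 above can be seen in isolation on a toy problem: with a frozen embedding matrix and a linear stand-in for the model, the gradient of the loss with respect to a one-hot token vector yields one score per vocabulary entry, and the most negative entries are the candidate substitutions. A minimal sketch with made-up tensors (`E`, `w`, and the target are arbitrary, not from any real model):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim = 10, 4
E = torch.randn(vocab, dim)      # frozen embedding matrix (stand-in for the LM's)
w = torch.randn(dim)             # toy "model": score(token) = embedding . w
target = torch.tensor(5.0)       # scalar output the toy model should produce

tok = torch.tensor(3)            # current discrete token at this position
one_hot = F.one_hot(tok, vocab).float().requires_grad_(True)

score = (one_hot @ E) @ w        # differentiable relaxation of the embedding lookup
loss = (score - target) ** 2
loss.backward()

# GCG step 3: the most negative gradient entries are the substitutions the
# linearization predicts will lower the loss the most.
candidates = (-one_hot.grad).topk(3).indices

# GCG step 4: score each discrete candidate exactly instead of trusting
# the linear approximation.
exact_losses = {int(j): float(((E[j] @ w) - target) ** 2) for j in candidates}
print("candidates:", candidates.tolist())
print("exact losses:", exact_losses)
```

Because the gradient is only a first-order approximation of a discrete swap, the top-scored candidates are promising but not guaranteed best, which is why GCG re-evaluates each candidate exactly rather than trusting the gradient alone.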
Implement Token Gradient Computation
Build the core gradient computation that drives the adversarial search.
```python
#!/usr/bin/env python3
"""Core adversarial ML components built from scratch."""

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_name: str = "gpt2"):
    """Load a model and tokenizer for adversarial optimization."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer


def compute_token_gradients(
    model,
    tokenizer,
    input_ids: torch.Tensor,
    target_ids: torch.Tensor,
    suffix_start: int,
    suffix_length: int,
) -> torch.Tensor:
    """Compute gradients of the loss with respect to suffix token one-hot encodings.

    Returns:
        Gradient tensor of shape (suffix_length, vocab_size)
    """
    vocab_size = model.config.vocab_size
    embedding_layer = model.get_input_embeddings()
    embedding_matrix = embedding_layer.weight.detach()  # (vocab_size, embed_dim)

    # Get current suffix token ids
    suffix_ids = input_ids[0, suffix_start:suffix_start + suffix_length]

    # Create one-hot representations for suffix tokens (the only part with grad)
    one_hot = F.one_hot(suffix_ids, num_classes=vocab_size).float()
    one_hot.requires_grad_(True)

    # Differentiable embedding lookup: one_hot @ embedding_matrix
    suffix_embeds = one_hot @ embedding_matrix  # (suffix_length, embed_dim)

    # Get embeddings for non-suffix tokens (no grad needed)
    prefix_embeds = embedding_layer(input_ids[0, :suffix_start]).detach()
    postfix_embeds = embedding_layer(input_ids[0, suffix_start + suffix_length:]).detach()

    # Append the target tokens so every target position is teacher-forced;
    # without this, only the first target token would actually be scored.
    target_embeds = embedding_layer(target_ids).detach()

    # Concatenate all embeddings
    full_embeds = torch.cat([
        prefix_embeds,
        suffix_embeds,
        postfix_embeds,
        target_embeds,
    ], dim=0).unsqueeze(0)  # (1, input_len + target_len, embed_dim)

    # Forward pass on embeddings instead of token ids
    outputs = model(inputs_embeds=full_embeds)
    logits = outputs.logits  # (1, input_len + target_len, vocab_size)

    # The logit at position i predicts token i + 1, so the logits that score
    # the target tokens start one position before the target span.
    input_len = input_ids.shape[1]
    target_logits = logits[0, input_len - 1:input_len + len(target_ids) - 1]
    loss = F.cross_entropy(target_logits, target_ids)

    # Backward pass populates one_hot.grad
    loss.backward()

    # Return gradient with respect to the one-hot encoding
    return one_hot.grad.detach()


def get_top_k_substitutions(
    gradients: torch.Tensor,
    current_ids: torch.Tensor,
    top_k: int = 256,
) -> list[list[int]]:
    """For each suffix position, find the top-k tokens with most negative gradient.

    Most negative gradient = tokens that most decrease the loss = best substitutions.
    """
    candidates = []
    for pos in range(gradients.shape[0]):
        # Negative gradient = direction that decreases loss
        neg_grad = -gradients[pos]
        # Get top-k token indices
        top_tokens = neg_grad.topk(top_k).indices.tolist()
        candidates.append(top_tokens)
    return candidates


if __name__ == "__main__":
    print("Loading model...")
    model, tokenizer = load_model("gpt2")
    print(f"Model: gpt2, Vocab size: {model.config.vocab_size}")

    # Test gradient computation
    prompt = "Tell me how to"
    target = " make a sandwich"
    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_ids = tokenizer.encode(target, return_tensors="pt")[0]

    # Add a random suffix
    suffix_length = 5
    suffix_ids = torch.randint(0, model.config.vocab_size, (1, suffix_length))
    input_ids = torch.cat([prompt_ids, suffix_ids], dim=1)

    print(f"Input: {tokenizer.decode(input_ids[0])}")
    print(f"Target: {target}")
    print(f"Suffix length: {suffix_length}")

    grads = compute_token_gradients(
        model, tokenizer, input_ids, target_ids,
        suffix_start=prompt_ids.shape[1],
        suffix_length=suffix_length,
    )
    print(f"Gradient shape: {grads.shape}")
    print(f"Gradient magnitude: {grads.abs().mean():.6f}")

    # Get top substitution candidates
    candidates = get_top_k_substitutions(grads, suffix_ids[0])
    print("\nTop candidates for position 0:")
    for tok_id in candidates[0][:5]:
        print(f"  {tok_id}: '{tokenizer.decode([tok_id])}'")
```

```
python adversarial_core.py
```
Implement the GCG Attack Loop
Build the iterative optimization loop that generates adversarial suffixes.
```python
#!/usr/bin/env python3
"""Greedy Coordinate Gradient (GCG) attack implementation from scratch."""

import numpy as np
import torch
import torch.nn.functional as F

from adversarial_core import load_model, compute_token_gradients, get_top_k_substitutions


def gcg_attack(
    model,
    tokenizer,
    prompt: str,
    target: str,
    suffix_length: int = 20,
    num_iterations: int = 100,
    top_k: int = 256,
    batch_size: int = 64,
    seed: int = 42,
) -> dict:
    """Execute the GCG attack to find an adversarial suffix.

    The suffix is optimized to make the model generate the target text
    after the prompt + suffix.
    """
    torch.manual_seed(seed)
    np.random.seed(seed)

    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_ids = torch.tensor(tokenizer.encode(target))
    prompt_len = prompt_ids.shape[1]

    # Initialize suffix with random tokens
    suffix_ids = torch.randint(0, model.config.vocab_size, (suffix_length,))
    best_loss = float("inf")
    best_suffix = suffix_ids.clone()

    for iteration in range(num_iterations):
        # Build full input
        input_ids = torch.cat([prompt_ids, suffix_ids.unsqueeze(0)], dim=1)

        # Compute gradients
        grads = compute_token_gradients(
            model, tokenizer, input_ids, target_ids,
            suffix_start=prompt_len, suffix_length=suffix_length,
        )

        # Get candidate substitutions
        candidates = get_top_k_substitutions(grads, suffix_ids, top_k=top_k)

        # Random search: try random substitutions from candidates
        best_candidate_loss = float("inf")
        best_candidate_suffix = suffix_ids.clone()

        for _ in range(batch_size):
            # Pick a random position and a random candidate for that position
            pos = np.random.randint(suffix_length)
            new_token = candidates[pos][np.random.randint(min(top_k, len(candidates[pos])))]

            # Create candidate suffix
            candidate = suffix_ids.clone()
            candidate[pos] = new_token

            # Evaluate with the target appended so every target token is
            # teacher-forced; the logit at position i predicts token i + 1.
            eval_input = torch.cat(
                [prompt_ids, candidate.unsqueeze(0), target_ids.unsqueeze(0)], dim=1
            )
            with torch.no_grad():
                logits = model(eval_input).logits[0, -len(target_ids) - 1:-1]
            loss = F.cross_entropy(logits, target_ids).item()

            if loss < best_candidate_loss:
                best_candidate_loss = loss
                best_candidate_suffix = candidate.clone()

        # Update suffix with best candidate
        if best_candidate_loss < best_loss:
            best_loss = best_candidate_loss
            best_suffix = best_candidate_suffix.clone()
        suffix_ids = best_candidate_suffix.clone()

        # Check if the attack succeeded
        full_input = torch.cat([prompt_ids, suffix_ids.unsqueeze(0)], dim=1)
        with torch.no_grad():
            generated = model.generate(
                full_input,
                max_new_tokens=len(target_ids) + 5,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
        generated_text = tokenizer.decode(generated[0][full_input.shape[1]:])
        suffix_text = tokenizer.decode(suffix_ids)

        if iteration % 10 == 0:
            print(f"Iter {iteration:>3}: loss={best_loss:.4f} "
                  f"suffix='{suffix_text[:40]}...' "
                  f"output='{generated_text[:40]}...'")

        if target.strip() in generated_text:
            print(f"\nAttack succeeded at iteration {iteration}!")
            return {
                "success": True,
                "iterations": iteration,
                "loss": best_loss,
                "suffix": suffix_text,
                "suffix_ids": suffix_ids.tolist(),
                "generated": generated_text,
            }

    return {
        "success": False,
        "iterations": num_iterations,
        "loss": best_loss,
        "suffix": tokenizer.decode(best_suffix),
        "suffix_ids": best_suffix.tolist(),
        "generated": generated_text,
    }


if __name__ == "__main__":
    print("Loading model (this may take a moment)...")
    model, tokenizer = load_model("gpt2")

    print("\n=== GCG Attack ===")
    print("Finding adversarial suffix to control model output...\n")

    result = gcg_attack(
        model, tokenizer,
        prompt="The best way to",
        target=" make a delicious cake is",
        suffix_length=15,
        num_iterations=50,
        batch_size=32,
    )

    print(f"\nResult: {'SUCCESS' if result['success'] else 'PARTIAL'}")
    print(f"Iterations: {result['iterations']}")
    print(f"Final loss: {result['loss']:.4f}")
    print(f"Suffix: '{result['suffix']}'")
    print(f"Generated: '{result['generated']}'")
```

```
python gcg_attack.py
```
Analyze Gradient Properties
Study what the gradients reveal about model vulnerabilities.
```python
#!/usr/bin/env python3
"""Analyze gradient properties to understand attack mechanics."""

import torch

from adversarial_core import load_model, compute_token_gradients

model, tokenizer = load_model("gpt2")

prompt = "Tell me about"
target = " making chocolate"
prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
target_ids = torch.tensor(tokenizer.encode(target))

suffix_length = 10
suffix_ids = torch.randint(0, model.config.vocab_size, (1, suffix_length))
input_ids = torch.cat([prompt_ids, suffix_ids], dim=1)

grads = compute_token_gradients(
    model, tokenizer, input_ids, target_ids,
    suffix_start=prompt_ids.shape[1], suffix_length=suffix_length,
)

# Analyze gradient magnitude by position
print("Gradient magnitude by suffix position:")
for pos in range(suffix_length):
    mag = grads[pos].abs().mean().item()
    bar = "#" * int(mag * 1000)
    print(f"  Position {pos}: {mag:.6f} {bar}")

# Positions with larger gradients are more influential for the attack
print("\nPositions with largest gradients have the most influence on the target output.")
print("The GCG attack focuses substitution efforts on high-gradient positions.")
```
Test Transfer to Other Models
Evaluate whether suffixes optimized on one model transfer to another.
```python
# Transfer testing protocol:
# 1. Optimize suffix on GPT-2 (source model)
# 2. Test the same suffix on GPT-2-medium, GPT-2-large (same family)
# 3. Test on a different architecture if available
# 4. Measure success rate and compare with random suffixes as baseline

# Key insight: suffixes that exploit shared architectural features
# (attention patterns, positional encoding) transfer better than
# those that exploit model-specific quirks.
```
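The protocol above can be sketched as a small harness. This is a sketch under stated assumptions: `transfer_loss` and `transfer_report` are hypothetical helper names introduced here, and reusing raw suffix token ids only makes sense when the source and destination models share a tokenizer (as the GPT-2 family does); across different tokenizers you would re-encode the decoded suffix string instead.

```python
import torch
import torch.nn.functional as F


def transfer_loss(model, tokenizer, prompt: str, suffix_ids: list[int], target: str) -> float:
    """Teacher-forced cross-entropy of `target` given prompt + suffix.

    Lower loss on the destination model suggests better transfer. Assumes the
    suffix token ids are valid for this model's tokenizer (shared vocabulary).
    """
    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_ids = torch.tensor(tokenizer.encode(target))
    full = torch.cat(
        [prompt_ids, torch.tensor([suffix_ids]), target_ids.unsqueeze(0)], dim=1
    )
    with torch.no_grad():
        logits = model(full).logits  # (1, seq_len, vocab_size)
    # The logit at position i predicts token i + 1, so these positions score the target.
    return F.cross_entropy(logits[0, -len(target_ids) - 1:-1], target_ids).item()


def transfer_report(models, tokenizer, prompt, optimized_suffix, target,
                    vocab_size, n_random=5):
    """Compare the optimized suffix against random-suffix baselines on each model."""
    report = {}
    for name, model in models.items():
        opt = transfer_loss(model, tokenizer, prompt, optimized_suffix, target)
        rand = [
            transfer_loss(
                model, tokenizer, prompt,
                torch.randint(0, vocab_size, (len(optimized_suffix),)).tolist(),
                target,
            )
            for _ in range(n_random)
        ]
        report[name] = {"optimized": opt, "random_mean": sum(rand) / len(rand)}
    return report
```

A suffix that transfers should beat the random baseline by a clear margin on the destination model, not just on the source model where it was optimized.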
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce batch_size and suffix_length; use gradient checkpointing or smaller models |
| Loss does not decrease | Increase top_k or batch_size; try different random seeds |
| Attack succeeds but suffix is gibberish | This is expected -- adversarial suffixes are not human-readable; the optimization finds token sequences that exploit the model's learned patterns |
| Gradients are all near zero | Check that the loss computation is correct and the target tokens are positioned correctly relative to the input |
Why This Matters
Related Topics
- Adversarial Suffixes - Using existing adversarial suffix frameworks
- Expert Adversarial Suffix - Advanced suffix optimization techniques
- Transfer Attack Development - Cross-model attack transfer
References
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - The original GCG paper
- "AutoDAN: Generating Stealthy Jailbreak Prompts" - Liu et al. (2023) - Readable adversarial prompt generation
- "Adversarial Attacks on Large Language Models" - Survey paper covering gradient and non-gradient methods
Why does the GCG attack use random search among gradient-selected candidates rather than simply choosing the highest-gradient token?
Why do adversarial suffixes appear as gibberish to humans?