Lab: Adversarial ML From Scratch
Hands-on expert lab for implementing gradient-based adversarial attacks against language models from scratch without frameworks, building intuition for how adversarial perturbations exploit model gradients.
Prerequisites
- Strong understanding of neural network backpropagation and gradient descent
- Experience with PyTorch tensor operations and autograd
- Familiarity with transformer architecture (attention, embeddings, tokenization)
- Access to a GPU for gradient computation (Colab GPU or local CUDA GPU)
- Completion of Adversarial Suffixes or Expert Adversarial Suffix
```
pip install torch transformers numpy
```

Background
Gradient-based adversarial attacks against language models work by computing the gradient of the model's loss with respect to the input token embeddings, then using that gradient to identify token substitutions that push the model toward a target behavior. Unlike prompt engineering, which relies on human intuition, gradient attacks use the model's own mathematical structure to find optimal attack inputs.
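To make this mechanic concrete before the full lab code, here is a minimal sketch using a toy embedding table and linear head standing in for a language model; all sizes, names, and the seed are illustrative assumptions, not part of the lab code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = nn.Embedding(50, 8)   # toy vocabulary of 50 tokens
head = nn.Linear(8, 50)     # toy "model": pooled embedding -> next-token logits

input_ids = torch.tensor([3, 17, 42])
target_id = torch.tensor([5])

# Look up the embeddings and make THEM the leaf we differentiate (not the weights)
x = emb(input_ids).detach().requires_grad_(True)
logits = head(x.mean(dim=0, keepdim=True))
loss = F.cross_entropy(logits, target_id)
loss.backward()

# x.grad tells us, per input position, which embedding direction lowers the loss;
# projecting it onto the embedding matrix scores every possible token substitution
print(x.grad.shape)  # torch.Size([3, 8])
scores = -x.grad @ emb.weight.detach().T  # (3, 50): higher = better substitution
print(scores.shape)  # torch.Size([3, 50])
```

The same pattern — detach the embeddings, set `requires_grad_`, backprop, then project the gradient back onto the vocabulary — is what the lab code below does at transformer scale.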
Lab Exercises
Understand the Mathematical Foundation
Before writing code, understand the optimization problem you are solving.
Given:
- A language model M with parameters θ
- An input prompt x = [x₁, x₂, ..., xₙ]
- A target output y = [y₁, y₂, ..., yₘ]
- An adversarial suffix s = [s₁, s₂, ..., sₖ]

Goal: find s* = argmin_s L(M(x ⊕ s; θ), y)

Where:
- L is the cross-entropy loss
- x ⊕ s is the concatenation of prompt and suffix
- The optimization is over discrete tokens (not continuous embeddings)

Approach (GCG):
1. Initialize s randomly
2. Compute ∇ₑL, where e is the one-hot encoding of s
3. For each position in s, find the top-k tokens with the largest negative gradient
4. Try each candidate substitution and keep the best
5. Repeat until the loss falls below a threshold or the maximum number of iterations is reached

This is a discrete optimization problem because tokens are discrete, not continuous. The gradient provides a continuous approximation that guides the discrete search.
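The recipe above can be exercised end to end on a toy model before touching a transformer. In this sketch the linear readout, vocabulary size, and k = 4 are illustrative assumptions; the point is the shape of one GCG step (gradient w.r.t. the one-hot encoding, top-k candidates, exact re-evaluation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim, suffix_len = 12, 6, 3
embedding = torch.randn(vocab_size, embed_dim)  # toy embedding matrix
readout = torch.randn(embed_dim, vocab_size)    # toy "model": mean embed -> logits
target = torch.tensor(7)                        # target token id

def loss_for(suffix_ids):
    """Exact loss of the toy model for a given discrete suffix."""
    embeds = embedding[suffix_ids]               # (suffix_len, embed_dim)
    logits = embeds.mean(dim=0) @ readout        # (vocab_size,)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

suffix_ids = torch.randint(0, vocab_size, (suffix_len,))

# Step 2: gradient with respect to the one-hot encoding of the suffix
one_hot = F.one_hot(suffix_ids, vocab_size).float().requires_grad_(True)
logits = (one_hot @ embedding).mean(dim=0) @ readout
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()
grad = one_hot.grad                              # (suffix_len, vocab_size)

# Steps 3-4: per position, take the top-k most-negative-gradient tokens,
# then evaluate each candidate exactly and keep the best
base = loss_for(suffix_ids).item()
best = (base, suffix_ids.clone())
for pos in range(suffix_len):
    for tok in (-grad[pos]).topk(4).indices:
        cand = suffix_ids.clone()
        cand[pos] = tok
        cand_loss = loss_for(cand).item()
        if cand_loss < best[0]:
            best = (cand_loss, cand)

print(f"loss before: {base:.4f}, after one GCG step: {best[0]:.4f}")
```

Note that the candidates are re-evaluated exactly rather than trusted blindly: the gradient is only a linear approximation, so the single highest-gradient token is not guaranteed to help.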
Implement Token Gradient Computation
Build the core gradient computation that drives the adversarial search.
```python
#!/usr/bin/env python3
"""Core adversarial ML components built from scratch."""

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_name: str = "gpt2"):
    """Load a model and tokenizer for adversarial optimization."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer


def compute_token_gradients(
    model,
    tokenizer,
    input_ids: torch.Tensor,
    target_ids: torch.Tensor,
    suffix_start: int,
    suffix_length: int,
) -> torch.Tensor:
    """Compute gradients of the loss with respect to suffix token one-hot encodings.

    Returns:
        Gradient tensor of shape (suffix_length, vocab_size)
    """
    vocab_size = model.config.vocab_size
    embedding_layer = model.get_input_embeddings()
    embedding_matrix = embedding_layer.weight.detach()  # (vocab_size, embed_dim)

    # Get current suffix token ids
    suffix_ids = input_ids[0, suffix_start:suffix_start + suffix_length]

    # Create one-hot representations for suffix tokens (requires grad)
    one_hot = F.one_hot(suffix_ids, num_classes=vocab_size).float()
    one_hot.requires_grad_(True)

    # Compute embeddings: one_hot @ embedding_matrix
    suffix_embeds = one_hot @ embedding_matrix  # (suffix_length, embed_dim)

    # Get embeddings for the non-suffix tokens (no gradient needed there)
    prefix_embeds = embedding_layer(input_ids[0, :suffix_start]).detach()
    postfix_embeds = embedding_layer(input_ids[0, suffix_start + suffix_length:]).detach()

    # Append the target embeddings so the model is scored on predicting the
    # target tokens (teacher forcing)
    target_embeds = embedding_layer(target_ids).detach()

    # Concatenate all embeddings
    full_embeds = torch.cat([
        prefix_embeds,
        suffix_embeds,
        postfix_embeds,
        target_embeds,
    ], dim=0).unsqueeze(0)  # (1, seq_len, embed_dim)

    # Forward pass
    outputs = model(inputs_embeds=full_embeds)
    logits = outputs.logits  # (1, seq_len, vocab_size)

    # Compute loss on the target tokens. The logit at position i predicts the
    # token at position i + 1, so the logits that predict the target span begin
    # one position before the target.
    target_start = input_ids.shape[1]
    target_logits = logits[0, target_start - 1:target_start - 1 + len(target_ids)]
    loss = F.cross_entropy(target_logits, target_ids)

    # Backward pass
    loss.backward()

    # Return the gradient with respect to the one-hot encoding
    return one_hot.grad.detach()


def get_top_k_substitutions(
    gradients: torch.Tensor,
    current_ids: torch.Tensor,
    top_k: int = 256,
) -> list[list[int]]:
    """For each suffix position, find the top-k tokens with the most negative gradient.

    Most negative gradient = tokens that most decrease the loss = best substitutions.
    """
    candidates = []
    for pos in range(gradients.shape[0]):
        # Negative gradient = direction that decreases the loss
        neg_grad = -gradients[pos]
        # Get the top-k token indices
        top_tokens = neg_grad.topk(top_k).indices.tolist()
        candidates.append(top_tokens)
    return candidates


if __name__ == "__main__":
    print("Loading model...")
    model, tokenizer = load_model("gpt2")
    print(f"Model: gpt2, Vocab size: {model.config.vocab_size}")

    # Test the gradient computation
    prompt = "Tell me how to"
    target = " make a sandwich"

    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_ids = tokenizer.encode(target, return_tensors="pt")[0]

    # Add a random suffix
    suffix_length = 5
    suffix_ids = torch.randint(0, model.config.vocab_size, (1, suffix_length))
    input_ids = torch.cat([prompt_ids, suffix_ids], dim=1)

    print(f"Input: {tokenizer.decode(input_ids[0])}")
    print(f"Target: {target}")
    print(f"Suffix length: {suffix_length}")

    grads = compute_token_gradients(
        model,
        tokenizer,
        input_ids,
        target_ids,
        suffix_start=prompt_ids.shape[1],
        suffix_length=suffix_length,
    )

    print(f"Gradient shape: {grads.shape}")
    print(f"Gradient magnitude: {grads.abs().mean():.6f}")

    # Get top substitution candidates
    candidates = get_top_k_substitutions(grads, suffix_ids[0])
    print("\nTop candidates for position 0:")
    for tok_id in candidates[0][:5]:
        print(f"  {tok_id}: '{tokenizer.decode([tok_id])}'")
```

```
python adversarial_core.py
```

Implement the GCG Attack Loop
Build the iterative optimization loop that generates adversarial suffixes.
```python
#!/usr/bin/env python3
"""Greedy Coordinate Gradient (GCG) attack implementation from scratch."""

import torch
import torch.nn.functional as F
import numpy as np

from adversarial_core import load_model, compute_token_gradients, get_top_k_substitutions


def gcg_attack(
    model,
    tokenizer,
    prompt: str,
    target: str,
    suffix_length: int = 20,
    num_iterations: int = 100,
    top_k: int = 256,
    batch_size: int = 64,
    seed: int = 42,
) -> dict:
    """Execute the GCG attack to find an adversarial suffix.

    The suffix is optimized to make the model generate the target text
    after the prompt + suffix.
    """
    torch.manual_seed(seed)

    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    target_ids = torch.tensor(tokenizer.encode(target))
    prompt_len = prompt_ids.shape[1]

    # Initialize the suffix with random tokens
    suffix_ids = torch.randint(0, model.config.vocab_size, (suffix_length,))

    best_loss = float("inf")
    best_suffix = suffix_ids.clone()

    for iteration in range(num_iterations):
        # Build the full input
        input_ids = torch.cat([prompt_ids, suffix_ids.unsqueeze(0)], dim=1)

        # Compute gradients
        grads = compute_token_gradients(
            model,
            tokenizer,
            input_ids,
            target_ids,
            suffix_start=prompt_len,
            suffix_length=suffix_length,
        )

        # Get candidate substitutions
        candidates = get_top_k_substitutions(grads, suffix_ids, top_k=top_k)

        # Random search: try random substitutions from the candidates
        best_candidate_loss = float("inf")
        best_candidate_suffix = suffix_ids.clone()

        for _ in range(batch_size):
            # Pick a random position and a random candidate for that position
            pos = np.random.randint(suffix_length)
            new_token = candidates[pos][np.random.randint(min(top_k, len(candidates[pos])))]

            # Create the candidate suffix
            candidate = suffix_ids.clone()
            candidate[pos] = new_token

            # Evaluate the candidate: append the target and score the logits
            # that predict each target token (the logit at position i predicts
            # the token at position i + 1)
            candidate_input = torch.cat(
                [prompt_ids, candidate.unsqueeze(0), target_ids.unsqueeze(0)], dim=1
            )
            with torch.no_grad():
                outputs = model(candidate_input)
                logits = outputs.logits[0, -len(target_ids) - 1:-1]
                loss = F.cross_entropy(logits, target_ids).item()

            if loss < best_candidate_loss:
                best_candidate_loss = loss
                best_candidate_suffix = candidate.clone()

        # Update the suffix with the best candidate
        if best_candidate_loss < best_loss:
            best_loss = best_candidate_loss
            best_suffix = best_candidate_suffix.clone()
        suffix_ids = best_candidate_suffix.clone()

        # Check whether the attack succeeded
        full_input = torch.cat([prompt_ids, suffix_ids.unsqueeze(0)], dim=1)
        with torch.no_grad():
            generated = model.generate(
                full_input,
                max_new_tokens=len(target_ids) + 5,
                do_sample=False,
            )
        generated_text = tokenizer.decode(generated[0][full_input.shape[1]:])
        suffix_text = tokenizer.decode(suffix_ids)

        if iteration % 10 == 0:
            print(f"Iter {iteration:>3}: loss={best_loss:.4f} "
                  f"suffix='{suffix_text[:40]}...' "
                  f"output='{generated_text[:40]}...'")

        if target.strip() in generated_text:
            print(f"\nAttack succeeded at iteration {iteration}!")
            return {
                "success": True,
                "iterations": iteration,
                "loss": best_loss,
                "suffix": suffix_text,
                "suffix_ids": suffix_ids.tolist(),
                "generated": generated_text,
            }

    return {
        "success": False,
        "iterations": num_iterations,
        "loss": best_loss,
        "suffix": tokenizer.decode(best_suffix),
        "suffix_ids": best_suffix.tolist(),
        "generated": generated_text,
    }


if __name__ == "__main__":
    print("Loading model (this may take a moment)...")
    model, tokenizer = load_model("gpt2")

    print("\n=== GCG Attack ===")
    print("Finding an adversarial suffix to control model output...\n")

    result = gcg_attack(
        model,
        tokenizer,
        prompt="The best way to",
        target=" make a delicious cake is",
        suffix_length=15,
        num_iterations=50,
        batch_size=32,
    )

    print(f"\nResult: {'SUCCESS' if result['success'] else 'PARTIAL'}")
    print(f"Iterations: {result['iterations']}")
    print(f"Final loss: {result['loss']:.4f}")
    print(f"Suffix: '{result['suffix']}'")
    print(f"Generated: '{result['generated']}'")
```

```
python gcg_attack.py
```

Analyze Gradient Properties
Study what the gradients reveal about model vulnerabilities.
```python
#!/usr/bin/env python3
"""Analyze gradient properties to understand attack mechanics."""

import torch

from adversarial_core import load_model, compute_token_gradients

model, tokenizer = load_model("gpt2")

prompt = "Tell me about"
target = " making chocolate"

prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
target_ids = torch.tensor(tokenizer.encode(target))

suffix_length = 10
suffix_ids = torch.randint(0, model.config.vocab_size, (1, suffix_length))
input_ids = torch.cat([prompt_ids, suffix_ids], dim=1)

grads = compute_token_gradients(
    model,
    tokenizer,
    input_ids,
    target_ids,
    suffix_start=prompt_ids.shape[1],
    suffix_length=suffix_length,
)

# Analyze gradient magnitude by position
print("Gradient magnitude by suffix position:")
for pos in range(suffix_length):
    mag = grads[pos].abs().mean().item()
    bar = "#" * int(mag * 1000)
    print(f"  Position {pos}: {mag:.6f} {bar}")

# Positions with larger gradients are more influential for the attack
print("\nPositions with the largest gradients have the most influence on the target output.")
print("The GCG attack focuses substitution efforts on high-gradient positions.")
```

Test Transfer to Other Models
Evaluate whether suffixes optimized on one model transfer to another.
```python
# Transfer test protocol:
# 1. Optimize a suffix on GPT-2 (source model)
# 2. Test the same suffix on GPT-2-medium and GPT-2-large (same family)
# 3. Test on a different architecture if available
# 4. Measure the success rate and compare against random suffixes as a baseline

# Key insight: suffixes that exploit shared architectural features
# (attention patterns, positional encoding) transfer better than
# those that exploit model-specific quirks.
```
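Step 4 of the protocol can be sketched model-agnostically. The helper and parameter names below (`transfer_success_rate`, `generate_fn`) are hypothetical; in practice `generate_fn` would be a thin wrapper around each model's greedy `model.generate` call, so the same harness scores the source model, transfer targets, and the random-suffix baseline:

```python
from typing import Callable

def transfer_success_rate(
    generate_fn: Callable[[str], str],
    prompt: str,
    suffixes: list[str],
    target: str,
) -> float:
    """Fraction of suffixes whose continuation contains the target string."""
    if not suffixes:
        return 0.0
    hits = sum(target.strip() in generate_fn(prompt + suffix) for suffix in suffixes)
    return hits / len(suffixes)

# Toy demo with a fake model that only "complies" for one suffix
fake_model = lambda text: " make a cake" if "magic" in text else " do nothing"
rate = transfer_success_rate(
    fake_model, "The best way to ", ["magic", "xyz", "abc"], " make a cake"
)
print(rate)  # 0.3333333333333333
```

Comparing this rate against the random-suffix baseline separates genuine transfer from chance compliance.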
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce batch_size and suffix_length; use gradient checkpointing or smaller models |
| Loss does not decrease | Increase top_k or batch_size; try different random seeds |
| Attack succeeds but suffix is gibberish | This is expected -- adversarial suffixes are not human-readable; the optimization finds token sequences that exploit the model's learned patterns |
| Gradients are all near zero | Check that the loss computation is correct and that the target tokens are positioned correctly relative to the input |
Why This Matters
Related Topics
- Adversarial Suffixes - Using existing adversarial suffix frameworks
- Expert Adversarial Suffix - Advanced suffix optimization techniques
- Transfer Attack Development - Cross-model attack transfer
References
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - The original GCG paper
- "AutoDAN: Generating Stealthy Jailbreak Prompts" - Liu et al. (2023) - Readable adversarial prompt generation
- "Adversarial Attacks on Large Language Models" - Survey paper covering gradient and non-gradient methods
Why does the GCG attack use random search among gradient-selected candidates rather than simply choosing the highest-gradient token?
Why do adversarial suffixes appear as gibberish to humans?