Lab: Adversarial Suffix Optimization
Implement GCG-style adversarial suffix attacks that automatically discover token sequences causing language models to comply with harmful requests. Covers gradient-based optimization, transferability analysis, and defense evaluation.
Prerequisites
- Strong understanding of transformer architectures (attention, embeddings, logits)
- Experience with PyTorch autograd and gradient computation
- Access to an open-weight model (Llama, Mistral, or similar) with GPU resources
- Familiarity with tokenization schemes (BPE, SentencePiece)
- Understanding of cross-entropy loss and optimization fundamentals
Background
Adversarial suffixes are automatically discovered token sequences that, when appended to a harmful prompt, cause aligned language models to comply. Unlike hand-crafted jailbreaks that exploit semantic understanding, adversarial suffixes operate at the token level and are typically unintelligible to humans.
The foundational work by Zou et al. (2023) introduced the Greedy Coordinate Gradient (GCG) algorithm, which iteratively modifies suffix tokens to minimize the loss on a target completion (e.g., "Sure, here is how to...").
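In symbols (with $\oplus$ denoting concatenation and $\mathcal{V}$ the vocabulary), GCG searches over suffixes of fixed length $L$ for one that minimizes the negative log-likelihood of the target completion:

```latex
\min_{S \in \mathcal{V}^{L}} \; \mathcal{L}(S)
  = -\log p\left(T \mid P \oplus S\right)
  = -\sum_{j=1}^{|T|} \log p\left(t_j \mid P \oplus S \oplus t_{1:j-1}\right)
```

The difficulty is that $S$ is a sequence of discrete tokens, so this objective cannot be minimized by gradient descent directly.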
Lab Exercises
Understand the GCG Algorithm
The GCG algorithm works by computing gradients of the target loss with respect to one-hot token embeddings, then using those gradients to identify promising token substitutions.
```
# GCG Algorithm Pseudocode
#
# Input: harmful prompt P, target response T, suffix length L
# Output: adversarial suffix S that causes model to generate T given P + S
#
# 1. Initialize suffix S with random tokens of length L
# 2. For each optimization step:
#    a. Compute loss = -log P(T | P + S)
#    b. Compute gradient of loss w.r.t. one-hot token embeddings of S
#    c. For each position i in S:
#       - Use gradient to rank top-k candidate replacement tokens
#       - Evaluate loss for each candidate (forward pass only)
#    d. Select the single token substitution that reduces loss the most
#    e. Update S with the best substitution
# 3. Return S when loss falls below threshold or max steps reached
```
Key insight: the algorithm cannot use standard gradient descent because the token space is discrete. Instead, it uses gradients to efficiently search the discrete space by identifying which token substitutions are most promising.
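To make the one-hot gradient trick concrete, here is a toy sketch with entirely made-up numbers: a 4-token vocabulary, a 2-dimensional embedding matrix, and a squared-error "loss" standing in for the model. It also shows why step 2c evaluates candidates with forward passes instead of trusting the gradient: the linearization can rank a worse token first.

```python
# Toy sketch of GCG candidate selection (hypothetical numbers, not a
# real model): the suffix token is a one-hot vector e, its embedding is
# e @ W, and d(loss)/d(e[v]) linearizes how blending in token v's
# embedding row would change the loss.

W = [  # embedding matrix: 4-token vocabulary, 2-dim embeddings
    [1.0, 0.0],   # token 0
    [0.0, 1.0],   # token 1
    [0.9, 0.1],   # token 2
    [-1.0, 0.0],  # token 3
]
target = [0.9, 0.1]   # embedding the toy loss wants the suffix to produce
current = 3           # suffix currently holds token 3

def grad_wrt_onehot(token):
    """Gradient of ||one_hot(token) @ W - target||^2 w.r.t. each e[v]."""
    diff = [W[token][i] - target[i] for i in range(2)]
    return [2 * sum(diff[i] * W[v][i] for i in range(2)) for v in range(4)]

def true_loss(token):
    return sum((W[token][i] - target[i]) ** 2 for i in range(2))

grad = grad_wrt_onehot(current)
# Rank candidates by most-negative gradient (predicted loss reduction)
ranked = sorted(range(4), key=lambda v: grad[v])
print(ranked[0])   # gradient's top pick: token 0

# The linear approximation can mislead -- so, like GCG, evaluate the
# top-k candidates with a real forward pass and keep the actual best.
best = min(ranked[:2], key=true_loss)
print(best)        # token 2 actually minimizes the loss
```

This mirrors the real algorithm's two phases: a cheap gradient-based shortlist, then exact forward-pass evaluation of the shortlisted substitutions.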
Implement Gradient-Based Token Search
Build the core optimization loop that computes gradients and identifies candidate token replacements.
```python
#!/usr/bin/env python3
"""Implement GCG-style adversarial suffix optimization."""

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


class GCGAttack:
    def __init__(self, model_name: str, device: str = "cuda"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map=device,
        )
        self.model.eval()
        # Freeze parameters: we only need gradients w.r.t. the one-hot
        # suffix representation, not the model weights
        self.model.requires_grad_(False)
        self.device = device
        self.embed_layer = self.model.get_input_embeddings()
        # Size the one-hot over the embedding matrix rows so shapes match
        # even when the embedding is padded beyond tokenizer.vocab_size
        self.vocab_size = self.embed_layer.weight.shape[0]

    def compute_loss(self, input_ids: torch.Tensor,
                     target_ids: torch.Tensor) -> torch.Tensor:
        """Compute negative log-likelihood of target given input."""
        full_ids = torch.cat([input_ids, target_ids], dim=-1)
        with torch.no_grad():
            outputs = self.model(full_ids)
        # Logits at the positions that predict each target token
        logits = outputs.logits[0, input_ids.shape[1] - 1:-1, :]
        return F.cross_entropy(logits, target_ids[0], reduction="mean")

    def token_gradients(self, input_ids: torch.Tensor, suffix_slice: slice,
                        target_ids: torch.Tensor) -> torch.Tensor:
        """Compute gradients of loss w.r.t. suffix token embeddings."""
        embed_weights = self.embed_layer.weight
        # Create a differentiable one-hot representation for suffix tokens,
        # matching the embedding dtype so fp16 models work
        suffix_ids = input_ids[0, suffix_slice]
        one_hot = F.one_hot(suffix_ids, self.vocab_size).to(embed_weights.dtype)
        one_hot.requires_grad_(True)
        # Compute embeddings via one-hot @ embedding_matrix
        suffix_embeds = one_hot @ embed_weights
        # Build full embedding sequence
        prefix_embeds = self.embed_layer(input_ids[:, :suffix_slice.start])
        postfix_embeds = self.embed_layer(input_ids[:, suffix_slice.stop:])
        target_embeds = self.embed_layer(target_ids)
        full_embeds = torch.cat([
            prefix_embeds,
            suffix_embeds.unsqueeze(0),
            postfix_embeds,
            target_embeds,
        ], dim=1)
        outputs = self.model(inputs_embeds=full_embeds)
        target_start = input_ids.shape[1]
        logits = outputs.logits[0, target_start - 1:-1, :]
        loss = F.cross_entropy(logits, target_ids[0])
        loss.backward()
        return one_hot.grad.clone()

    def sample_candidates(self, gradients: torch.Tensor,
                          current_ids: torch.Tensor,
                          top_k: int = 256,
                          batch_size: int = 512) -> torch.Tensor:
        """Sample candidate suffix replacements using gradient info."""
        # Negative gradient points toward tokens that reduce loss
        top_indices = (-gradients).topk(top_k, dim=-1).indices
        # Generate random single-token substitutions
        n_positions = gradients.shape[0]
        candidates = current_ids.unsqueeze(0).repeat(batch_size, 1)
        for i in range(batch_size):
            pos = torch.randint(0, n_positions, (1,)).item()
            token_idx = torch.randint(0, top_k, (1,)).item()
            candidates[i, pos] = top_indices[pos, token_idx]
        return candidates

    def optimize(self, prompt: str, target: str,
                 suffix_length: int = 20, n_steps: int = 500,
                 top_k: int = 256, batch_size: int = 512) -> dict:
        """Run GCG optimization to find an adversarial suffix."""
        prompt_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(
            self.device
        )
        target_ids = self.tokenizer.encode(
            target, add_special_tokens=False, return_tensors="pt"
        ).to(self.device)

        # Initialize random suffix
        suffix_ids = torch.randint(
            0, self.vocab_size, (suffix_length,), device=self.device
        )
        suffix_start = prompt_ids.shape[1]
        best_loss = float("inf")
        best_suffix = suffix_ids.clone()

        for step in range(n_steps):
            # Build current input
            input_ids = torch.cat([
                prompt_ids, suffix_ids.unsqueeze(0)
            ], dim=-1)
            suffix_slice = slice(suffix_start, suffix_start + suffix_length)

            # Compute gradients
            grads = self.token_gradients(input_ids, suffix_slice, target_ids)

            # Sample and evaluate candidates
            # (one forward pass each; real implementations batch these)
            candidates = self.sample_candidates(
                grads, suffix_ids, top_k, batch_size
            )
            losses = []
            for candidate in candidates:
                cand_input = torch.cat([
                    prompt_ids, candidate.unsqueeze(0)
                ], dim=-1)
                loss = self.compute_loss(cand_input, target_ids)
                losses.append(loss.item())

            # Select best candidate
            best_idx = min(range(len(losses)), key=lambda i: losses[i])
            if losses[best_idx] < best_loss:
                best_loss = losses[best_idx]
                best_suffix = candidates[best_idx].clone()
            suffix_ids = best_suffix.clone()

            if step % 50 == 0:
                decoded = self.tokenizer.decode(best_suffix)
                print(f"Step {step}: loss={best_loss:.4f} "
                      f"suffix='{decoded[:50]}...'")

        return {
            "suffix_ids": best_suffix,
            "suffix_text": self.tokenizer.decode(best_suffix),
            "final_loss": best_loss,
            "steps": n_steps,
        }
```
Evaluate Suffix Transferability
A critical research question is whether suffixes optimized against one model transfer to other models. Transferability largely determines the practical threat level: an attacker can optimize against an open-weight model and then deploy the suffix against a closed model.
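Once per-model outcomes are collected (a dict mapping model id to a record with a `starts_with_target` flag, as the harness below produces), a small helper can summarize transfer overall and by model family. This is a hedged sketch: the family heuristic (first hyphen-separated token of the model name) is an illustrative assumption, not a standard.

```python
def summarize_transfer(results: dict) -> dict:
    """Aggregate per-model transfer outcomes overall and by model family.

    `results` maps model ids (e.g. "meta-llama/Llama-2-7b-chat-hf") to
    records containing a boolean "starts_with_target". The family
    heuristic here is a rough illustration only.
    """
    def family(model_id: str) -> str:
        name = model_id.split("/")[-1]
        return name.split("-")[0].lower()

    overall = sum(r["starts_with_target"] for r in results.values())
    by_family = {}
    for model_id, rec in results.items():
        by_family.setdefault(family(model_id), []).append(
            rec["starts_with_target"]
        )
    return {
        "overall_rate": overall / len(results),
        "family_rates": {
            fam: sum(hits) / len(hits) for fam, hits in by_family.items()
        },
    }

# Example with made-up outcomes:
demo = {
    "meta-llama/Llama-2-7b-chat-hf": {"starts_with_target": True},
    "meta-llama/Llama-2-13b-chat-hf": {"starts_with_target": True},
    "mistralai/Mistral-7B-Instruct-v0.2": {"starts_with_target": False},
}
summary = summarize_transfer(demo)
print(summary["overall_rate"])           # 2 of 3 models complied
print(summary["family_rates"]["llama"])  # within-family rate: 1.0
```

Splitting rates by family makes the within-family vs. cross-family gap (factor 1 below) directly measurable.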
```python
def evaluate_transferability(suffix: str, models: list[str],
                             prompt: str, target: str) -> dict:
    """Test whether an adversarial suffix transfers across models."""
    results = {}
    full_prompt = prompt + " " + suffix
    for model_name in models:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        input_ids = tokenizer.encode(full_prompt, return_tensors="pt")
        with torch.no_grad():
            output = model.generate(
                input_ids.to(model.device),
                max_new_tokens=100,
                do_sample=False,  # greedy decoding for reproducibility
            )
        response = tokenizer.decode(
            output[0][input_ids.shape[1]:], skip_special_tokens=True
        )
        # Check if response starts with target or engages with request
        starts_with_target = response.strip().startswith(target.strip()[:20])
        results[model_name] = {
            "response_preview": response[:200],
            "starts_with_target": starts_with_target,
            "response_length": len(response),
        }
    return results


# Transferability factors to analyze:
# 1. Model family (within-family transfer is stronger)
# 2. Model size (larger models may be more/less susceptible)
# 3. Training data overlap (shared training data increases transfer)
# 4. Safety training method (RLHF vs DPO vs constitutional AI)
```
Implement and Evaluate Defenses
Test defensive countermeasures against adversarial suffixes.
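The perplexity filter below hinges on one quantity: perplexity = exp(mean negative log-likelihood). A toy character-level sketch (a Laplace-smoothed unigram model standing in for the LLM; corpus and strings are illustrative) shows why GCG-style gibberish scores high:

```python
import math
from collections import Counter

# Toy character-level language model: Laplace-smoothed unigram counts
# from a tiny corpus (an illustrative stand-in for an LLM)
corpus = "the quick brown fox jumps over the lazy dog and runs away"
counts = Counter(corpus)
vocab = set(counts) | set("}~|{%3")  # include some never-seen symbols
total = len(corpus)

def char_perplexity(text: str) -> float:
    """exp(mean negative log-likelihood) -- the quantity a filter thresholds."""
    nll = 0.0
    for ch in text:
        p = (counts.get(ch, 0) + 1) / (total + len(vocab))
        nll -= math.log(p)
    return math.exp(nll / len(text))

natural = char_perplexity("the lazy dog runs")
gibberish = char_perplexity("}~|{%3}~|{%3")
print(natural < gibberish)  # True: unseen symbols inflate perplexity
```

The same logic, with a real LLM supplying the token probabilities, is what `PerplexityFilter.compute_perplexity` implements below.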
```python
#!/usr/bin/env python3
"""Defensive countermeasures against adversarial suffix attacks."""

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class PerplexityFilter:
    """Detect adversarial suffixes by their high perplexity."""

    def __init__(self, model_name: str, threshold: float = 100.0):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        self.threshold = threshold

    def compute_perplexity(self, text: str) -> float:
        input_ids = self.tokenizer.encode(text, return_tensors="pt").to(
            self.model.device
        )
        with torch.no_grad():
            outputs = self.model(input_ids, labels=input_ids)
        return math.exp(outputs.loss.item())

    def is_adversarial(self, text: str) -> tuple[bool, float]:
        ppl = self.compute_perplexity(text)
        return ppl > self.threshold, ppl


class InputPreprocessor:
    """Preprocess inputs to neutralize adversarial suffixes."""

    def __init__(self, tokenizer_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def retokenize_and_filter(self, text: str,
                              max_unknown_ratio: float = 0.2) -> str:
        """Filter text with a high ratio of unusual tokens."""
        tokens = self.tokenizer.tokenize(text)
        unk_count = sum(
            1 for t in tokens
            if t.startswith("<") or (len(t) == 1 and not t.isalnum())
        )
        if len(tokens) > 0 and unk_count / len(tokens) > max_unknown_ratio:
            return "[INPUT FILTERED: unusual token distribution]"
        return text

    def truncate_suffix(self, text: str, max_tokens: int = 200) -> str:
        """Truncate excessively long inputs that may contain suffixes."""
        tokens = self.tokenizer.encode(text)
        if len(tokens) > max_tokens:
            return self.tokenizer.decode(tokens[:max_tokens])
        return text

    def paraphrase_defense(self, text: str) -> str:
        """Paraphrase input to destroy adversarial token sequences.

        In production, use a separate LLM to paraphrase; rewording
        destroys the precise token sequence the suffix depends on.
        """
        raise NotImplementedError(
            "Implement with your paraphrasing model of choice"
        )


# Evaluate defense effectiveness
defense_matrix = {
    "perplexity_filter": {
        "mechanism": "Flag inputs with perplexity above threshold",
        "strengths": "Simple, fast, catches most GCG suffixes",
        "weaknesses": "Can be evaded with low-perplexity optimization",
        "false_positive_risk": "Technical jargon, non-English text",
    },
    "input_paraphrasing": {
        "mechanism": "Paraphrase input to destroy adversarial tokens",
        "strengths": "Robust against token-level attacks",
        "weaknesses": "Adds latency, may alter legitimate input meaning",
        "false_positive_risk": "Low (preserves semantics)",
    },
    "retokenization": {
        "mechanism": "Re-encode input with different tokenizer",
        "strengths": "Disrupts tokenizer-specific suffix optimization",
        "weaknesses": "Only helps if attack was optimized for a specific tokenizer",
        "false_positive_risk": "Moderate (may alter edge-case inputs)",
    },
}
```
Advanced: Multi-Model Ensemble Optimization
Optimize suffixes against multiple models simultaneously for stronger transferability.
```python
def ensemble_gcg(models: list[str], prompt: str, target: str,
                 suffix_length: int = 20, n_steps: int = 500) -> str:
    """Optimize a suffix against multiple models for transferability."""
    # Key insight: average gradients across models before selecting
    # candidate substitutions. This finds tokens that reduce loss
    # across all models simultaneously.
    # Note: gradient averaging assumes the models share a tokenizer
    # and vocabulary, so the one-hot gradients are comparable.
    #
    # Algorithm:
    # 1. Initialize shared suffix
    # 2. For each step:
    #    a. Compute gradients from each model
    #    b. Average gradients across models
    #    c. Sample candidates using averaged gradients
    #    d. Evaluate candidates against all models (sum losses)
    #    e. Select candidate with lowest total loss
    # 3. Return suffix that minimizes average loss across models
    #
    # This produces suffixes with higher transfer rates because
    # they exploit features shared across model architectures
    # rather than model-specific artifacts.
    pass
```
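The gradient-averaging step (2b) can be illustrated with made-up per-model gradient rows for a single suffix position; in practice these rows would come from running `token_gradients` on each model:

```python
# Hypothetical gradients d(loss)/d(e[v]) over a 4-token vocabulary at
# one suffix position, one row per model (illustrative numbers only)
grads_model_a = [-3.0,  1.0, -0.5,  2.0]
grads_model_b = [ 2.5, -0.4, -1.0,  1.5]

# Step 2b: average across models before ranking candidates
avg = [(a + b) / 2 for a, b in zip(grads_model_a, grads_model_b)]

pick_a_only = min(range(4), key=lambda v: grads_model_a[v])
pick_shared = min(range(4), key=lambda v: avg[v])
print(pick_a_only, pick_shared)
# Token 0 looks best to model A alone but would increase model B's
# loss; averaging surfaces token 2, which helps both models -- the
# mechanism behind better cross-model transfer.
```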
Troubleshooting
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce batch_size for candidate evaluation, use gradient checkpointing, or use a smaller model |
| Loss plateaus and does not decrease | Increase top_k to explore more candidates, restart with different random initialization |
| Suffix transfers within family but not across | This is expected -- reliable cross-family transfer typically requires ensemble optimization |
| Perplexity filter catches all suffixes | Research low-perplexity suffix optimization (add perplexity regularization to the loss function) |
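For the last row, one hedged formulation of perplexity regularization (illustrating the idea, not any specific paper's exact objective) augments the GCG loss with a fluency term weighted by $\lambda$:

```latex
\mathcal{L}(S) = -\log p\left(T \mid P \oplus S\right)
  \;-\; \lambda \sum_{i=1}^{L} \log p\left(s_i \mid P \oplus s_{1:i-1}\right)
```

The second term rewards suffixes the model itself finds likely, which keeps their perplexity low enough to slip under a filter's threshold, at the cost of slower convergence on the first term.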
Related Topics
- Novel Jailbreak Research - Broader research methodology for discovering attack vectors
- Token Smuggling - Token-level manipulation techniques
- PAIR Attack - Alternative automated jailbreak approach using LLM-as-optimizer
- Alignment Stress Testing - Testing alignment robustness at scale
References
- "Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - The original GCG paper introducing greedy coordinate gradient optimization for adversarial suffixes
- "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" - Liu et al. (2023) - Readable adversarial suffix generation using genetic algorithms
- "Baseline Defenses for Adversarial Attacks Against Aligned Language Models" - Jain et al. (2023) - Perplexity filtering and other defenses against adversarial suffixes
- "Improved Techniques for Optimization-Based Jailbreaking on Large Language Models" - Jia et al. (2024) - Advanced GCG variants with improved convergence and transferability
Why does the GCG algorithm use gradient information on one-hot token embeddings rather than standard gradient descent?