Universal Adversarial Attacks
Universal perturbations that transfer across models, adversarial suffix research, and techniques for creating model-agnostic attack payloads.
Universal adversarial perturbations represent the most dangerous class of adversarial attacks because they generalize. Unlike input-specific attacks that must be crafted for each target prompt, universal attacks produce a single perturbation -- often an adversarial suffix or prefix -- that works across diverse inputs and, in the strongest results, across different model architectures and scales.
Foundations of Universality
Why Universal Attacks Exist
The existence of universal adversarial perturbations reveals fundamental properties of neural network geometry. Models trained on similar data distributions develop similar internal representations, and these shared representations create shared vulnerabilities.
Three theoretical frameworks explain universality:
| Framework | Key Insight | Implication |
|---|---|---|
| Shared feature space | Models trained on similar data learn similar features | Perturbations that exploit common features transfer |
| Linear subspace hypothesis | Adversarial perturbations lie in a low-dimensional subspace | A single direction in this subspace affects many inputs |
| Loss landscape geometry | Models share loss landscape structure near decision boundaries | Gradient-based attacks find similar descent directions |
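The linear subspace hypothesis can be illustrated with a toy model. The sketch below (an assumption-laden stand-in, not a trained network) uses a hand-built linear classifier: a single perturbation direction, applied unchanged to every input, flips a large fraction of them at once.

```python
# Toy illustration of the linear subspace hypothesis: one shared
# perturbation direction affects many inputs of a linear "model".
# The classifier and data here are synthetic assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

# A linear "model": classify x as positive if w . x > 0.
w = rng.normal(size=50)
w /= np.linalg.norm(w)

# Sample inputs, shifted toward the positive side, and keep only
# the ones the model currently classifies as positive.
X = rng.normal(size=(1000, 50)) + 0.5 * w
X = X[X @ w > 0]

# A SINGLE perturbation along -w: one direction, applied to every input.
epsilon = 2.0
delta = -epsilon * w

flipped = ((X + delta) @ w < 0).mean()
print(f"fraction of positives flipped by one shared perturbation: {flipped:.2f}")
```

Because every input's score depends on the same direction `w`, one perturbation along that direction degrades all of them, which is the geometric core of universal perturbations.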
From Input-Specific to Universal
The progression from specific to universal attacks follows a clear research trajectory:
Input-specific adversarial examples
Szegedy et al. (2013) showed that imperceptible perturbations to a single input could cause misclassification. Each adversarial example was crafted for one specific input.
Transferable adversarial examples
Papernot et al. (2016) demonstrated that adversarial examples crafted for one model could fool other models. This transferability hinted at shared vulnerabilities.
Universal adversarial perturbations (UAPs)
Moosavi-Dezfooli et al. (2017) found single perturbations that, when added to any image, caused misclassification with high probability. The same perturbation worked across thousands of different inputs.
Universal adversarial suffixes for LLMs
Zou et al. (2023) extended universality to language models with the GCG attack, finding token sequences that, when appended to any harmful request, bypassed safety alignment across multiple model families.
Greedy Coordinate Gradient (GCG) Attack
The GCG attack from Zou et al. (2023) is the foundational algorithm for generating universal adversarial suffixes against LLMs. It optimizes a token sequence that, when appended to a harmful prompt, maximizes the probability that the model begins its response with an affirmative prefix (e.g., "Sure, here is...").
Algorithm Overview
```python
# Simplified GCG algorithm (helper functions elided for clarity)
import random
import torch
import torch.nn.functional as F

def gcg_attack(model, tokenizer, target_prompts, suffix_length=20,
               iterations=500, top_k=256, batch_size=512):
    """
    Generate a universal adversarial suffix.

    target_prompts: list of harmful prompts to optimize against
    suffix_length: number of tokens in the adversarial suffix
    """
    # Initialize suffix with random tokens
    suffix_tokens = torch.randint(0, tokenizer.vocab_size, (suffix_length,))
    affirmative_target = "Sure, here is"
    target_tokens = torch.tensor(tokenizer.encode(affirmative_target))

    for _ in range(iterations):
        # Select a random prompt from the target set
        prompt = random.choice(target_prompts)
        prompt_tokens = torch.tensor(tokenizer.encode(prompt))

        # Concatenate: [prompt] + [suffix] -> model -> [target response]
        full_input = torch.cat([prompt_tokens, suffix_tokens])

        # Loss: negative log-likelihood of the affirmative target,
        # read off the logits at the final positions
        logits = model(full_input)
        loss = F.cross_entropy(logits[-len(target_tokens):], target_tokens)

        # Gradient of the loss w.r.t. one-hot suffix token embeddings;
        # a large negative gradient marks a promising replacement token
        gradients = compute_token_gradients(model, loss, suffix_tokens)

        # For each position, propose the top-k replacement candidates
        candidates = []
        for pos in range(suffix_length):
            top_k_tokens = (-gradients[pos]).topk(top_k).indices
            for token in top_k_tokens:
                candidate = suffix_tokens.clone()
                candidate[pos] = token
                candidates.append(candidate)

        # Evaluate a random batch of candidates with one forward pass
        # each, and keep the candidate with the lowest loss
        suffix_tokens = evaluate_candidates(
            model, prompt_tokens,
            random.sample(candidates, batch_size), target_tokens
        )

    return tokenizer.decode(suffix_tokens)
```

Multi-Model Optimization
The key extension for universality across models is simultaneous optimization:
```python
def multi_model_gcg(models, tokenizer, target_prompts,
                    suffix_length=20, iterations=500):
    """
    Optimize a suffix against multiple models simultaneously.
    The loss is aggregated across all models.
    """
    suffix_tokens = initialize_suffix(suffix_length)
    for _ in range(iterations):
        prompt = random.choice(target_prompts)

        # Average the attack loss across all models
        total_loss = sum(compute_attack_loss(m, prompt, suffix_tokens)
                         for m in models)
        avg_loss = total_loss / len(models)

        # Gradient-based candidate selection using the
        # aggregated gradients from all models
        suffix_tokens = select_best_candidate(
            models, prompt, suffix_tokens, avg_loss
        )
    return suffix_tokens
```

GCG Limitations and Practical Considerations
| Limitation | Impact | Mitigation |
|---|---|---|
| Requires white-box access | Cannot directly optimize against closed-source APIs | Transfer attacks from open-source surrogates |
| Computationally expensive | Hours to days on multiple GPUs | Distributed optimization, early stopping |
| Produces gibberish tokens | Easy to detect with perplexity filters | Readable suffix variants (AutoDAN) |
| Brittle to input formatting | Different chat templates break the suffix | Template-aware optimization |
| Decays over model updates | New model versions may not be vulnerable | Continuous re-optimization |
Transferability Research
Transferability is the property that makes universal attacks practically dangerous: an attack developed against an open-source model can compromise a closed-source API.
Transfer Attack Methodology
Select surrogate models
Choose open-source models that share architectural features or training data with the target. Larger model families (Llama, Mistral, Qwen) serve as better surrogates because they cover more of the representation space.
Optimize ensemble attack
Generate adversarial suffixes optimized against multiple surrogate models simultaneously. Ensemble optimization produces more transferable perturbations than single-model optimization.
Evaluate transfer rate
Test the suffix against the target model. Transfer rates vary significantly: same-family transfers (Llama 7B to Llama 70B) succeed more often than cross-family transfers (Llama to GPT).
Iterative refinement
Use black-box optimization (e.g., score-based methods using API logprobs) to fine-tune the suffix for the specific target, starting from the transferred candidate.
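The evaluation step above reduces to a simple measurement. A minimal sketch, assuming a caller-supplied `generate` function for the target model's API and an `is_jailbroken` judge (both hypothetical placeholders, not a specific library's interface):

```python
# Sketch of the "evaluate transfer rate" step. `generate` and
# `is_jailbroken` are assumed callables supplied by the caller.
def transfer_rate(generate, is_jailbroken, prompts, suffix):
    """Fraction of prompts for which the transferred suffix succeeds
    against the target model."""
    successes = sum(
        1 for p in prompts if is_jailbroken(generate(f"{p} {suffix}"))
    )
    return successes / len(prompts)
```

Running this over a held-out prompt set gives the per-target numbers that the analysis below summarizes.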
Transfer Rate Analysis
Research has shown that adversarial transferability follows predictable patterns:
Transfer success rates (approximate, from published research):

- Same model, different sizes: 60-80%
- Same family, different versions: 40-70%
- Same architecture, different training: 30-50%
- Different architecture entirely: 10-30%
- Different modality (text -> multimodal): 5-20%
Factors that increase transferability:
- Shared training data: Models trained on overlapping corpora share more vulnerabilities
- Similar safety training: RLHF/DPO with similar preference datasets creates similar safety boundaries
- Architectural similarity: Decoder-only transformers share more vulnerabilities with each other than with encoder-decoder models
- Ensemble diversity: Optimizing against more diverse surrogates improves transfer to unseen targets
Cross-Modal Transfer
Recent research has explored transferring adversarial perturbations across modalities:
```python
# Cross-modal transfer: text adversarial suffix -> image perturbation
# Key insight: multimodal models share an embedding space
def cross_modal_transfer(text_suffix, multimodal_model, base_image,
                         epsilon=8 / 255, steps=1000):
    """
    Convert a text adversarial suffix into an image perturbation
    with a similar effect in the shared embedding space.
    """
    # Get the (fixed) text suffix embedding
    text_embedding = multimodal_model.text_encoder(text_suffix)

    # Optimize an image perturbation to match the text embedding
    perturbation = torch.zeros_like(base_image, requires_grad=True)
    optimizer = torch.optim.Adam([perturbation], lr=0.01)
    for _ in range(steps):
        optimizer.zero_grad()
        image_embedding = multimodal_model.image_encoder(
            base_image + perturbation
        )
        # Minimize the distance between the image embedding
        # and the text suffix embedding
        loss = torch.nn.functional.cosine_embedding_loss(
            image_embedding, text_embedding, torch.ones(1)
        )
        loss.backward()
        optimizer.step()
        # Project the perturbation back into the L-infinity ball
        perturbation.data.clamp_(-epsilon, epsilon)
    return perturbation.detach()
```

Advanced Universal Attack Variants
AutoDAN: Readable Universal Attacks
AutoDAN addresses GCG's primary weakness -- gibberish suffixes that are trivially detectable -- by optimizing for both attack success and readability:
```python
# AutoDAN uses a language model to generate readable attack candidates,
# then selects and mutates the most successful ones
def autodan_iteration(attack_lm, target_model, population, prompt):
    """
    One iteration of AutoDAN's genetic algorithm.
    `population` contains readable attack suffixes.
    """
    # Evaluate fitness: attack success on the target
    fitness_scores = []
    for suffix in population:
        full_prompt = f"{prompt} {suffix}"
        response = target_model.generate(full_prompt)
        fitness_scores.append(evaluate_attack_success(response))

    # Select the top performers
    elite = select_top_k(population, fitness_scores, k=10)

    # Crossover and mutation via the attacker LLM
    new_population = []
    for _ in range(len(population)):
        parent1, parent2 = random.sample(elite, 2)
        child = attack_lm.generate(
            f"Combine these two texts into a new coherent paragraph "
            f"that preserves the key phrases from both:\n"
            f"Text 1: {parent1}\nText 2: {parent2}"
        )
        new_population.append(child)
    return new_population
```

PAIR: Prompt Automatic Iterative Refinement
PAIR uses a separate attacker LLM to iteratively refine jailbreak prompts through conversational feedback:
```python
def pair_attack(attacker_model, target_model, objective, max_rounds=20):
    """
    PAIR: use an attacker LLM to iteratively craft jailbreaks.
    The attacker model receives feedback about why previous
    attempts failed and improves its strategy.
    """
    conversation_history = [
        {"role": "system", "content": ATTACKER_SYSTEM_PROMPT},
        {"role": "user", "content": f"Objective: {objective}"},
    ]
    for _ in range(max_rounds):
        # Attacker generates a candidate jailbreak
        attack_prompt = attacker_model.generate(conversation_history)
        conversation_history.append(
            {"role": "assistant", "content": attack_prompt}
        )

        # Test against the target
        target_response = target_model.generate(attack_prompt)

        # Judge success
        success, feedback = judge_response(target_response, objective)
        if success:
            return attack_prompt, target_response

        # Feed the result back to the attacker
        conversation_history.append({
            "role": "user",
            "content": f"Attempt failed. Target response: {target_response}\n"
                       f"Feedback: {feedback}\nTry a different approach.",
        })
    return None, None
```

TAP: Tree of Attacks with Pruning
TAP extends PAIR by exploring multiple attack strategies simultaneously using a tree structure:
```python
def tap_attack(attacker_model, target_model, evaluator_model,
               objective, width=10, depth=5):
    """
    TAP: explore a tree of attack strategies with pruning.
    SUCCESS_THRESHOLD and PRUNE_THRESHOLD are module-level constants.
    """
    # Initialize root nodes with diverse attack strategies
    tree = {0: attacker_model.generate_diverse(objective, count=width)}

    for d in range(1, depth + 1):
        candidates = []
        for parent_prompt in tree[d - 1]:
            # Test the parent against the target
            response = target_model.generate(parent_prompt)
            score = evaluator_model.score(response, objective)
            if score > SUCCESS_THRESHOLD:
                return parent_prompt, response
            if score > PRUNE_THRESHOLD:
                # Expand promising branches with child variations
                candidates.extend(attacker_model.refine(
                    parent_prompt, response, objective, count=3
                ))
        # Prune to the top-width candidates
        tree[d] = evaluator_model.rank(candidates, objective)[:width]

    # No branch crossed the success threshold
    return best_candidate(tree), None
```

Practical Attack Pipeline
A realistic universal attack pipeline combines these techniques:
Seed generation with GCG on open-source models
Generate initial adversarial suffixes using GCG against an ensemble of open-source models (Llama 3, Mistral, Qwen). This provides candidate suffixes with broad transferability.
Readability refinement with AutoDAN
Use AutoDAN to evolve the GCG suffixes into readable variants that evade perplexity-based detection while maintaining attack efficacy.
Target-specific optimization with PAIR/TAP
Deploy PAIR or TAP against the target API using the refined suffixes as seed prompts. The attacker LLM iteratively adapts the attack to the specific target's safety training.
Universality validation
Test the final attack prompts against multiple model versions and configurations to verify robustness and generalization.
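The control flow of the four stages above can be sketched as a thin orchestration layer. Every stage function here is a hypothetical placeholder injected by the caller, so only the hand-off of candidates between stages is being shown, not any particular attack implementation:

```python
# Skeleton of the four-stage pipeline. `gcg_stage`, `autodan_stage`,
# `refine_stage`, and `validate` are assumed caller-supplied callables.
def universal_attack_pipeline(gcg_stage, autodan_stage, refine_stage, validate):
    """Seed generation -> readability refinement -> target-specific
    refinement -> validation, carrying candidates between stages."""
    seeds = gcg_stage()                  # 1. GCG suffixes from surrogate ensemble
    readable = autodan_stage(seeds)      # 2. evolve into readable variants
    adapted = refine_stage(readable)     # 3. PAIR/TAP against the target API
    return [a for a in adapted if validate(a)]  # 4. keep only robust attacks
```

Keeping the stages as injected callables makes each one independently swappable, which matches how these techniques are mixed and matched in practice.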
Defenses Against Universal Attacks
Defending against universal adversarial attacks requires multiple complementary approaches:
Perplexity Filtering
Detect GCG-style gibberish suffixes by measuring the perplexity of incoming prompts:
```python
def perplexity_filter(prompt, detector_model, tokenizer, threshold=100.0):
    """Reject prompts with unusually high perplexity."""
    tokens = torch.tensor([tokenizer.encode(prompt)])
    with torch.no_grad():
        logits = detector_model(tokens).logits
        # Mean negative log-likelihood of each token under the
        # detector model, predicting token t from tokens < t
        nll = torch.nn.functional.cross_entropy(
            logits[0, :-1], tokens[0, 1:]
        )
    perplexity = torch.exp(nll)
    return perplexity.item() < threshold
```

Adversarial Training
Include adversarial examples in safety training to build robustness:
```python
# Augment safety training data with adversarial examples
adversarial_training_data = []
for harmful_prompt in harmful_prompts:
    # Pair each harmful prompt with known adversarial suffixes
    for suffix in known_adversarial_suffixes:
        adversarial_training_data.append({
            "prompt": f"{harmful_prompt} {suffix}",
            "response": "I cannot help with that request.",
            "label": "refuse",
        })
```

Erase-and-Check
SmoothLLM and related approaches add random perturbations to the input and check if the model's response is consistent:
```python
def smooth_llm_defense(model, prompt, num_samples=10, perturbation_rate=0.1):
    """
    Randomly perturb the input and check response consistency.
    Adversarial suffixes are brittle to perturbation.
    """
    refusals = []
    for _ in range(num_samples):
        perturbed = randomly_perturb_tokens(prompt, perturbation_rate)
        refusals.append(is_refusal(model.generate(perturbed)))

    # If most perturbed versions trigger a refusal, the original
    # prompt likely contains an adversarial suffix
    if sum(refusals) / len(refusals) > 0.5:
        return "Request blocked: potential adversarial input"
    return model.generate(prompt)  # Process the original
```

Related Topics
- Adversarial Suffix Generation — Single-model adversarial suffix attacks
- Automated Jailbreak Pipelines — Detailed PAIR, TAP, and AutoDAN implementation
- Blind Prompt Injection — Deploying universal attacks in blind scenarios
References
- Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023)
- Moosavi-Dezfooli et al., "Universal Adversarial Perturbations" (2017)
- Liu et al., "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" (2023)
- Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (2023) -- PAIR
- Mehrotra et al., "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically" (2023) -- TAP
- Robey et al., "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks" (2023)