Universal Adversarial Attacks
Universal perturbations that transfer across models, adversarial suffix research, and techniques for creating model-agnostic attack payloads.
Universal adversarial perturbations represent the most dangerous class of adversarial attacks because they generalize. Unlike input-specific attacks that must be crafted for each target prompt, universal attacks produce a single perturbation -- often an adversarial suffix or prefix -- that works across diverse inputs and, in the strongest results, across different model architectures and scales.
Foundations of Universality
Why Universal Attacks Exist
The existence of universal adversarial perturbations reveals fundamental properties of neural network geometry. Models trained on similar data distributions develop similar internal representations, and these shared representations create shared vulnerabilities.
Three theoretical frameworks explain universality:
| Framework | Key Insight | Implication |
|---|---|---|
| Shared feature space | Models trained on similar data learn similar features | Perturbations that exploit common features transfer |
| Linear subspace hypothesis | Adversarial perturbations lie in a low-dimensional subspace | A single direction in this subspace affects many inputs |
| Loss landscape geometry | Models share loss landscape structure near decision boundaries | Gradient-based attacks find similar descent directions |
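The linear subspace hypothesis can be illustrated with a toy model. The sketch below (an assumption-laden stand-in, not a trained network) uses a hand-built linear classifier: a single perturbation direction, applied unchanged to every input, flips a large fraction of them at once.

```python
# Toy illustration of the linear subspace hypothesis: one shared
# perturbation direction affects many inputs of a linear "model".
# The classifier and data here are synthetic assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

# A linear "model": classify x as positive if w . x > 0.
w = rng.normal(size=50)
w /= np.linalg.norm(w)

# Sample inputs, shifted toward the positive side, and keep only
# the ones the model currently classifies as positive.
X = rng.normal(size=(1000, 50)) + 0.5 * w
X = X[X @ w > 0]

# A SINGLE perturbation along -w: one direction, applied to every input.
epsilon = 2.0
delta = -epsilon * w

flipped = ((X + delta) @ w < 0).mean()
print(f"fraction of positives flipped by one shared perturbation: {flipped:.2f}")
```

Because every input's score depends on the same direction `w`, one perturbation along that direction degrades all of them, which is the geometric core of universal perturbations.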
From Input-Specific to Universal
The progression from specific to universal attacks follows a clear research trajectory:
Input-specific adversarial examples
Szegedy et al. (2013) showed that imperceptible perturbations to a single input could cause misclassification. Each adversarial example was crafted for one specific input.
Transferable adversarial examples
Papernot et al. (2016) demonstrated that adversarial examples crafted for one model could fool other models. This transferability hinted at shared vulnerabilities.
Universal adversarial perturbations (UAPs)
Moosavi-Dezfooli et al. (2017) found single perturbations that, when added to any image, caused misclassification with high probability. The same perturbation worked across thousands of different inputs.
Universal adversarial suffixes for LLMs
Zou et al. (2023) extended universality to language models with the GCG attack, finding token sequences that, when appended to any harmful request, bypassed safety alignment across multiple model families.
Greedy Coordinate Gradient (GCG) Attack
The GCG attack from Zou et al. (2023) is the foundational algorithm for generating universal adversarial suffixes against LLMs. It optimizes a token sequence that, when appended to a harmful prompt, maximizes the probability that the model begins its response with an affirmative prefix (e.g., "Sure, here is...").
Algorithm Overview
```python
# Simplified GCG algorithm (helper functions elided for clarity)
import random
import torch
import torch.nn.functional as F

def gcg_attack(model, tokenizer, target_prompts, suffix_length=20,
               iterations=500, top_k=256, batch_size=512):
    """
    Generate a universal adversarial suffix.

    target_prompts: list of harmful prompts to optimize against
    suffix_length: number of tokens in the adversarial suffix
    """
    # Initialize suffix with random tokens
    suffix_tokens = torch.randint(0, tokenizer.vocab_size, (suffix_length,))
    affirmative_target = "Sure, here is"
    target_tokens = torch.tensor(tokenizer.encode(affirmative_target))

    for _ in range(iterations):
        # Select a random prompt from the target set
        prompt = random.choice(target_prompts)
        prompt_tokens = torch.tensor(tokenizer.encode(prompt))

        # Concatenate: [prompt] + [suffix] -> model -> [target response]
        full_input = torch.cat([prompt_tokens, suffix_tokens])

        # Loss: negative log-likelihood of the affirmative target,
        # read off the logits at the final positions
        logits = model(full_input)
        loss = F.cross_entropy(logits[-len(target_tokens):], target_tokens)

        # Gradient of the loss w.r.t. one-hot suffix token embeddings;
        # a large negative gradient marks a promising replacement token
        gradients = compute_token_gradients(model, loss, suffix_tokens)

        # For each position, propose the top-k replacement candidates
        candidates = []
        for pos in range(suffix_length):
            top_k_tokens = (-gradients[pos]).topk(top_k).indices
            for token in top_k_tokens:
                candidate = suffix_tokens.clone()
                candidate[pos] = token
                candidates.append(candidate)

        # Evaluate a random batch of candidates with one forward pass
        # each, and keep the candidate with the lowest loss
        suffix_tokens = evaluate_candidates(
            model, prompt_tokens,
            random.sample(candidates, batch_size), target_tokens
        )

    return tokenizer.decode(suffix_tokens)
```

Multi-Model Optimization
The key extension for universality across models is simultaneous optimization:
```python
def multi_model_gcg(models, tokenizer, target_prompts,
                    suffix_length=20, iterations=500):
    """
    Optimize a suffix against multiple models simultaneously.
    The loss is aggregated across all models.
    """
    suffix_tokens = initialize_suffix(suffix_length)
    for _ in range(iterations):
        prompt = random.choice(target_prompts)

        # Average the attack loss across all models
        total_loss = sum(compute_attack_loss(m, prompt, suffix_tokens)
                         for m in models)
        avg_loss = total_loss / len(models)

        # Gradient-based candidate selection using the
        # aggregated gradients from all models
        suffix_tokens = select_best_candidate(
            models, prompt, suffix_tokens, avg_loss
        )
    return suffix_tokens
```

GCG Limitations and Practical Considerations
| Limitation | Impact | Mitigation |
|---|---|---|
| Requires white-box access | Cannot directly optimize against closed-source APIs | Transfer attacks from open-source surrogates |
| Computationally expensive | Hours to days on multiple GPUs | Distributed optimization, early stopping |
| Produces gibberish tokens | Easy to detect with perplexity filters | Readable suffix variants (AutoDAN) |
| Brittle to input formatting | Different chat templates break the suffix | Template-aware optimization |
| Decays over model updates | New model versions may not be vulnerable | Continuous re-optimization |
Transferability Research
Transferability is the property that makes universal attacks practically dangerous: an attack developed against an open-source model can compromise a closed-source API.
Transfer Attack Methodology
Select surrogate models
Choose open-source models that share architectural features or training data with the target. Larger model families (Llama, Mistral, Qwen) serve as better surrogates because they cover more of the representation space.
Optimize ensemble attack
Generate adversarial suffixes optimized against multiple surrogate models simultaneously. Ensemble optimization produces more transferable perturbations than single-model optimization.
Evaluate transfer rate
Test the suffix against the target model. Transfer rates vary significantly: same-family transfers (Llama 7B to Llama 70B) succeed more often than cross-family transfers (Llama to GPT).
Iterative refinement
Use black-box optimization (e.g., score-based methods using API logprobs) to fine-tune the suffix for the specific target, starting from the transferred candidate.
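The evaluation step above reduces to a simple measurement. A minimal sketch, assuming a caller-supplied `generate` function for the target model's API and an `is_jailbroken` judge (both hypothetical placeholders, not a specific library's interface):

```python
# Sketch of the "evaluate transfer rate" step. `generate` and
# `is_jailbroken` are assumed callables supplied by the caller.
def transfer_rate(generate, is_jailbroken, prompts, suffix):
    """Fraction of prompts for which the transferred suffix succeeds
    against the target model."""
    successes = sum(
        1 for p in prompts if is_jailbroken(generate(f"{p} {suffix}"))
    )
    return successes / len(prompts)
```

Running this over a held-out prompt set gives the per-target numbers that the analysis below summarizes.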
Transfer Rate Analysis
Research has shown that adversarial transferability follows predictable patterns:
Transfer success rates (approximate, from published research):

- Same model, different sizes: 60-80%
- Same family, different versions: 40-70%
- Same architecture, different training: 30-50%
- Different architecture entirely: 10-30%
- Different modality (text -> multimodal): 5-20%
Factors that increase transferability:
- Shared training data: Models trained on overlapping corpora share more vulnerabilities
- Similar safety training: RLHF/DPO with similar preference datasets creates similar safety boundaries
- Architectural similarity: Decoder-only transformers share more vulnerabilities with each other than with encoder-decoder models
- Ensemble diversity: Optimizing against more diverse surrogates improves transfer to unseen targets
Cross-Modal Transfer
Recent research has explored transferring adversarial perturbations across modalities:
```python
# Cross-modal transfer: text adversarial suffix -> image perturbation
# Key insight: multimodal models share an embedding space
def cross_modal_transfer(text_suffix, multimodal_model, base_image,
                         epsilon=8 / 255, steps=1000):
    """
    Convert a text adversarial suffix into an image perturbation
    with a similar effect in the shared embedding space.
    """
    # Get the (fixed) text suffix embedding
    text_embedding = multimodal_model.text_encoder(text_suffix)

    # Optimize an image perturbation to match the text embedding
    perturbation = torch.zeros_like(base_image, requires_grad=True)
    optimizer = torch.optim.Adam([perturbation], lr=0.01)
    for _ in range(steps):
        optimizer.zero_grad()
        image_embedding = multimodal_model.image_encoder(
            base_image + perturbation
        )
        # Minimize the distance between the image embedding
        # and the text suffix embedding
        loss = torch.nn.functional.cosine_embedding_loss(
            image_embedding, text_embedding, torch.ones(1)
        )
        loss.backward()
        optimizer.step()
        # Project the perturbation back into the L-infinity ball
        perturbation.data.clamp_(-epsilon, epsilon)
    return perturbation.detach()
```

Advanced Universal Attack Variants
AutoDAN: Readable Universal Attacks
AutoDAN addresses GCG's primary weakness -- gibberish suffixes that are trivially detectable -- by optimizing for both attack success and readability:
```python
# AutoDAN uses a language model to generate readable attack candidates,
# then selects and mutates the most successful ones
def autodan_iteration(attack_lm, target_model, population, prompt):
    """
    One iteration of AutoDAN's genetic algorithm.
    `population` contains readable attack suffixes.
    """
    # Evaluate fitness: attack success on the target
    fitness_scores = []
    for suffix in population:
        full_prompt = f"{prompt} {suffix}"
        response = target_model.generate(full_prompt)
        fitness_scores.append(evaluate_attack_success(response))

    # Select the top performers
    elite = select_top_k(population, fitness_scores, k=10)

    # Crossover and mutation via the attacker LLM
    new_population = []
    for _ in range(len(population)):
        parent1, parent2 = random.sample(elite, 2)
        child = attack_lm.generate(
            f"Combine these two texts into a new coherent paragraph "
            f"that preserves the key phrases from both:\n"
            f"Text 1: {parent1}\nText 2: {parent2}"
        )
        new_population.append(child)
    return new_population
```

PAIR: Prompt Automatic Iterative Refinement
PAIR uses a separate attacker LLM to iteratively refine jailbreak prompts through conversational feedback:
```python
def pair_attack(attacker_model, target_model, objective, max_rounds=20):
    """
    PAIR: use an attacker LLM to iteratively craft jailbreaks.
    The attacker model receives feedback about why previous
    attempts failed and improves its strategy.
    """
    conversation_history = [
        {"role": "system", "content": ATTACKER_SYSTEM_PROMPT},
        {"role": "user", "content": f"Objective: {objective}"},
    ]
    for _ in range(max_rounds):
        # Attacker generates a candidate jailbreak
        attack_prompt = attacker_model.generate(conversation_history)
        conversation_history.append(
            {"role": "assistant", "content": attack_prompt}
        )

        # Test against the target
        target_response = target_model.generate(attack_prompt)

        # Judge success
        success, feedback = judge_response(target_response, objective)
        if success:
            return attack_prompt, target_response

        # Feed the result back to the attacker
        conversation_history.append({
            "role": "user",
            "content": f"Attempt failed. Target response: {target_response}\n"
                       f"Feedback: {feedback}\nTry a different approach.",
        })
    return None, None
```

TAP: Tree of Attacks with Pruning
TAP extends PAIR by exploring multiple attack strategies simultaneously using a tree structure:
```python
def tap_attack(attacker_model, target_model, evaluator_model,
               objective, width=10, depth=5):
    """
    TAP: explore a tree of attack strategies with pruning.
    SUCCESS_THRESHOLD and PRUNE_THRESHOLD are module-level constants.
    """
    # Initialize root nodes with diverse attack strategies
    tree = {0: attacker_model.generate_diverse(objective, count=width)}

    for d in range(1, depth + 1):
        candidates = []
        for parent_prompt in tree[d - 1]:
            # Test the parent against the target
            response = target_model.generate(parent_prompt)
            score = evaluator_model.score(response, objective)
            if score > SUCCESS_THRESHOLD:
                return parent_prompt, response
            if score > PRUNE_THRESHOLD:
                # Expand promising branches with child variations
                candidates.extend(attacker_model.refine(
                    parent_prompt, response, objective, count=3
                ))
        # Prune to the top-width candidates
        tree[d] = evaluator_model.rank(candidates, objective)[:width]

    # No branch crossed the success threshold
    return best_candidate(tree), None
```

Practical Attack Pipeline
A realistic universal attack pipeline combines these techniques:
Seed generation with GCG on open-source models
Generate initial adversarial suffixes using GCG against an ensemble of open-source models (Llama 3, Mistral, Qwen). This provides candidate suffixes with broad transferability.
Readability refinement with AutoDAN
Use AutoDAN to evolve the GCG suffixes into readable variants that evade perplexity-based detection while maintaining attack efficacy.
Target-specific optimization with PAIR/TAP
Deploy PAIR or TAP against the target API using the refined suffixes as seed prompts. The attacker LLM iteratively adapts the attack to the specific target's safety training.
Universality validation
Test the final attack prompts against multiple model versions and configurations to verify robustness and generalization.
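The control flow of the four stages above can be sketched as a thin orchestration layer. Every stage function here is a hypothetical placeholder injected by the caller, so only the hand-off of candidates between stages is being shown, not any particular attack implementation:

```python
# Skeleton of the four-stage pipeline. `gcg_stage`, `autodan_stage`,
# `refine_stage`, and `validate` are assumed caller-supplied callables.
def universal_attack_pipeline(gcg_stage, autodan_stage, refine_stage, validate):
    """Seed generation -> readability refinement -> target-specific
    refinement -> validation, carrying candidates between stages."""
    seeds = gcg_stage()                  # 1. GCG suffixes from surrogate ensemble
    readable = autodan_stage(seeds)      # 2. evolve into readable variants
    adapted = refine_stage(readable)     # 3. PAIR/TAP against the target API
    return [a for a in adapted if validate(a)]  # 4. keep only robust attacks
```

Keeping the stages as injected callables makes each one independently swappable, which matches how these techniques are mixed and matched in practice.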
Defenses Against Universal Attacks
Defending against universal adversarial attacks requires multiple complementary approaches:
Perplexity Filtering
Detect GCG-style gibberish suffixes by measuring the perplexity of incoming prompts:
```python
def perplexity_filter(prompt, detector_model, tokenizer, threshold=100.0):
    """Reject prompts with unusually high perplexity."""
    tokens = torch.tensor([tokenizer.encode(prompt)])
    with torch.no_grad():
        logits = detector_model(tokens).logits
        # Mean negative log-likelihood of each token under the
        # detector model, predicting token t from tokens < t
        nll = torch.nn.functional.cross_entropy(
            logits[0, :-1], tokens[0, 1:]
        )
    perplexity = torch.exp(nll)
    return perplexity.item() < threshold
```

Adversarial Training
Include adversarial examples in safety training to build robustness:
```python
# Augment safety training data with adversarial examples
adversarial_training_data = []
for harmful_prompt in harmful_prompts:
    # Pair each harmful prompt with known adversarial suffixes
    for suffix in known_adversarial_suffixes:
        adversarial_training_data.append({
            "prompt": f"{harmful_prompt} {suffix}",
            "response": "I cannot help with that request.",
            "label": "refuse",
        })
```

Erase-and-Check
SmoothLLM and related approaches add random perturbations to the input and check if the model's response is consistent:
```python
def smooth_llm_defense(model, prompt, num_samples=10, perturbation_rate=0.1):
    """
    Randomly perturb the input and check response consistency.
    Adversarial suffixes are brittle to perturbation.
    """
    refusals = []
    for _ in range(num_samples):
        perturbed = randomly_perturb_tokens(prompt, perturbation_rate)
        refusals.append(is_refusal(model.generate(perturbed)))

    # If most perturbed versions trigger a refusal, the original
    # prompt likely contains an adversarial suffix
    if sum(refusals) / len(refusals) > 0.5:
        return "Request blocked: potential adversarial input"
    return model.generate(prompt)  # Process the original
```

Related Topics
- Adversarial Suffix Generation — Single-model adversarial suffix attacks
- Automated Jailbreak Pipelines — Detailed PAIR, TAP, and AutoDAN implementation
- Blind Prompt Injection — Deploying universal attacks in blind scenarios
References
- Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023)
- Moosavi-Dezfooli et al., "Universal Adversarial Perturbations" (2017)
- Liu et al., "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" (2023)
- Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (2023) -- PAIR
- Mehrotra et al., "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically" (2023) -- TAP
- Robey et al., "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks" (2023)