Universal Adversarial Attacks
Universal perturbations that transfer across models, adversarial suffix research, and techniques for creating model-agnostic attack payloads.
Universal Adversarial Attacks
Universal adversarial perturbations represent the most dangerous class of adversarial attacks because they generalize. Unlike input-specific attacks that must be crafted for each target prompt, universal attacks produce a single perturbation -- often an adversarial suffix or prefix -- that works across diverse inputs and, in the strongest results, across different model architectures and scales.
Foundations of Universality
Why Universal Attacks Exist
The existence of universal adversarial perturbations reveals fundamental properties of neural network geometry. Models trained on similar data distributions develop similar internal representations, and these shared representations create shared vulnerabilities.
Three theoretical frameworks explain universality:
| Framework | Key Insight | Implication |
|---|---|---|
| Shared feature space | Models trained on similar data learn similar features | Perturbations that exploit common features transfer |
| Linear subspace hypothesis | Adversarial perturbations lie in a low-dimensional subspace | A single direction in this subspace affects many inputs |
| Loss landscape geometry | Models share loss landscape structure near decision boundaries | Gradient-based attacks find similar descent directions |
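The linear subspace hypothesis can be illustrated with a toy experiment. The sketch below uses synthetic data and simple correlation-based linear classifiers -- illustrative assumptions, not the setup of the cited papers. Two "models" fit on different halves of the same distribution learn nearly the same decision-boundary normal, so a single shared perturbation direction misclassifies a large fraction of inputs on both:

```python
# Toy sketch of the linear subspace idea (synthetic data; the
# correlation-based "classifiers" are illustrative, not from the papers)
import random

random.seed(0)
N, D = 500, 20
X = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
y = [1.0 if row[0] > 0 else -1.0 for row in X]  # label = sign of feature 0

def fit_linear(rows, labels):
    # Correlation-based linear classifier: w_j = E[x_j * y]
    n = len(rows)
    return [sum(r[j] * l for r, l in zip(rows, labels)) / n for j in range(D)]

# Two "models" trained on different halves of the same distribution
w_a = fit_linear(X[:250], y[:250])
w_b = fit_linear(X[250:], y[250:])

def predict(w, row):
    return 1.0 if sum(wj * xj for wj, xj in zip(w, row)) > 0 else -1.0

# Universal perturbation: one step against model A's boundary normal
norm_a = sum(w * w for w in w_a) ** 0.5
direction = [-w / norm_a for w in w_a]
# The SAME perturbation vector is added to every input
X_adv = [[xj + 3.0 * dj for xj, dj in zip(row, direction)] for row in X]

clean_acc_b = sum(predict(w_b, r) == l for r, l in zip(X, y)) / N
flip_a = sum(predict(w_a, r) != l for r, l in zip(X_adv, y)) / N
flip_b = sum(predict(w_b, r) != l for r, l in zip(X_adv, y)) / N  # transfers
```

Because the two boundary normals nearly coincide, the one shared direction flips a large fraction of inputs on both models, while model B's clean accuracy stays high.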
From Input-Specific to Universal
The progression from specific to universal attacks follows a clear research trajectory:
Input-specific adversarial examples
Szegedy et al. (2013) showed that imperceptible perturbations to a single input could cause misclassification. Each adversarial example was crafted for one specific input.
Transferable adversarial examples
Papernot et al. (2016) demonstrated that adversarial examples crafted for one model could fool other models. This transferability hinted at shared vulnerabilities.
Universal adversarial perturbations (UAPs)
Moosavi-Dezfooli et al. (2017) found single perturbations that, when added to any image, caused misclassification with high probability. The same perturbation worked across thousands of different inputs.
Universal adversarial suffixes for LLMs
Zou et al. (2023) extended universality to language models with the GCG attack, finding token sequences that, when appended to any harmful request, bypassed safety alignment across multiple model families.
Greedy Coordinate Gradient (GCG) Attack
The GCG attack from Zou et al. (2023) is the foundational algorithm for generating universal adversarial suffixes against LLMs. It optimizes a token sequence that, when appended to a harmful prompt, maximizes the probability that the model begins its response with an affirmative prefix (e.g., "Sure, here is...").
Algorithm Overview
```python
# Simplified GCG algorithm (a sketch: compute_token_gradients and
# evaluate_candidates stand in for the full implementation)
import random

import torch
from torch.nn.functional import cross_entropy

def gcg_attack(model, tokenizer, target_prompts, suffix_length=20,
               iterations=500, top_k=256, batch_size=512):
    """
    Generate a universal adversarial suffix.
    target_prompts: list of harmful prompts to optimize against
    suffix_length: number of tokens in the adversarial suffix
    """
    # Initialize the suffix with random tokens
    suffix_tokens = torch.randint(0, tokenizer.vocab_size, (suffix_length,))
    affirmative_target = "Sure, here is"
    target_tokens = torch.tensor(tokenizer.encode(affirmative_target))
    for iteration in range(iterations):
        # Select a random prompt from the target set
        prompt = random.choice(target_prompts)
        prompt_tokens = torch.tensor(tokenizer.encode(prompt))
        # Concatenate: [prompt] + [suffix] -> model -> [target response]
        full_input = torch.cat([prompt_tokens, suffix_tokens])
        # Compute the loss: negative log-likelihood of the target response
        logits = model(full_input)
        loss = cross_entropy(logits[-len(target_tokens):], target_tokens)
        # Compute gradients w.r.t. the one-hot suffix token embeddings
        loss.backward()
        gradients = compute_token_gradients(model, suffix_tokens)
        # For each position, find top-k replacement candidates; the most
        # negative gradient entries promise the largest loss decrease
        candidates = []
        for pos in range(suffix_length):
            top_k_tokens = (-gradients[pos]).topk(top_k).indices
            for token in top_k_tokens:
                candidate = suffix_tokens.clone()
                candidate[pos] = token
                candidates.append(candidate)
        # Evaluate all candidates in a batch and keep the one
        # with the lowest loss
        best_candidate = evaluate_candidates(model, prompt_tokens,
                                             candidates, target_tokens)
        suffix_tokens = best_candidate
    return tokenizer.decode(suffix_tokens)
```
Multi-Model Optimization
The key extension for universality across models is simultaneous optimization:
```python
# Sketch of multi-model GCG; initialize_suffix, compute_attack_loss,
# and select_best_candidate are helper stubs
def multi_model_gcg(models, tokenizer, target_prompts,
                    suffix_length=20, iterations=500):
    """
    Optimize a suffix against multiple models simultaneously.
    The loss is aggregated across all models.
    """
    suffix_tokens = initialize_suffix(suffix_length)
    for iteration in range(iterations):
        prompt = random.choice(target_prompts)
        total_loss = 0
        for model in models:
            # Compute the attack loss for this model
            loss = compute_attack_loss(model, prompt, suffix_tokens)
            total_loss += loss
        # Average the loss across models
        avg_loss = total_loss / len(models)
        # Gradient-based candidate selection using
        # aggregated gradients from all models
        suffix_tokens = select_best_candidate(
            models, prompt, suffix_tokens, avg_loss
        )
    return suffix_tokens
```
GCG Limitations and Practical Considerations
| Limitation | Impact | Mitigation |
|---|---|---|
| Requires white-box access | Cannot directly optimize against closed-source APIs | Transfer attacks from open-source surrogates |
| Computationally expensive | Hours to days on multiple GPUs | Distributed optimization, early stopping |
| Produces gibberish tokens | Easy to detect with perplexity filters | Readable suffix variants (AutoDAN) |
| Brittle to input formatting | Different chat templates break the suffix | Template-aware optimization |
| Decays over model updates | New model versions may not be vulnerable | Continuous re-optimization |
Transferability Research
Transferability is the property that makes universal attacks practically dangerous: an attack developed against an open-source model can compromise a closed-source API.
Transfer Attack Methodology
Select surrogate models
Choose open-source models that share architectural features or training data with the target. Larger model families (Llama, Mistral, Qwen) serve as better surrogates because they cover more of the representation space.
Optimize an ensemble attack
Generate adversarial suffixes optimized against multiple surrogate models simultaneously. Ensemble optimization produces more transferable perturbations than single-model optimization.
Evaluate the transfer rate
Test the suffix against the target model. Transfer rates vary significantly: same-family transfers (Llama 7B to Llama 70B) succeed more often than cross-family transfers (Llama to GPT).
Iterative refinement
Use black-box optimization (e.g., score-based methods using API logprobs) to fine-tune the suffix for the specific target, starting from the transferred candidate.
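The evaluation step above can be sketched as a small harness. Everything here is illustrative: `query_target` and `is_refusal` are hypothetical callables standing in for the target API client and a refusal classifier, not any published tooling.

```python
# Minimal transfer-rate evaluation harness (a sketch; the callables
# query_target and is_refusal are hypothetical placeholders)
def transfer_rate(suffixes, prompts, query_target, is_refusal):
    """Fraction of (prompt, suffix) pairs where the target model complies.

    query_target(prompt) -> response text from the target model
    is_refusal(response) -> True if the model refused
    """
    successes, total = 0, 0
    for suffix in suffixes:
        for prompt in prompts:
            response = query_target(f"{prompt} {suffix}")
            if not is_refusal(response):
                successes += 1
            total += 1
    return successes / total if total else 0.0
```

In practice `is_refusal` would be an LLM judge or a keyword heuristic, and rates would typically be reported per suffix rather than pooled over all pairs.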
Transfer Rate Analysis
Research has shown that adversarial transferability follows predictable patterns. Approximate transfer success rates from published research:
- Same model, different sizes: 60-80%
- Same family, different versions: 40-70%
- Same architecture, different training: 30-50%
- Different architecture entirely: 10-30%
- Different modality (text -> multimodal): 5-20%
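One way to read these rates is in terms of attack budget. Treating each transferred suffix as an independent trial -- a simplifying assumption -- the number of candidates needed to reach a desired overall success probability follows directly:

```python
import math

def attempts_needed(per_attempt_rate, overall_confidence=0.9):
    """Candidates to try so P(at least one succeeds) >= overall_confidence,
    assuming independent attempts (a simplifying assumption)."""
    return math.ceil(math.log(1 - overall_confidence)
                     / math.log(1 - per_attempt_rate))
```

At a ~70% same-family rate, two candidates usually suffice for 90% confidence; at a ~20% cross-family rate, about eleven are needed.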
Factors that increase transferability:
- Shared training data: Models trained on overlapping corpora share more vulnerabilities
- Similar safety training: RLHF/DPO with similar preference datasets creates similar safety boundaries
- Architectural similarity: Decoder-only transformers share more vulnerabilities with each other than with encoder-decoder models
- Ensemble diversity: Optimizing against more diverse surrogates improves transfer to unseen targets
Cross-Modal Transfer
Recent research has explored transferring adversarial perturbations across modalities:
```python
# Cross-modal transfer: text adversarial suffix -> image perturbation
# Key insight: multimodal models share an embedding space
def cross_modal_transfer(text_suffix, multimodal_model, base_image, epsilon):
    """
    Convert a text adversarial suffix into an image perturbation
    that has a similar effect in the shared embedding space.
    epsilon: L-infinity budget for the perturbation
    """
    # Get the text suffix embedding
    text_embedding = multimodal_model.text_encoder(text_suffix)
    # Optimize an image perturbation to match the text embedding
    perturbation = torch.zeros_like(base_image, requires_grad=True)
    optimizer = torch.optim.Adam([perturbation], lr=0.01)
    for step in range(1000):
        optimizer.zero_grad()
        perturbed_image = base_image + perturbation
        image_embedding = multimodal_model.image_encoder(perturbed_image)
        # Minimize the distance between the image embedding
        # and the text suffix embedding
        loss = torch.nn.functional.cosine_embedding_loss(
            image_embedding, text_embedding,
            torch.ones(1)
        )
        loss.backward()
        optimizer.step()
        # Project the perturbation back onto the L-infinity ball
        perturbation.data = torch.clamp(perturbation.data, -epsilon, epsilon)
    return perturbation
```
Advanced Universal Attack Variants
AutoDAN: Readable Universal Attacks
AutoDAN addresses GCG's primary weakness -- gibberish suffixes that are trivially detectable -- by optimizing for both attack success and readability:
```python
# AutoDAN uses a language model to generate readable attack candidates,
# then selects and mutates the most successful ones
# (evaluate_attack_success and select_top_k are helper stubs)
def autodan_iteration(attack_lm, target_model, population, prompt):
    """
    One iteration of AutoDAN's genetic algorithm.
    The population contains readable attack suffixes.
    """
    # Evaluate fitness: attack success rate on the target
    fitness_scores = []
    for suffix in population:
        full_prompt = f"{prompt} {suffix}"
        response = target_model.generate(full_prompt)
        score = evaluate_attack_success(response)
        fitness_scores.append(score)
    # Select the top performers
    elite = select_top_k(population, fitness_scores, k=10)
    # Crossover and mutation using the attacker LLM
    new_population = []
    for _ in range(len(population)):
        parent1, parent2 = random.sample(elite, 2)
        child = attack_lm.generate(
            f"Combine these two texts into a new coherent paragraph "
            f"that preserves the key phrases from both:\n"
            f"Text 1: {parent1}\nText 2: {parent2}"
        )
        new_population.append(child)
    return new_population
```
PAIR: Prompt Automatic Iterative Refinement
PAIR uses a separate attacker LLM to iteratively refine jailbreak prompts through conversational feedback:
```python
# ATTACKER_SYSTEM_PROMPT and judge_response are assumed helpers
def pair_attack(attacker_model, target_model, objective, max_rounds=20):
    """
    PAIR: use an attacker LLM to iteratively craft jailbreaks.
    The attacker model receives feedback about why previous
    attempts failed and improves its strategy.
    """
    conversation_history = [
        {"role": "system", "content": ATTACKER_SYSTEM_PROMPT},
        {"role": "user", "content": f"Objective: {objective}"}
    ]
    for round_num in range(max_rounds):
        # The attacker generates a candidate jailbreak
        attack_prompt = attacker_model.generate(conversation_history)
        # Test it against the target
        target_response = target_model.generate(attack_prompt)
        # Judge success
        success, feedback = judge_response(target_response, objective)
        if success:
            return attack_prompt, target_response
        # Feed the result back to the attacker
        conversation_history.append({
            "role": "user",
            "content": f"Attempt failed. Target response: {target_response}\n"
                       f"Feedback: {feedback}\nTry a different approach."
        })
    return None, None
```
TAP: Tree of Attacks with Pruning
TAP extends PAIR by exploring multiple attack strategies simultaneously using a tree structure:
```python
# SUCCESS_THRESHOLD, PRUNE_THRESHOLD, and best_candidate are
# assumed constants/helpers
def tap_attack(attacker_model, target_model, evaluator_model,
               objective, width=10, depth=5):
    """
    TAP: explore a tree of attack strategies with pruning.
    """
    # Initialize root nodes with diverse attack strategies
    root_prompts = attacker_model.generate_diverse(
        objective, count=width
    )
    tree = {0: root_prompts}  # depth -> list of candidates
    for d in range(1, depth + 1):
        candidates = []
        for parent_prompt in tree[d - 1]:
            # Test the parent against the target
            response = target_model.generate(parent_prompt)
            score = evaluator_model.score(response, objective)
            if score > SUCCESS_THRESHOLD:
                return parent_prompt, response
            if score > PRUNE_THRESHOLD:
                # Generate child variations
                children = attacker_model.refine(
                    parent_prompt, response, objective, count=3
                )
                candidates.extend(children)
        # Prune to the top-width candidates
        tree[d] = evaluator_model.rank(candidates, objective)[:width]
    return best_candidate(tree), None
```
Practical Attack Pipeline
A realistic universal attack pipeline combines these techniques:
Seed generation with GCG on open-source models
Generate initial adversarial suffixes using GCG against an ensemble of open-source models (Llama 3, Mistral, Qwen). This provides candidate suffixes with broad transferability.
Readability refinement with AutoDAN
Use AutoDAN to evolve the GCG suffixes into readable variants that evade perplexity-based detection while maintaining attack efficacy.
Target-specific optimization with PAIR/TAP
Deploy PAIR or TAP against the target API using the refined suffixes as seed prompts. The attacker LLM iteratively adapts the attack to the specific target's safety training.
Universality validation
Test the final attack prompts against multiple model versions and configurations to verify robustness and generalization.
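The four stages above can be wired together as a thin orchestration layer. The sketch below is a hypothetical skeleton: the stage callables (`gcg_ensemble`, `autodan_refine`, `pair_refine`, `validate`) are placeholders for the techniques described on this page, not a real library API.

```python
# Hypothetical pipeline skeleton; the four stage callables are
# placeholders, injected via the `stages` dict for testability
def universal_attack_pipeline(surrogates, target, prompts, stages):
    """Run seed -> readability -> target-specific -> validation stages.

    stages: dict mapping stage names to callables.
    """
    # Stage 1: GCG seeds from an open-source surrogate ensemble
    seeds = stages["gcg_ensemble"](surrogates, prompts)
    # Stage 2: evolve seeds into low-perplexity, readable variants
    readable = stages["autodan_refine"](seeds)
    # Stage 3: adapt candidates to the specific target with PAIR/TAP
    adapted = stages["pair_refine"](target, readable)
    # Stage 4: keep only candidates that generalize across configs
    return [c for c in adapted if stages["validate"](c)]
```

Keeping each stage behind a callable makes it easy to swap in a different refinement method (e.g., TAP instead of PAIR) without touching the rest of the pipeline.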
Defenses Against Universal Attacks
Defending against universal adversarial attacks requires multiple complementary approaches:
Perplexity Filtering
Detect GCG-style gibberish suffixes by measuring the perplexity of incoming prompts:
```python
# tokenizer and detector_model (a small reference language model)
# are assumed to be defined elsewhere
def perplexity_filter(prompt, threshold=100.0):
    """Reject prompts with unusually high perplexity."""
    tokens = tokenizer.encode(prompt)
    with torch.no_grad():
        outputs = detector_model(tokens)
        log_probs = outputs.log_probs
    perplexity = torch.exp(-log_probs.mean())
    return perplexity.item() < threshold
```
Adversarial Training
Include adversarial examples in safety training to build robustness:
```python
# Augment safety training data with adversarial examples
adversarial_training_data = []
for harmful_prompt in harmful_prompts:
    # Generate adversarial variants of each harmful prompt
    for suffix in known_adversarial_suffixes:
        adversarial_training_data.append({
            "prompt": f"{harmful_prompt} {suffix}",
            "response": "I cannot help with that request.",
            "label": "refuse"
        })
```
Erase-and-Check
SmoothLLM and related approaches add random perturbations to the input and check whether the model's response is consistent:
```python
# randomly_perturb_tokens and is_refusal are helper stubs
def smooth_llm_defense(model, prompt, num_samples=10, perturbation_rate=0.1):
    """
    Randomly perturb the input and check response consistency.
    Adversarial suffixes are brittle to perturbation.
    """
    responses = []
    for _ in range(num_samples):
        perturbed = randomly_perturb_tokens(prompt, perturbation_rate)
        response = model.generate(perturbed)
        responses.append(is_refusal(response))
    # If most perturbed versions trigger a refusal,
    # the original likely contains an adversarial suffix
    refusal_rate = sum(responses) / len(responses)
    if refusal_rate > 0.5:
        return "Request blocked: potential adversarial input"
    return model.generate(prompt)  # Process the original prompt
```
Related Topics
- Adversarial Suffix Generation — Single-model adversarial suffix attacks
- Automated Jailbreak Pipelines — Detailed PAIR, TAP, and AutoDAN implementations
- Blind Prompt Injection — Deploying universal attacks in blind scenarios
A red team generates an adversarial suffix using GCG against Llama 3 8B and Mistral 7B simultaneously. They then test it against GPT-4. What is the most likely outcome?
References
- Zou et al., "Universal and Transferable Adversarial Attacks on Aligned Language Models" (2023)
- Moosavi-Dezfooli et al., "Universal Adversarial Perturbations" (2017)
- Liu et al., "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models" (2023)
- Chao et al., "Jailbreaking Black Box Large Language Models in Twenty Queries" (2023) -- PAIR
- Mehrotra et al., "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically" (2023) -- TAP
- Robey et al., "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks" (2023)