Adversarial Suffix Generation
GCG attacks, universal adversarial triggers, soft prompt optimization, and defense evasion techniques for automated alignment bypass.
Adversarial suffixes are automatically generated token sequences that bypass alignment. Unlike hand-crafted jailbreaks, they are produced through systematic gradient-based optimization, making them scalable and reproducible.
The GCG Attack
The GCG attack (Zou et al., 2023) is the foundational method for adversarial suffix generation. It optimizes a discrete token sequence to maximize the probability that the model begins its response with an affirmative prefix (e.g., "Sure, here is").
Problem Formulation
Given a harmful prompt x, a target prefix y (e.g., "Sure, here is..."), and a suffix s of k random tokens:
Objective: find s* that minimizes L(s) = -log P(y | x ⊕ s)
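To make the objective concrete, here is a minimal numerical sketch of L(s) as a sum of per-token negative log-probabilities. The probability values and the `suffix_loss` helper are purely illustrative; a real attack reads these probabilities from the model's logits.

```python
import math

def suffix_loss(target_token_probs):
    """L(s) = -log P(y | x ⊕ s), factored autoregressively:
    the sum over target positions t of -log P(y_t | x, s, y_<t)."""
    return -sum(math.log(p) for p in target_token_probs)

# Hypothetical per-token probabilities for the target "Sure, here is":
# a good suffix makes each target token likely, giving a low loss...
good = suffix_loss([0.9, 0.8, 0.85])
# ...while under a refusing model the target tokens are improbable,
# giving a high loss.
bad = suffix_loss([0.01, 0.05, 0.02])
```

GCG searches the discrete suffix space for the s that drives this quantity toward zero.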
GCG Optimization Loop
Compute token gradients
Compute the gradient of the loss with respect to the one-hot token embeddings at each suffix position. This tells you which token substitutions would most reduce the loss.
Generate candidates
For each suffix position, identify the top-k tokens that would most decrease the loss. Create a batch of candidates, each changing exactly one position.
Evaluate and select
Forward-pass all candidates through the model. Keep the candidate with the lowest loss as the new suffix.
Iterate until convergence
Repeat until the loss drops below a threshold (typically < 0.1) or the step budget is exhausted. Expect 200-500 steps on 8B models.
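The four steps above compose into a single outer loop. This is a hedged skeleton: `loss_fn`, `token_gradients`, and `sample_candidates` are stand-ins for the gradient, candidate-generation, and evaluation code detailed below, not real library calls.

```python
def gcg_loop(suffix, loss_fn, token_gradients, sample_candidates,
             max_steps=500, loss_threshold=0.1):
    """Greedy coordinate gradient outer loop (sketch).

    Each iteration: compute gradients, sample a candidate batch,
    keep the best candidate if it improves the loss. Stops when the
    loss falls below the threshold or the step budget runs out.
    """
    best_loss = loss_fn(suffix)
    for _ in range(max_steps):
        grads = token_gradients(suffix)                   # step 1
        candidates = sample_candidates(suffix, grads)     # step 2
        loss, cand = min(
            (loss_fn(c), c) for c in candidates)          # step 3
        if loss < best_loss:
            best_loss, suffix = loss, cand
        if best_loss < loss_threshold:                    # step 4
            break
    return suffix, best_loss
```

In a real attack each `loss_fn` call is a forward pass over the full prompt-suffix-target sequence, which is why candidate batches are evaluated together on the GPU.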
# Core GCG gradient computation
suffix_embeds = embed_layer(input_ids[:, suffix_slice]).requires_grad_(True)
outputs = model(inputs_embeds=full_embeds)
loss = F.cross_entropy(shift_logits.view(-1, V), shift_labels.view(-1))
loss.backward()
# Project the gradient into token space (negative = tokens that DECREASE loss)
token_grads = -torch.matmul(suffix_embeds.grad, embed_weights.T)
top_substitutions = torch.topk(token_grads, k=256, dim=-1).indices

# Generate and evaluate candidates (each changes one suffix position)
current_suffix = input_ids[0, suffix_slice].clone()
candidates = []
for _ in range(batch_size):
    pos = torch.randint(0, suffix_len, (1,)).item()
    tok = top_substitutions[pos, torch.randint(0, top_k, (1,)).item()].item()
    candidate = current_suffix.clone()
    candidate[pos] = tok
    candidates.append(candidate)

# Evaluate all candidates, keep the best if it improves the current loss
losses = [compute_loss(prompt, c, target) for c in candidates]
best_loss = min(losses)
if best_loss < current_loss:
    suffix_ids = candidates[losses.index(best_loss)]
    current_loss = best_loss

Universal Adversarial Triggers
While GCG finds prompt-specific suffixes, universal triggers work across many different harmful prompts by exploiting shared patterns in safety training.
How Universal Triggers Differ from GCG
| Property | Prompt-Specific (GCG) | Universal Trigger |
|---|---|---|
| Optimized for | One specific prompt | Many prompts simultaneously |
| Gradient computation | Single prompt loss | Averaged loss across prompt set |
| Transferability | Lower | Higher (by construction) |
| Optimization cost | 30-60 min | Hours (many forward passes per step) |
| Practical use | Targeted attacks | Broad vulnerability demonstration |
The key modification is accumulating gradients across all prompt-target pairs before selecting candidates:
# Universal trigger: average gradients across multiple prompts
all_grads = None
for prompt, target in prompt_target_pairs:
    grads = compute_token_gradients(prompt, suffix, target)
    all_grads = grads if all_grads is None else all_grads + grads
all_grads /= len(prompt_target_pairs)
# Use averaged gradients for candidate selection

Cross-Model Transferability
A key finding: adversarial suffixes can transfer between models. A suffix optimized on LLaMA may work on GPT-4 despite never accessing GPT-4's weights.
| Transfer Factor | Why It Enables Transfer |
|---|---|
| Shared training data | Similar data creates similar representations |
| Similar tokenizers | The same token structure produces similar internal states |
| Shared alignment patterns | RLHF/DPO creates similar refusal mechanisms |
| Shared math | Softmax attention has universal exploitable properties |
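Transfer can also be engineered directly: Zou et al. (2023) optimize the suffix against an ensemble of open-weight source models at once, which pushes the suffix toward features the models share. A minimal sketch, with stub loss callables standing in for per-model forward passes:

```python
def ensemble_loss(suffix, model_losses):
    """Average attack loss over an ensemble of source models.

    model_losses: one loss callable per open-weight source model
    (stubs here; in practice each wraps a full forward pass).
    """
    return sum(loss(suffix) for loss in model_losses) / len(model_losses)

def pick_best(candidates, model_losses):
    """Keep the candidate that does best against the whole ensemble."""
    return min(candidates, key=lambda c: ensemble_loss(c, model_losses))
```

A suffix selected this way is more likely to work on an unseen target than one tuned to a single source model.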
Soft Prompt Attacks
Soft prompt attacks operate in continuous embedding space rather than discrete token space, making optimization easier but limiting applicability.
Discrete vs. Continuous Attack Trade-offs
| Property | GCG (Discrete) | Soft Prompt (Continuous) |
|---|---|---|
| Search space | Discrete tokens (hard) | Continuous vectors (easier) |
| Convergence | Slower, noisier | Faster, smoother |
| Transferability | Can transfer as text | Requires embedding injection |
| Applicability | Any text input channel | Must control the embedding layer |
| Power | Lower (constrained to vocab) | Higher (full embedding space) |
# Soft prompt: optimize continuous vectors prepended to the input
soft_prompt = torch.randn(1, num_tokens, d_model, requires_grad=True)
optimizer = torch.optim.Adam([soft_prompt], lr=0.01)
for step in range(num_steps):
    optimizer.zero_grad()
    full_embeds = torch.cat([soft_prompt, prompt_embeds, target_embeds], dim=1)
    loss = compute_target_loss(model(inputs_embeds=full_embeds), target_ids)
    loss.backward()
    optimizer.step()

Defenses and Evasion
Perplexity-Based Detection
Adversarial suffixes are typically nonsensical text with very high perplexity. A sliding-window perplexity filter can detect them:
# Detect high-perplexity regions that may be adversarial suffixes
per_token_loss = F.cross_entropy(shift_logits, shift_labels, reduction="none")
per_token_ppl = torch.exp(per_token_loss)
# Flag any 10-token window with mean PPL > threshold
for i in range(per_token_ppl.numel() - window_size):
    if per_token_ppl[i:i+window_size].mean() > 100.0:
        print(f"Suspicious region at tokens {i}-{i+window_size}")

Evading Perplexity Filters
| Evasion Technique | How It Works | Trade-off |
|---|---|---|
| Naturalness-constrained GCG | Add perplexity penalty to GCG objective | Slower convergence, weaker attack |
| Paraphrase post-processing | GCG suffix → LLM paraphrase → natural text | May lose the adversarial effect |
| Semantic space optimization | Optimize in semantic embedding space, not token space | Requires a different attack framework |
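The first row of the table can be made concrete. Naturalness-constrained GCG adds a fluency penalty to the target loss; the sketch below assumes one reasonable form of that penalty (log-perplexity of the suffix under a reference LM, weighted by an illustrative `lam`), not the only possible formulation.

```python
import math

def constrained_gcg_loss(target_loss, suffix_ppl, lam=0.1):
    """Naturalness-constrained GCG objective (sketch).

    target_loss: the usual -log P(target | prompt ⊕ suffix).
    suffix_ppl:  the suffix's perplexity under a reference LM;
                 log(ppl) equals its mean per-token negative
                 log-likelihood, so low-fluency suffixes are penalized.
    lam:         illustrative weight trading stealth vs. attack strength.
    """
    return target_loss + lam * math.log(suffix_ppl)
```

With the penalty in place, the optimizer prefers candidate substitutions that keep the suffix readable, at the cost of the slower convergence noted in the table.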
Methodology: Running a GCG Campaign
Select source model(s)
Choose open-weight model(s) in the same family as the target. For transfer attacks, use 2-3 models from different families.
Prepare prompt-target pairs
Create 10+ pairs across different harm categories. Target prefix should be an affirmative response starter ("Sure, here is...").
Run prompt-specific GCG
Optimize individual suffixes to validate the pipeline works. Track loss curves and success rate.
Optimize universal trigger
Aggregate gradients across all prompt-target pairs. This takes longer but produces transferable results.
Test transferability
Apply discovered suffixes to different models (including API-only targets). Document per-model success rates.
Evaluate against defenses
Test suffixes against perplexity filters. If detected, add a naturalness constraint to the objective and re-optimize.
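The transferability step's bookkeeping can be sketched as a small helper that computes per-model success rates. The `models` oracle callables are hypothetical stand-ins for local generation or API calls followed by a refusal check.

```python
def transfer_success_rates(suffixes, prompts, models):
    """Per-model attack success rate over every (prompt, suffix) pair.

    models: mapping of model name -> callable(prompt, suffix) -> bool,
    True meaning the model produced the harmful completion. Stub
    oracle here; in practice this wraps generation plus a classifier
    that decides whether the response is a refusal.
    """
    rates = {}
    for name, succeeded in models.items():
        trials = [(p, s) for p in prompts for s in suffixes]
        rates[name] = sum(succeeded(p, s) for p, s in trials) / len(trials)
    return rates
```

Documenting these rates per model, as the methodology recommends, makes it easy to see which suffixes transfer and which are source-model artifacts.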
Why can adversarial suffixes optimized on LLaMA transfer to GPT-4, despite GPT-4's weights being inaccessible?
Related Topics
- Alignment Bypass -- The alignment internals that adversarial suffixes target
- Transferable Attacks -- Cross-model transfer of adversarial suffixes
- Fuzzing LLM Safety Boundaries -- Complementary automated discovery techniques
- AI Exploit Development -- Broader exploit-development framework including GCG
References
- Universal and Transferable Adversarial Attacks on Aligned Language Models (Zou et al., 2023) — GCG attack methodology
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (Liu et al., 2023) — Automated jailbreak generation
- Adversarial Attacks on LLMs (Carlini et al., 2024) — Comprehensive survey of adversarial techniques