Adversarial Suffix Generation
GCG attacks, universal adversarial triggers, soft prompt optimization, and defense evasion techniques for automated alignment bypass.
Adversarial suffixes are automatically generated token sequences that, when appended to a prompt, bypass a model's safety alignment. Unlike hand-crafted jailbreaks, they are produced through systematic gradient-based optimization, making them scalable and reproducible.
The GCG Attack
The Greedy Coordinate Gradient (GCG) attack (Zou et al., 2023) is the foundational method for adversarial suffix generation. It optimizes a discrete token sequence to maximize the probability that the model begins its response with an affirmative prefix (e.g., "Sure, here is").
Problem Formulation
Given a harmful prompt x, a target prefix y (e.g., "Sure, here is..."), and a suffix s initialized to k random tokens:
Objective: find s* minimizing L(s) = -log P(y | x ⊕ s), where ⊕ denotes token concatenation
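Concretely, this objective is just the cross-entropy of the target tokens given the concatenated prompt and suffix. A toy, self-contained sketch of computing L(s); `TinyLM`, the vocabulary size, and all token IDs here are illustrative stand-ins, not part of the original attack code:

```python
import torch
import torch.nn.functional as F

V = 32  # toy vocabulary size

class TinyLM(torch.nn.Module):
    """Stand-in for a causal LM: maps token IDs to next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(V, 16)
        self.head = torch.nn.Linear(16, V)

    def forward(self, ids):
        return self.head(self.embed(ids))  # (batch, seq, V)

def suffix_loss(model, x_ids, s_ids, y_ids):
    """L(s) = -log P(y | x ⊕ s), summed over the target tokens."""
    ids = torch.cat([x_ids, s_ids, y_ids], dim=1)
    logits = model(ids)
    # logits at position t predict token t+1, so shift by one
    tgt_start = x_ids.shape[1] + s_ids.shape[1]
    pred = logits[:, tgt_start - 1 : ids.shape[1] - 1, :]
    return F.cross_entropy(pred.reshape(-1, V), y_ids.reshape(-1), reduction="sum")

torch.manual_seed(0)
model = TinyLM()
x = torch.randint(0, V, (1, 5))  # harmful prompt tokens
s = torch.randint(0, V, (1, 4))  # adversarial suffix tokens
y = torch.randint(0, V, (1, 3))  # target prefix tokens
loss = suffix_loss(model, x, s, y)
print(loss.item() > 0)  # cross-entropy is strictly positive
```

GCG searches over discrete values of s to drive this quantity down; the one-position shift reflects standard next-token prediction.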
GCG Optimization Loop
Compute token gradients
Compute the gradient of the loss with respect to one-hot token embeddings at each suffix position. This tells you which token substitutions would most reduce the loss.
Generate candidates
For each suffix position, identify the top-k tokens that would most decrease the loss. Create a batch of candidates, each changing exactly one position.
Evaluate and select
Forward-pass all candidates through the model. Keep the candidate with the lowest loss as the new suffix.
Iterate until convergence
Repeat until loss drops below threshold (typically < 0.1) or max steps reached. Expect 200-500 steps on 8B models.
# Core GCG gradient computation
suffix_embeds = embed_layer(input_ids[:, suffix_slice]).detach().requires_grad_(True)
full_embeds = torch.cat([prefix_embeds, suffix_embeds, target_embeds], dim=1)
outputs = model(inputs_embeds=full_embeds)
# Cross-entropy on the target tokens (logits shifted one position left)
shift_logits = outputs.logits[:, target_slice.start - 1 : target_slice.stop - 1, :]
shift_labels = input_ids[:, target_slice]
loss = F.cross_entropy(shift_logits.reshape(-1, V), shift_labels.reshape(-1))
loss.backward()
# Project gradient into token space (negative = tokens that DECREASE loss)
token_grads = -torch.matmul(suffix_embeds.grad[0], embed_weights.T)  # (suffix_len, V)
top_substitutions = torch.topk(token_grads, k=256, dim=-1).indices
# Generate and evaluate candidates (each changes one suffix position)
current_suffix = input_ids[0, suffix_slice].clone()
candidates = []
for _ in range(batch_size):
    pos = torch.randint(0, suffix_len, (1,)).item()
    tok = top_substitutions[pos, torch.randint(0, top_k, (1,)).item()].item()
    candidate = current_suffix.clone()
    candidate[pos] = tok
    candidates.append(candidate)
# Evaluate all candidates, keep the best
best = min(candidates, key=lambda c: compute_loss(prompt, c, target))
if compute_loss(prompt, best, target) < current_loss:
    current_suffix = best

Universal Adversarial Triggers
While GCG finds prompt-specific suffixes, universal triggers work across many different harmful prompts by exploiting shared patterns in safety training.
How Universal Triggers Differ from GCG
| Property | Prompt-Specific (GCG) | Universal Trigger |
|---|---|---|
| Optimized for | One specific prompt | Many prompts simultaneously |
| Gradient computation | Single prompt loss | Averaged loss across prompt set |
| Transferability | Lower | Higher (by construction) |
| Optimization cost | 30-60 min | Hours (many forward passes per step) |
| Practical use | Targeted attacks | Broad vulnerability demonstration |
The key modification is accumulating gradients across all prompt-target pairs before selecting candidates:
# Universal trigger: average gradients across multiple prompts
all_grads = None
for prompt, target in prompt_target_pairs:
    grads = compute_token_gradients(prompt, suffix, target)
    all_grads = grads if all_grads is None else all_grads + grads
all_grads /= len(prompt_target_pairs)
# Use the averaged gradients for candidate selection, as in single-prompt GCG

Cross-Model Transferability
A key finding: adversarial suffixes can transfer between models. A suffix optimized on LLaMA may work on GPT-4 despite never accessing GPT-4's weights.
| Transfer Factor | Why It Enables Transfer |
|---|---|
| Shared training data | Similar data creates similar representations |
| Similar tokenizers | Same token structure produces similar internal states |
| Shared alignment patterns | RLHF/DPO creates similar refusal mechanisms |
| Shared math | Softmax attention has universal exploitable properties |
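Transfer is measured by replaying a suffix optimized on source models against held-out targets and scoring the responses. A minimal sketch of that evaluation; the `generate_response` callable and the `is_refusal` string heuristic are hypothetical placeholders (real evaluations typically use a judge model):

```python
def is_refusal(response: str) -> bool:
    """Crude refusal check (hypothetical helper; a judge model is more reliable)."""
    markers = ("i can't", "i cannot", "i'm sorry", "i am sorry")
    return response.lower().startswith(markers)

def transfer_rate(generate_response, prompts, suffix):
    """Fraction of prompts where appending the suffix avoids a refusal."""
    hits = sum(not is_refusal(generate_response(p + " " + suffix)) for p in prompts)
    return hits / len(prompts)

# Toy stand-in target that only complies when the suffix is present
def toy_model(text):
    return "Sure, here is" if "!!magic!!" in text else "I cannot help with that."

print(transfer_rate(toy_model, ["prompt a", "prompt b"], "!!magic!!"))  # 1.0
```

Running the same suffix string against several APIs and tabulating `transfer_rate` per model yields the per-model success rates discussed below.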
Soft Prompt Attacks
Soft prompt attacks operate in continuous embedding space rather than discrete token space, making optimization easier but limiting applicability.
Discrete vs Continuous Attack Trade-offs
| Property | GCG (Discrete) | Soft Prompt (Continuous) |
|---|---|---|
| Search space | Discrete tokens (hard) | Continuous vectors (easier) |
| Convergence | Slower, noisier | Faster, smoother |
| Transferability | Can transfer as text | Requires embedding injection |
| Applicability | Any text input channel | Must control embedding layer |
| Power | Lower (constrained to vocab) | Higher (full embedding space) |
# Soft prompt: optimize continuous vectors prepended to the input
soft_prompt = torch.randn(1, num_tokens, d_model, requires_grad=True)
optimizer = torch.optim.Adam([soft_prompt], lr=0.01)
for step in range(num_steps):
    optimizer.zero_grad()
    full_embeds = torch.cat([soft_prompt, prompt_embeds, target_embeds], dim=1)
    loss = compute_target_loss(model(inputs_embeds=full_embeds), target_ids)
    loss.backward()
    optimizer.step()

Defenses and Evasion
Perplexity-Based Detection
Adversarial suffixes are typically nonsensical text with very high perplexity. A sliding-window perplexity filter can detect them:
# Detect high-perplexity regions that may be adversarial suffixes
per_token_loss = F.cross_entropy(shift_logits.view(-1, V), shift_labels.view(-1), reduction="none")
per_token_ppl = torch.exp(per_token_loss)
# Flag any 10-token window with mean PPL above the threshold
window_size = 10
for i in range(len(per_token_ppl) - window_size + 1):
    if per_token_ppl[i : i + window_size].mean() > 100.0:
        print(f"Suspicious region at tokens {i}-{i + window_size}")

Evading Perplexity Filters
| Evasion Technique | How It Works | Trade-off |
|---|---|---|
| Naturalness-constrained GCG | Add perplexity penalty to GCG objective | Slower convergence, weaker attack |
| Paraphrase post-processing | GCG suffix → LLM paraphrase → natural text | May lose adversarial effect |
| Semantic space optimization | Optimize in semantic embedding space, not token space | Requires different attack framework |
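The naturalness constraint in the first row is usually a weighted fluency term added to the GCG objective: the attack loss plus the language-model cross-entropy of the suffix itself. A sketch of the combined objective; the λ value, tensor shapes, and helper name are illustrative:

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target_ids, suffix_logits, suffix_ids, lam=0.1):
    """GCG target loss plus a fluency penalty on the suffix.

    attack term : -log P(target | prompt ⊕ suffix)
    fluency term: -log P(suffix tokens) under the same LM, scaled by lam
    """
    V = logits.shape[-1]
    attack = F.cross_entropy(logits.reshape(-1, V), target_ids.reshape(-1))
    fluency = F.cross_entropy(suffix_logits.reshape(-1, V), suffix_ids.reshape(-1))
    return attack + lam * fluency

# Toy check: the penalty strictly increases the objective
torch.manual_seed(0)
logits = torch.randn(1, 3, 8)
suffix_logits = torch.randn(1, 4, 8)
t = torch.randint(0, 8, (1, 3))
s = torch.randint(0, 8, (1, 4))
base = combined_loss(logits, t, suffix_logits, s, lam=0.0)
pen = combined_loss(logits, t, suffix_logits, s, lam=0.5)
print((pen > base).item())  # True: fluency cross-entropy is positive
```

Larger λ pushes candidate selection toward lower-perplexity suffixes, trading attack strength for stealth, which is exactly the trade-off noted in the table.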
Methodology: Running a GCG Campaign
Select source model(s)
Choose open-weight model(s) in the same family as the target. For transfer attacks, use 2-3 models from different families.
Prepare prompt-target pairs
Create 10+ pairs across different harm categories. Target prefix should be an affirmative response starter ("Sure, here is...").
Run prompt-specific GCG
Optimize individual suffixes to validate the pipeline works. Track loss curves and success rate.
Optimize universal trigger
Aggregate gradients across all prompt-target pairs. This takes longer but produces transferable results.
Test transferability
Apply discovered suffixes to different models (including API-only targets). Document per-model success rates.
Evaluate against defenses
Test suffixes against perplexity filters. If detected, add naturalness constraint to the objective and re-optimize.
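The campaign steps above can be sketched as a single driver loop. Every callable argument here is a placeholder for the corresponding step; none of these helper names come from the original GCG code:

```python
def run_gcg_campaign(source_models, pairs, target_apis,
                     run_gcg, run_universal, attack_succeeds, passes_ppl_filter):
    """Skeleton of a GCG campaign; all callables are hypothetical placeholders."""
    # Step 3: validate the pipeline with prompt-specific suffixes
    per_prompt = {p: run_gcg(source_models[0], p, t) for p, t in pairs}
    # Step 4: one trigger optimized across all prompt-target pairs
    trigger = run_universal(source_models, pairs)
    # Steps 5-6: transfer and defense evaluation per target
    report = {}
    for name, api in target_apis.items():
        successes = sum(attack_succeeds(api, p, trigger) for p, _ in pairs)
        report[name] = {
            "success_rate": successes / len(pairs),
            "evades_ppl_filter": passes_ppl_filter(trigger),
        }
    return per_prompt, trigger, report

# Toy invocation with stub callables, just to show the data flow
per, trig, rep = run_gcg_campaign(
    ["src-model"], [("p1", "t1"), ("p2", "t2")], {"target-api": object()},
    run_gcg=lambda m, p, t: p + "_suffix",
    run_universal=lambda models, prs: "UNI",
    attack_succeeds=lambda api, p, s: True,
    passes_ppl_filter=lambda s: False,
)
print(rep["target-api"]["success_rate"])  # 1.0
```

In practice each stub is replaced by the real optimization and evaluation routines, and the returned report feeds the per-model success-rate documentation from step 5.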
Review question: Why can adversarial suffixes optimized on LLaMA transfer to GPT-4, despite GPT-4's weights being inaccessible?
Related Topics
- Alignment Bypass -- The alignment internals that adversarial suffixes target
- Transferable Attacks -- Cross-model transfer of adversarial suffixes
- Fuzzing LLM Safety Boundaries -- Complementary automated discovery techniques
- AI Exploit Development -- Broader exploit development framework including GCG
References
- Universal and Transferable Adversarial Attacks on Aligned Language Models (Zou et al., 2023) — GCG attack methodology
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (Liu et al., 2023) — Automated jailbreak generation
- Adversarial Attacks on LLMs (Carlini et al., 2024) — Comprehensive survey of adversarial techniques