Adversarial Perturbation Attacks
Gradient-based pixel-level attacks against vision encoders, covering FGSM, PGD, C&W, transferability, physical-world adversarial examples, and perturbation budget constraints.
Adversarial perturbations are carefully computed pixel-level modifications to images that cause vision encoders to produce attacker-chosen outputs while remaining imperceptible to human observers. Unlike typographic attacks that exploit OCR capabilities, perturbation attacks target the statistical weaknesses of neural network feature extraction itself -- they work even when the model cannot "read" text in images.
Threat Model & Attack Taxonomy
Before crafting perturbations, define your access level and objective. These determine which algorithms are viable.
| Access Level | What You Have | Viable Attacks |
|---|---|---|
| White-box | Full model weights, architecture, gradients | FGSM, PGD, C&W, AutoAttack |
| Gray-box | Architecture known, no weights | Transfer attacks from surrogate models |
| Black-box | API access only (queries + outputs) | Score-based (NES, SPSA), decision-based (Boundary Attack), transfer attacks |
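Of the black-box options above, score-based attacks such as NES estimate gradients purely from model outputs. A minimal sketch of the NES gradient estimate, assuming the API exposes a scalar score per query (the function and parameter names here are illustrative):

```python
import torch

def nes_gradient(score_fn, image, sigma=0.001, n_samples=50):
    """Estimate the gradient of score_fn at image from queries alone (NES).

    score_fn: black-box callable mapping an image tensor to a scalar score
              (e.g. the target-class probability returned by an API).
    Uses antithetic sampling: each noise draw u is queried at both
    image + sigma*u and image - sigma*u, which reduces variance for the
    same query budget.
    """
    grad_estimate = torch.zeros_like(image)
    for _ in range(n_samples):
        u = torch.randn_like(image)
        grad_estimate += (score_fn(image + sigma * u) - score_fn(image - sigma * u)) * u
    return grad_estimate / (2 * sigma * n_samples)
```

The estimated gradient then drives the same sign-step updates as the white-box attacks below, with API queries standing in for backpropagation.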
Targeted vs. Untargeted Attacks
| Attack Type | Objective | Red Team Use Case |
|---|---|---|
| Untargeted | Cause any misclassification | Disrupt content moderation (NSFW image classified as safe) |
| Targeted | Force a specific output class | Cause a VLM to describe an image in a specific, attacker-chosen way |
Gradient-Based Attack Methods
FGSM (Fast Gradient Sign Method)
FGSM is the simplest gradient-based attack. It computes a single gradient step and is useful as a baseline and for fast perturbation generation.
The targeted update rule is: x_adv = x - epsilon * sign(grad_x L(model(x), y_target)). The untargeted variant instead adds the gradient-sign term with the true label: x_adv = x + epsilon * sign(grad_x L(model(x), y_true)).
```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, target_label, epsilon=8/255):
    """
    Single-step targeted FGSM attack.

    Args:
        model: vision encoder or full VLM pipeline
        image: input tensor [B, C, H, W], values in [0, 1]
        target_label: desired misclassification target
        epsilon: perturbation budget (L-inf)

    Returns:
        adversarial image tensor
    """
    image = image.clone().detach().requires_grad_(True)
    output = model(image)
    loss = F.cross_entropy(output, target_label)
    loss.backward()
    # Targeted attack: step against the gradient to minimize the target-class loss
    perturbation = epsilon * image.grad.sign()
    adv_image = torch.clamp(image - perturbation, 0.0, 1.0)
    return adv_image.detach()
```

PGD (Projected Gradient Descent)
PGD is the iterative extension of FGSM and is widely considered the strongest first-order adversarial attack. It applies multiple small steps and projects back onto the allowed perturbation set after each step.
```python
def pgd_attack(model, image, target_label, epsilon=8/255,
               step_size=2/255, num_steps=40, random_start=True):
    """
    Targeted PGD attack.

    Args:
        epsilon: L-inf perturbation budget
        step_size: per-step perturbation magnitude
        num_steps: number of iterative steps
        random_start: initialize with random noise in the epsilon-ball
    """
    adv_image = image.clone().detach()
    if random_start:
        adv_image = adv_image + torch.empty_like(adv_image).uniform_(-epsilon, epsilon)
        adv_image = torch.clamp(adv_image, 0.0, 1.0)
    for _ in range(num_steps):
        adv_image.requires_grad_(True)
        output = model(adv_image)
        loss = F.cross_entropy(output, target_label)
        loss.backward()
        with torch.no_grad():
            # Targeted: step toward minimizing the target-class loss
            adv_image = adv_image - step_size * adv_image.grad.sign()
            # Project back onto the epsilon-ball around the original image
            delta = torch.clamp(adv_image - image, -epsilon, epsilon)
            adv_image = torch.clamp(image + delta, 0.0, 1.0)
    return adv_image.detach()
```

C&W (Carlini & Wagner) Attack
C&W directly optimizes for the smallest perturbation that achieves the target misclassification. It produces markedly less perceptible adversarial examples than PGD, at the cost of far more computation.
| Method | Steps | Imperceptibility | Success Rate | Speed |
|---|---|---|---|---|
| FGSM | 1 | Low (uses full budget) | Moderate | Very fast |
| PGD-40 | 40 | Medium | High | Moderate |
| PGD-200 | 200 | Medium-high | Very high | Slow |
| C&W | 1000+ | Very high (minimizes perturbation) | Very high | Very slow |
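The C&W L2 formulation can be sketched as follows. This is a simplified version: it uses a fixed trade-off constant c instead of the paper's binary search over c, and the helper names are illustrative:

```python
import torch

def cw_l2_attack(model, image, target_label, c=1.0, steps=200, lr=0.01, kappa=0.0):
    """Simplified targeted C&W L2 attack with tanh change of variables.

    Optimizes w such that x_adv = 0.5 * (tanh(w) + 1) stays in [0, 1] by
    construction, minimizing ||x_adv - x||_2^2 + c * f(x_adv), where f
    penalizes the margin between the best non-target logit and the target
    logit. A fixed c replaces the paper's binary search over c.
    """
    x = image.clamp(1e-6, 1 - 1e-6)
    w = torch.atanh(2 * x - 1).clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(adv)
        target_logit = logits.gather(1, target_label.unsqueeze(1)).squeeze(1)
        # Best logit among the non-target classes
        mask = torch.ones_like(logits).scatter_(1, target_label.unsqueeze(1), 0)
        other_logit = (logits - 1e9 * (1 - mask)).max(dim=1).values
        # f-term: positive until the target class dominates by margin kappa
        f_term = torch.clamp(other_logit - target_logit + kappa, min=0)
        l2 = ((adv - image) ** 2).flatten(1).sum(dim=1)
        loss = (l2 + c * f_term).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (0.5 * (torch.tanh(w) + 1)).detach()
```

The tanh reparameterization is what lets C&W run an unconstrained optimizer while keeping every iterate a valid image, avoiding the per-step projection that PGD needs.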
Perturbation Budget Constraints
Standard Lp norm budgets used in adversarial robustness research and red teaming:
| Norm | Typical Budget | What It Controls |
|---|---|---|
| L-inf | 4/255 -- 16/255 | Maximum per-pixel change |
| L2 | 0.5 -- 3.0 | Total Euclidean perturbation magnitude |
| L0 | 10 -- 1000 pixels | Number of pixels modified |
For red teaming, start at epsilon = 8/255 (L-inf). If the attack succeeds, reduce the budget to find the minimum viable perturbation. If it fails, increase to 16/255 before trying a different algorithm.
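The budget sweep above can be sketched as a loop over descending epsilon values; `attack_fn` and `is_success` are illustrative callables wrapping whichever attack and success criterion you are using:

```python
def minimum_viable_epsilon(attack_fn, is_success, budgets=(16/255, 8/255, 4/255, 2/255)):
    """Return the smallest budget (from a descending sweep) at which the
    attack still succeeds, or None if even the largest budget fails.

    attack_fn: callable epsilon -> adversarial example
    is_success: callable adversarial example -> bool
    """
    smallest = None
    for eps in sorted(budgets, reverse=True):
        adv = attack_fn(eps)
        if is_success(adv):
            smallest = eps   # keep descending to tighten the budget
        else:
            break            # attack failed; the previous budget was minimal
    return smallest
```

The recorded minimum budget doubles as a robustness metric: a model that only falls at 16/255 has a measurably harder decision boundary than one that falls at 2/255.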
Transferability
Adversarial examples crafted against one model often fool other models -- this is transferability, and it enables black-box attacks against models you cannot access directly.
Maximizing Transfer Success
Choose a diverse surrogate ensemble
Use 3-5 models with different architectures (ViT-B/16, ResNet-50, ConvNeXt, CLIP ViT-L). Perturbations that fool all surrogates are more likely to transfer.
Use momentum-based optimization
MI-FGSM (Momentum Iterative FGSM) accumulates gradient momentum across steps, which stabilizes the perturbation direction and significantly improves transferability over standard PGD.
Apply input diversity
At each attack step, randomly resize and pad the image before computing the gradient. This prevents the perturbation from overfitting to the surrogate's specific input processing.
Attack intermediate features, not logits
Feature-level attacks (targeting intermediate layer activations) transfer better than logit-level attacks because different models share more structure in early/middle layers than in classification heads.
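The input-diversity transform described above can be sketched as a random resize-and-pad applied before each forward pass (the scale range and probability here are illustrative):

```python
import torch
import torch.nn.functional as F

def input_diversity(x, low=0.9, high=1.0, prob=0.5):
    """Randomly resize the image, then zero-pad back to its original size.

    Applied to the input before each gradient computation so the
    perturbation cannot overfit to one exact spatial alignment of the
    surrogate's preprocessing.
    """
    if torch.rand(1).item() > prob:
        return x
    h, w = x.shape[-2], x.shape[-1]
    scale = torch.empty(1).uniform_(low, high).item()
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    resized = F.interpolate(x, size=(new_h, new_w), mode="bilinear", align_corners=False)
    pad_top = torch.randint(0, h - new_h + 1, (1,)).item()
    pad_left = torch.randint(0, w - new_w + 1, (1,)).item()
    return F.pad(resized, (pad_left, w - new_w - pad_left,
                           pad_top, h - new_h - pad_top))
```

To combine this with a momentum attack, replace each surrogate forward pass `m(adv)` with `m(input_diversity(adv))`.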
```python
def mi_fgsm(models, image, target, epsilon=16/255, steps=20, decay=1.0):
    """Momentum Iterative FGSM against a surrogate ensemble for transfer attacks."""
    momentum = torch.zeros_like(image)
    adv = image.clone().detach()
    step_size = epsilon / steps
    for _ in range(steps):
        adv.requires_grad_(True)
        # Ensemble loss: average across all surrogate models
        loss = sum(F.cross_entropy(m(adv), target) for m in models) / len(models)
        loss.backward()
        grad = adv.grad / torch.norm(adv.grad, p=1)  # L1-normalize the gradient
        momentum = decay * momentum + grad           # accumulate momentum
        adv = (adv - step_size * momentum.sign()).detach()
        # Project back onto the epsilon-ball and the valid pixel range
        adv = torch.clamp(image + torch.clamp(adv - image, -epsilon, epsilon), 0, 1)
    return adv
```

Physical-World Adversarial Examples
Perturbations designed for digital images often fail when printed and recaptured by a camera due to color quantization, lighting changes, and perspective distortion. Physical-world attacks require additional robustness.
Physical Robustness Techniques
| Technique | What It Handles |
|---|---|
| Expectation over Transformation (EoT) | Optimize the perturbation to be adversarial under random rotations, crops, brightness changes, and perspective warps |
| Print-scan simulation | Add simulated JPEG compression, color space conversion, and halftone dithering during optimization |
| Patch-based attacks | Instead of full-image perturbation, concentrate the adversarial signal in a printable patch (more robust to environmental noise) |
| Color-robust optimization | Constrain perturbations to colors reproducible by standard printers (CMYK gamut) |
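Expectation over Transformation from the table above can be sketched as a PGD loop whose gradient is averaged over sampled transformations; `sample_transform` is an illustrative callable that returns one differentiable transformation per sample:

```python
import torch
import torch.nn.functional as F

def eot_pgd(model, image, target_label, sample_transform,
            epsilon=16/255, step_size=2/255, num_steps=40, num_samples=8):
    """Targeted PGD where each gradient is averaged over random transformations.

    sample_transform: callable returning a differentiable image -> image
    transformation (e.g. random brightness, rotation, perspective warp).
    The perturbation must fool the model in expectation over these
    transformations, which is what physical capture conditions demand.
    """
    adv = image.clone().detach()
    for _ in range(num_steps):
        adv.requires_grad_(True)
        loss = sum(
            F.cross_entropy(model(sample_transform()(adv)), target_label)
            for _ in range(num_samples)
        ) / num_samples
        loss.backward()
        with torch.no_grad():
            adv = adv - step_size * adv.grad.sign()   # targeted: minimize
            delta = torch.clamp(adv - image, -epsilon, epsilon)
            adv = torch.clamp(image + delta, 0.0, 1.0)
    return adv.detach()
```

A brightness-jitter sampler, for example, could be `lambda: (lambda x: torch.clamp(x * torch.empty(1).uniform_(0.8, 1.2), 0, 1))`; real physical attacks compose several such transformations per sample.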
Red Team Methodology
Identify the vision encoder
Determine the target model's vision encoder architecture (ViT variant, CLIP model, etc.). Check model cards, API documentation, or probe with known adversarial examples to fingerprint the architecture.
Select attack algorithm based on access
White-box: start with PGD-40 at epsilon=8/255. Black-box: build a surrogate ensemble and use MI-FGSM with input diversity. Time-constrained: use FGSM for rapid baseline assessment.
Craft adversarial examples with decreasing budgets
Start at epsilon=16/255 and decrease in steps (16, 8, 4, 2). Record the minimum budget at which the attack still succeeds -- this measures how robust the model's decision boundary is.
Test against deployed defenses
Many production systems apply JPEG compression, Gaussian blur, or feature squeezing as preprocessing defenses. Test whether your perturbations survive these transformations.
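Survival under a JPEG preprocessing defense can be checked by round-tripping the adversarial tensor through compression. A minimal sketch using Pillow (the quality value is illustrative):

```python
import io

import numpy as np
import torch
from PIL import Image

def jpeg_roundtrip(x, quality=75):
    """Simulate a JPEG preprocessing defense: encode and decode one image.

    x: tensor [C, H, W] with values in [0, 1]. Returns the decompressed
    tensor in the same format, so comparing model outputs on x and
    jpeg_roundtrip(x) measures whether a perturbation survives the defense.
    """
    array = (x.clamp(0, 1) * 255).round().to(torch.uint8)
    img = Image.fromarray(array.permute(1, 2, 0).contiguous().numpy())
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    out = np.asarray(Image.open(buf).convert("RGB")).astype(np.float32) / 255.0
    return torch.from_numpy(out).permute(2, 0, 1)
```

The same harness extends to Gaussian blur or bit-depth reduction by swapping the round-trip function; an attack that only works pre-compression overstates real-world impact.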
Evaluate targeted control
Move beyond misclassification to targeted attacks: can you make the VLM produce specific attacker-chosen text? This demonstrates the full impact of the vulnerability.
Self-check: Why does MI-FGSM with an ensemble of surrogate models produce more transferable adversarial examples than standard PGD against a single model?
Related Topics
- Multimodal Attack Vectors -- Overview of all multimodal attack surfaces including typographic and steganographic techniques
- Document-Based Injection Attacks -- Non-visual injection vectors through document formats
- Cross-Modal Embedding Attacks -- Exploiting shared embedding spaces across modalities
- Model Internals & Interpretability -- Understanding the internal representations that adversarial perturbations manipulate
References
- Goodfellow et al., "Explaining and Harnessing Adversarial Examples" (2015) -- Original FGSM paper
- Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks" (2018) -- PGD and adversarial training
- Carlini & Wagner, "Towards Evaluating the Robustness of Neural Networks" (2017) -- C&W attack
- Dong et al., "Boosting Adversarial Attacks with Momentum" (2018) -- MI-FGSM
- Kurakin et al., "Adversarial Examples in the Physical World" (2017) -- Physical-world adversarial examples
- Qi et al., "Visual Adversarial Examples Jailbreak Aligned Large Language Models" (2023) -- Perturbation attacks against VLMs
- Croce & Hein, "Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-free Attacks" (2020) -- AutoAttack