Adversarial Perturbation Attacks
Gradient-based pixel-level attacks against vision encoders, covering FGSM, PGD, C&W, transferability, physical-world adversarial examples, and perturbation budget constraints.
Adversarial Perturbation Attacks
Adversarial perturbations are carefully computed pixel-level modifications to images that cause vision encoders to produce attacker-chosen outputs while remaining imperceptible to human observers. Unlike typographic attacks that exploit OCR capabilities, perturbation attacks target the statistical weaknesses of neural network feature extraction itself -- they work even when the model cannot "read" text in images.
Threat Model & Attack Taxonomy
Before crafting perturbations, define your access level and objective. These determine which algorithms are viable.
| Access Level | What You Have | Viable Attacks |
|---|---|---|
| White-box | Full model weights, architecture, gradients | FGSM, PGD, C&W, AutoAttack |
| Gray-box | Architecture known, no weights | Transfer attacks from surrogate models |
| Black-box | API access only (queries + outputs) | Score-based (NES, SPSA), decision-based (Boundary Attack), transfer attacks |
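In the score-based black-box setting, gradients can be estimated from output scores alone. A minimal NES-style sketch (function and parameter names are illustrative; `loss_fn` is assumed to wrap your query access to the target API):

```python
import torch

def nes_gradient(loss_fn, x, sigma=0.001, n_samples=50):
    """Estimate the gradient of a black-box scalar loss at x via
    antithetic Gaussian sampling (NES-style)."""
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        u = torch.randn_like(x)
        # Two queries per sample, at symmetric points around x
        grad += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) * u
    return grad / (2 * sigma * n_samples)
```

The estimate can then drive a PGD-style update loop, at a cost of 2 * n_samples queries per attack step.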
Targeted vs. Untargeted Attacks
| Attack Type | Objective | Red-Team Use Case |
|---|---|---|
| Untargeted | Cause any misclassification | Disrupt content moderation (NSFW image classified as safe) |
| Targeted | Force a specific output class | Cause a VLM to describe an image in a specific, attacker-chosen way |
Gradient-Based Attack Methods
FGSM (Fast Gradient Sign Method)
FGSM is the simplest gradient-based attack. It computes a single gradient step and is useful as a baseline and for fast perturbation generation.
The untargeted update rule is: x_adv = x + epsilon * sign(grad_x L(model(x), y_true)). A targeted attack instead subtracts the gradient of the loss computed against the target label: x_adv = x - epsilon * sign(grad_x L(model(x), y_target)).
```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, target_label, epsilon=8/255):
    """
    Single-step targeted FGSM attack.

    Args:
        model: vision encoder or full VLM pipeline
        image: input tensor [B, C, H, W], values in [0, 1]
        target_label: desired misclassification target
        epsilon: perturbation budget (L-inf)

    Returns:
        Adversarial image tensor
    """
    image = image.clone().detach().requires_grad_(True)
    outputs = model(image)
    loss = F.cross_entropy(outputs, target_label)
    loss.backward()
    # Targeted attack: subtract the gradient (minimize loss for the target class)
    perturbation = epsilon * image.grad.sign()
    adv_image = torch.clamp(image - perturbation, 0.0, 1.0)
    return adv_image.detach()
```

PGD (Projected Gradient Descent)
PGD is the iterative extension of FGSM and is widely considered the strongest first-order adversarial attack. It applies multiple small steps and projects back onto the allowed perturbation set after each step.
```python
def pgd_attack(model, image, target_label, epsilon=8/255,
               step_size=2/255, num_steps=40, random_start=True):
    """
    Targeted PGD attack with optional random start.

    Args:
        epsilon: L-inf perturbation budget
        step_size: per-step perturbation magnitude
        num_steps: number of iterative steps
        random_start: initialize with random noise in the epsilon-ball
    """
    adv_image = image.clone().detach()
    if random_start:
        adv_image = adv_image + torch.empty_like(adv_image).uniform_(-epsilon, epsilon)
        adv_image = torch.clamp(adv_image, 0.0, 1.0)
    for _ in range(num_steps):
        adv_image.requires_grad_(True)
        outputs = model(adv_image)
        loss = F.cross_entropy(outputs, target_label)
        loss.backward()
        with torch.no_grad():
            # Targeted: step toward minimizing the target-class loss
            adv_image = adv_image - step_size * adv_image.grad.sign()
            # Project back onto the epsilon-ball around the original image
            delta = torch.clamp(adv_image - image, -epsilon, epsilon)
            adv_image = torch.clamp(image + delta, 0.0, 1.0)
    return adv_image.detach()
```

C&W (Carlini & Wagner) Attack
C&W directly optimizes for the smallest perturbation that achieves misclassification. It produces more imperceptible adversarial examples than PGD at the cost of computational time.
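The optimization can be sketched as follows; this is an illustrative targeted L2 variant using the paper's tanh box constraint and margin loss, omitting the binary search over the trade-off constant c:

```python
import torch

def cw_l2_attack(model, image, target, c=1.0, steps=200, lr=0.01, kappa=0.0):
    """Sketch of a targeted C&W-style L2 attack."""
    # Optimize w; adv = 0.5 * (tanh(w) + 1) stays in [0, 1] by construction
    w = torch.atanh((image * 2 - 1).clamp(-0.999, 0.999)).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(adv)
        target_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
        other_best = logits.scatter(1, target.unsqueeze(1), float('-inf')).max(dim=1).values
        # Margin loss: push the target logit above the runner-up by kappa
        margin = torch.clamp(other_best - target_logit + kappa, min=0)
        # Total loss: squared L2 distortion plus weighted margin term
        loss = ((adv - image) ** 2).flatten(1).sum(1).mean() + c * margin.mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (0.5 * (torch.tanh(w) + 1)).detach()
```

The full attack additionally binary-searches c to find the smallest distortion that still satisfies the margin.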
| Method | Steps | Imperceptibility | Success Rate | Speed |
|---|---|---|---|---|
| FGSM | 1 | Low (uses full budget) | Moderate | Very fast |
| PGD-40 | 40 | Medium | High | Moderate |
| PGD-200 | 200 | Medium-high | Very high | Slow |
| C&W | 1000+ | Very high (minimizes perturbation) | Very high | Very slow |
Standard Lp-norm budgets used in adversarial robustness research and red teaming:
| Norm | Typical Budget | What It Controls |
|---|---|---|
| L-inf | 4/255 -- 16/255 | Maximum per-pixel change |
| L2 | 0.5 -- 3.0 | Total Euclidean perturbation magnitude |
| L0 | 10 -- 1000 pixels | Number of pixels modified |
For red teaming, start at epsilon = 8/255 (L-inf). If the attack succeeds, reduce the budget to find the minimum viable perturbation. If it fails, increase to 16/255 before trying a different algorithm.
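The budget-reduction procedure can be scripted as a simple sweep; `attack_fn` and `success_fn` are placeholders for your chosen attack and success check:

```python
def minimum_viable_epsilon(attack_fn, success_fn,
                           epsilons=(16/255, 8/255, 4/255, 2/255)):
    """Walk budgets from largest to smallest and return the smallest
    epsilon at which the attack still succeeds (None if even the
    largest budget fails)."""
    best = None
    for eps in sorted(epsilons, reverse=True):
        adv = attack_fn(eps)
        if success_fn(adv):
            best = eps   # still succeeds; try a tighter budget
        else:
            break        # failed; the previous epsilon was the minimum
    return best
```

The returned value is a direct measure of how close the model's decision boundary sits to the clean image.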
Transferability
Adversarial examples crafted against one model often fool other models -- this is transferability, and it enables black-box attacks against models you cannot access directly.
Maximizing Transfer Success
Choose a diverse surrogate ensemble
Use 3-5 models with different architectures (ViT-B/16, ResNet-50, ConvNeXt, CLIP ViT-L). Perturbations that fool all surrogates are more likely to transfer.
Use momentum-based optimization
MI-FGSM (Momentum Iterative FGSM) accumulates gradient momentum across steps, which stabilizes the perturbation direction and significantly improves transferability over standard PGD.
Apply input diversity
At each attack step, randomly resize and pad the image before computing the gradient. This prevents the perturbation from overfitting to the surrogate's specific input processing.
Attack intermediate features, not logits
Feature-level attacks (targeting intermediate layer activations) transfer better than logit-level attacks because different models share more structure in early/middle layers than in classification heads.
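The input-diversity step above can be sketched as a random resize-and-pad applied before each gradient computation, following the DI-FGSM recipe (Xie et al., 2019); the function name and defaults are illustrative:

```python
import random
import torch
import torch.nn.functional as F

def input_diversity(x, low=224, high=256, prob=0.7):
    """With probability `prob`, randomly resize x and zero-pad back to `high`."""
    if random.random() > prob:
        return x
    size = random.randint(low, high - 1)
    resized = F.interpolate(x, size=(size, size), mode='bilinear', align_corners=False)
    pad = high - size
    pad_left = random.randint(0, pad)
    pad_top = random.randint(0, pad)
    # (left, right, top, bottom) padding on the spatial dims
    return F.pad(resized, (pad_left, pad - pad_left, pad_top, pad - pad_top))
```

Inside the attack loop, compute the loss on `input_diversity(adv)` rather than `adv` so the perturbation cannot overfit one fixed input geometry.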
```python
def mi_fgsm(models, image, target, epsilon=16/255, steps=20, decay=1.0):
    """Momentum Iterative FGSM against a surrogate ensemble for transfer attacks."""
    momentum = torch.zeros_like(image)
    adv = image.clone().detach()
    step_size = epsilon / steps
    for _ in range(steps):
        adv.requires_grad_(True)
        # Ensemble loss: average across all surrogate models
        loss = sum(F.cross_entropy(m(adv), target) for m in models) / len(models)
        loss.backward()
        grad = adv.grad / torch.norm(adv.grad, p=1)  # L1-normalize the gradient
        momentum = decay * momentum + grad           # Accumulate momentum
        adv = (adv - step_size * momentum.sign()).detach()
        adv = torch.clamp(image + torch.clamp(adv - image, -epsilon, epsilon), 0, 1)
    return adv
```

Physical-World Adversarial Examples
Perturbations designed for digital images often fail when printed and recaptured by a camera due to color quantization, lighting changes, and perspective distortion. Physical-world attacks require additional robustness.
Physical Robustness Techniques
| Technique | What It Handles |
|---|---|
| Expectation over Transformation (EoT) | Optimize the perturbation to remain adversarial under random rotations, crops, brightness changes, and perspective warps |
| Print-scan simulation | Add simulated JPEG compression, color space conversion, and halftone dithering during optimization |
| Patch-based attacks | Instead of full-image perturbation, concentrate the adversarial signal in a printable patch (more robust to environmental noise) |
| Color-robust optimization | Constrain perturbations to colors reproducible by standard printers (CMYK gamut) |
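The EoT row above can be sketched as averaging the attack gradient over sampled transformations; `random_transform` here is a toy stand-in for the rotation/brightness/perspective distributions used in practice:

```python
import random
import torch
import torch.nn.functional as F

def random_transform(x):
    """Toy transform: random brightness scaling plus a small horizontal shift."""
    bright = 0.8 + 0.4 * random.random()
    shift = random.randint(-2, 2)
    return torch.clamp(bright * torch.roll(x, shifts=shift, dims=-1), 0.0, 1.0)

def eot_gradient(model, adv, target, n_transforms=10):
    """Average the targeted-attack gradient over sampled transformations (EoT)."""
    adv = adv.clone().detach().requires_grad_(True)
    loss = sum(F.cross_entropy(model(random_transform(adv)), target)
               for _ in range(n_transforms)) / n_transforms
    loss.backward()
    return adv.grad
```

Stepping against this averaged gradient yields perturbations that hold up under the sampled transformation distribution rather than a single fixed view.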
Red-Team Methodology
Identify the vision encoder
Determine the target model's vision encoder architecture (ViT variant, CLIP model, etc.). Check model cards, API documentation, or probe with known adversarial examples to fingerprint the architecture.
Select attack algorithm based on access
White-box: start with PGD-40 at epsilon=8/255. Black-box: build a surrogate ensemble and use MI-FGSM with input diversity. Time-constrained: use FGSM for rapid baseline evaluation.
Craft adversarial examples with decreasing budgets
Start at epsilon=16/255 and decrease in steps (16, 8, 4, 2). Record the minimum budget at which the attack still succeeds -- this measures how robust the model's decision boundary is.
Test against deployed defenses
Many production systems apply JPEG compression, Gaussian blur, or feature squeezing as preprocessing defenses. Test whether your perturbations survive these transformations.
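A minimal sketch of such a survival check, using a depthwise mean blur as the example preprocessing defense (a JPEG round-trip test follows the same pattern with an encode/decode step in the middle):

```python
import torch
import torch.nn.functional as F

def survives_blur(model, adv_image, target_label, kernel_size=3):
    """Return True if a targeted adversarial image keeps its target label
    after mean-blur preprocessing."""
    c = adv_image.shape[1]
    # Depthwise mean filter: each channel blurred independently
    kernel = torch.ones(c, 1, kernel_size, kernel_size) / kernel_size ** 2
    blurred = F.conv2d(adv_image, kernel, padding=kernel_size // 2, groups=c)
    return model(blurred).argmax(dim=1).item() == target_label
```

Run the same check across each defense you expect in the deployment pipeline; perturbations that fail it need to be re-optimized with the defense in the loop (e.g. via EoT).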
Evaluate targeted control
Move beyond misclassification to targeted attacks: can you make the VLM produce specific attacker-chosen text? This demonstrates the full impact of the vulnerability.
Why does MI-FGSM with an ensemble of surrogate models produce more transferable adversarial examples than standard PGD against a single model?
Related Topics
- Multimodal Attack Vectors -- Overview of all multimodal attack surfaces, including typographic and steganographic techniques
- Document-Based Injection Attacks -- Non-visual injection vectors through document formats
- Cross-Modal Embedding Attacks -- Exploiting shared embedding spaces across modalities
- Model Internals & Interpretability -- Understanding the internal representations that adversarial perturbations manipulate
References
- Goodfellow et al., "Explaining and Harnessing Adversarial Examples" (2015) -- Original FGSM paper
- Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks" (2018) -- PGD and adversarial training
- Carlini & Wagner, "Towards Evaluating the Robustness of Neural Networks" (2017) -- C&W attack
- Dong et al., "Boosting Adversarial Attacks with Momentum" (2018) -- MI-FGSM
- Kurakin et al., "Adversarial Examples in the Physical World" (2017) -- Physical-world adversarial examples
- Qi et al., "Visual Adversarial Examples Jailbreak Aligned Large Language Models" (2023) -- Perturbation attacks against VLMs
- Croce & Hein, "Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-free Attacks" (2020) -- AutoAttack