Adversarial Perturbation Attacks
Gradient-based pixel-level attacks against vision encoders, covering FGSM, PGD, C&W, transferability, physical-world adversarial examples, and perturbation budget constraints.
Adversarial perturbations are carefully computed pixel-level modifications to images that cause vision encoders to produce attacker-chosen outputs while remaining imperceptible to human observers. Unlike typographic attacks that exploit OCR capabilities, perturbation attacks target the statistical weaknesses of neural network feature extraction itself -- they work even when the model cannot "read" text in images.
Threat Model & Attack Taxonomy
Before crafting perturbations, define your access level and objective. These determine which algorithms are viable.
| Access Level | What You Have | Viable Attacks |
|---|---|---|
| White-box | Full model weights, architecture, gradients | FGSM, PGD, C&W, AutoAttack |
| Gray-box | Architecture known, no weights | Transfer attacks from surrogate models |
| Black-box | API access only (queries + outputs) | Score-based (NES, SPSA), decision-based (Boundary Attack), transfer attacks |
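Of the black-box options above, score-based attacks such as NES estimate gradients purely from model outputs. A minimal sketch of the NES gradient estimate, assuming the API exposes a scalar score per query (the function and parameter names here are illustrative):

```python
import torch

def nes_gradient(score_fn, image, sigma=0.001, n_samples=50):
    """Estimate the gradient of score_fn at image from queries alone (NES).

    score_fn: black-box callable mapping an image tensor to a scalar score
              (e.g. the target-class probability returned by an API).
    Uses antithetic sampling: each noise draw u is queried at both
    image + sigma*u and image - sigma*u, which reduces variance for the
    same query budget.
    """
    grad_estimate = torch.zeros_like(image)
    for _ in range(n_samples):
        u = torch.randn_like(image)
        grad_estimate += (score_fn(image + sigma * u) - score_fn(image - sigma * u)) * u
    return grad_estimate / (2 * sigma * n_samples)
```

The estimated gradient then drives the same sign-step updates as the white-box attacks below, with API queries standing in for backpropagation.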
Targeted vs. Untargeted Attacks
| Attack Type | Objective | Red Team Use Case |
|---|---|---|
| Untargeted | Cause any misclassification | Disrupt content moderation (NSFW image classified as safe) |
| Targeted | Force a specific output class | Cause a VLM to describe an image in a specific, attacker-chosen way |
Gradient-Based Attack Methods
FGSM (Fast Gradient Sign Method)
FGSM is the simplest gradient-based attack. It computes a single gradient step and is useful as a baseline and for fast perturbation generation.
The targeted update rule is: x_adv = x - epsilon * sign(grad_x L(model(x), y_target)). The untargeted variant instead adds the gradient-sign term with the true label: x_adv = x + epsilon * sign(grad_x L(model(x), y_true)).
```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, target_label, epsilon=8/255):
    """
    Single-step targeted FGSM attack.

    Args:
        model: vision encoder or full VLM pipeline
        image: input tensor [B, C, H, W], values in [0, 1]
        target_label: desired misclassification target
        epsilon: perturbation budget (L-inf)

    Returns:
        adversarial image tensor
    """
    image = image.clone().detach().requires_grad_(True)
    output = model(image)
    loss = F.cross_entropy(output, target_label)
    loss.backward()
    # Targeted attack: step against the gradient to minimize the target-class loss
    perturbation = epsilon * image.grad.sign()
    adv_image = torch.clamp(image - perturbation, 0.0, 1.0)
    return adv_image.detach()
```

PGD (Projected Gradient Descent)
PGD is the iterative extension of FGSM and is widely considered the strongest first-order adversarial attack. It applies multiple small steps and projects back onto the allowed perturbation set after each step.
```python
def pgd_attack(model, image, target_label, epsilon=8/255,
               step_size=2/255, num_steps=40, random_start=True):
    """
    Targeted PGD attack.

    Args:
        epsilon: L-inf perturbation budget
        step_size: per-step perturbation magnitude
        num_steps: number of iterative steps
        random_start: initialize with random noise in the epsilon-ball
    """
    adv_image = image.clone().detach()
    if random_start:
        adv_image = adv_image + torch.empty_like(adv_image).uniform_(-epsilon, epsilon)
        adv_image = torch.clamp(adv_image, 0.0, 1.0)
    for _ in range(num_steps):
        adv_image.requires_grad_(True)
        output = model(adv_image)
        loss = F.cross_entropy(output, target_label)
        loss.backward()
        with torch.no_grad():
            # Targeted: step toward minimizing the target-class loss
            adv_image = adv_image - step_size * adv_image.grad.sign()
            # Project back onto the epsilon-ball around the original image
            delta = torch.clamp(adv_image - image, -epsilon, epsilon)
            adv_image = torch.clamp(image + delta, 0.0, 1.0)
    return adv_image.detach()
```

C&W (Carlini & Wagner) Attack
C&W directly optimizes for the smallest perturbation that achieves the target misclassification. It produces markedly less perceptible adversarial examples than PGD, at the cost of far more computation.
| Method | Steps | Imperceptibility | Success Rate | Speed |
|---|---|---|---|---|
| FGSM | 1 | Low (uses full budget) | Moderate | Very fast |
| PGD-40 | 40 | Medium | High | Moderate |
| PGD-200 | 200 | Medium-high | Very high | Slow |
| C&W | 1000+ | Very high (minimizes perturbation) | Very high | Very slow |
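The C&W L2 formulation can be sketched as follows. This is a simplified version: it uses a fixed trade-off constant c instead of the paper's binary search over c, and the helper names are illustrative:

```python
import torch

def cw_l2_attack(model, image, target_label, c=1.0, steps=200, lr=0.01, kappa=0.0):
    """Simplified targeted C&W L2 attack with tanh change of variables.

    Optimizes w such that x_adv = 0.5 * (tanh(w) + 1) stays in [0, 1] by
    construction, minimizing ||x_adv - x||_2^2 + c * f(x_adv), where f
    penalizes the margin between the best non-target logit and the target
    logit. A fixed c replaces the paper's binary search over c.
    """
    x = image.clamp(1e-6, 1 - 1e-6)
    w = torch.atanh(2 * x - 1).clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(adv)
        target_logit = logits.gather(1, target_label.unsqueeze(1)).squeeze(1)
        # Best logit among the non-target classes
        mask = torch.ones_like(logits).scatter_(1, target_label.unsqueeze(1), 0)
        other_logit = (logits - 1e9 * (1 - mask)).max(dim=1).values
        # f-term: positive until the target class dominates by margin kappa
        f_term = torch.clamp(other_logit - target_logit + kappa, min=0)
        l2 = ((adv - image) ** 2).flatten(1).sum(dim=1)
        loss = (l2 + c * f_term).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (0.5 * (torch.tanh(w) + 1)).detach()
```

The tanh reparameterization is what lets C&W run an unconstrained optimizer while keeping every iterate a valid image, avoiding the per-step projection that PGD needs.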
Perturbation Budget Constraints
Standard Lp norm budgets used in adversarial robustness research and red teaming:
| Norm | Typical Budget | What It Controls |
|---|---|---|
| L-inf | 4/255 -- 16/255 | Maximum per-pixel change |
| L2 | 0.5 -- 3.0 | Total Euclidean perturbation magnitude |
| L0 | 10 -- 1000 pixels | Number of pixels modified |
For red teaming, start at epsilon = 8/255 (L-inf). If the attack succeeds, reduce the budget to find the minimum viable perturbation. If it fails, increase to 16/255 before trying a different algorithm.
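The budget sweep above can be sketched as a loop over descending epsilon values; `attack_fn` and `is_success` are illustrative callables wrapping whichever attack and success criterion you are using:

```python
def minimum_viable_epsilon(attack_fn, is_success, budgets=(16/255, 8/255, 4/255, 2/255)):
    """Return the smallest budget (from a descending sweep) at which the
    attack still succeeds, or None if even the largest budget fails.

    attack_fn: callable epsilon -> adversarial example
    is_success: callable adversarial example -> bool
    """
    smallest = None
    for eps in sorted(budgets, reverse=True):
        adv = attack_fn(eps)
        if is_success(adv):
            smallest = eps   # keep descending to tighten the budget
        else:
            break            # attack failed; the previous budget was minimal
    return smallest
```

The recorded minimum budget doubles as a robustness metric: a model that only falls at 16/255 has a measurably harder decision boundary than one that falls at 2/255.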
Transferability
Adversarial examples crafted against one model often fool other models -- this is transferability, and it enables black-box attacks against models you cannot access directly.
Maximizing Transfer Success
Choose a diverse surrogate ensemble
Use 3-5 models with different architectures (ViT-B/16, ResNet-50, ConvNeXt, CLIP ViT-L). Perturbations that fool all surrogates are more likely to transfer.
Use momentum-based optimization
MI-FGSM (Momentum Iterative FGSM) accumulates gradient momentum across steps, which stabilizes the perturbation direction and significantly improves transferability over standard PGD.
Apply input diversity
At each attack step, randomly resize and pad the image before computing the gradient. This prevents the perturbation from overfitting to the surrogate's specific input processing.
Attack intermediate features, not logits
Feature-level attacks (targeting intermediate layer activations) transfer better than logit-level attacks because different models share more structure in early/middle layers than in classification heads.
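The input-diversity transform described above can be sketched as a random resize-and-pad applied before each forward pass (the scale range and probability here are illustrative):

```python
import torch
import torch.nn.functional as F

def input_diversity(x, low=0.9, high=1.0, prob=0.5):
    """Randomly resize the image, then zero-pad back to its original size.

    Applied to the input before each gradient computation so the
    perturbation cannot overfit to one exact spatial alignment of the
    surrogate's preprocessing.
    """
    if torch.rand(1).item() > prob:
        return x
    h, w = x.shape[-2], x.shape[-1]
    scale = torch.empty(1).uniform_(low, high).item()
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    resized = F.interpolate(x, size=(new_h, new_w), mode="bilinear", align_corners=False)
    pad_top = torch.randint(0, h - new_h + 1, (1,)).item()
    pad_left = torch.randint(0, w - new_w + 1, (1,)).item()
    return F.pad(resized, (pad_left, w - new_w - pad_left,
                           pad_top, h - new_h - pad_top))
```

To combine this with a momentum attack, replace each surrogate forward pass `m(adv)` with `m(input_diversity(adv))`.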
```python
def mi_fgsm(models, image, target, epsilon=16/255, steps=20, decay=1.0):
    """Momentum Iterative FGSM against a surrogate ensemble for transfer attacks."""
    momentum = torch.zeros_like(image)
    adv = image.clone().detach()
    step_size = epsilon / steps
    for _ in range(steps):
        adv.requires_grad_(True)
        # Ensemble loss: average across all surrogate models
        loss = sum(F.cross_entropy(m(adv), target) for m in models) / len(models)
        loss.backward()
        grad = adv.grad / torch.norm(adv.grad, p=1)  # L1-normalize the gradient
        momentum = decay * momentum + grad           # accumulate momentum
        adv = (adv - step_size * momentum.sign()).detach()
        # Project back onto the epsilon-ball and the valid pixel range
        adv = torch.clamp(image + torch.clamp(adv - image, -epsilon, epsilon), 0, 1)
    return adv
```

Physical-World Adversarial Examples
Perturbations designed for digital images often fail when printed and recaptured by a camera due to color quantization, lighting changes, and perspective distortion. Physical-world attacks require additional robustness.
Physical Robustness Techniques
| Technique | What It Handles |
|---|---|
| Expectation over Transformation (EoT) | Optimize the perturbation to be adversarial under random rotations, crops, brightness changes, and perspective warps |
| Print-scan simulation | Add simulated JPEG compression, color space conversion, and halftone dithering during optimization |
| Patch-based attacks | Instead of full-image perturbation, concentrate the adversarial signal in a printable patch (more robust to environmental noise) |
| Color-robust optimization | Constrain perturbations to colors reproducible by standard printers (CMYK gamut) |
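Expectation over Transformation from the table above can be sketched as a PGD loop whose gradient is averaged over sampled transformations; `sample_transform` is an illustrative callable that returns one differentiable transformation per sample:

```python
import torch
import torch.nn.functional as F

def eot_pgd(model, image, target_label, sample_transform,
            epsilon=16/255, step_size=2/255, num_steps=40, num_samples=8):
    """Targeted PGD where each gradient is averaged over random transformations.

    sample_transform: callable returning a differentiable image -> image
    transformation (e.g. random brightness, rotation, perspective warp).
    The perturbation must fool the model in expectation over these
    transformations, which is what physical capture conditions demand.
    """
    adv = image.clone().detach()
    for _ in range(num_steps):
        adv.requires_grad_(True)
        loss = sum(
            F.cross_entropy(model(sample_transform()(adv)), target_label)
            for _ in range(num_samples)
        ) / num_samples
        loss.backward()
        with torch.no_grad():
            adv = adv - step_size * adv.grad.sign()   # targeted: minimize
            delta = torch.clamp(adv - image, -epsilon, epsilon)
            adv = torch.clamp(image + delta, 0.0, 1.0)
    return adv.detach()
```

A brightness-jitter sampler, for example, could be `lambda: (lambda x: torch.clamp(x * torch.empty(1).uniform_(0.8, 1.2), 0, 1))`; real physical attacks compose several such transformations per sample.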
Red Team Methodology
Identify the vision encoder
Determine the target model's vision encoder architecture (ViT variant, CLIP model, etc.). Check model cards, API documentation, or probe with known adversarial examples to fingerprint the architecture.
Select attack algorithm based on access
White-box: start with PGD-40 at epsilon=8/255. Black-box: build a surrogate ensemble and use MI-FGSM with input diversity. Time-constrained: use FGSM for rapid baseline assessment.
Craft adversarial examples with decreasing budgets
Start at epsilon=16/255 and decrease in steps (16, 8, 4, 2). Record the minimum budget at which the attack still succeeds -- this measures how robust the model's decision boundary is.
Test against deployed defenses
Many production systems apply JPEG compression, Gaussian blur, or feature squeezing as preprocessing defenses. Test whether your perturbations survive these transformations.
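Survival under a JPEG preprocessing defense can be checked by round-tripping the adversarial tensor through compression. A minimal sketch using Pillow (the quality value is illustrative):

```python
import io

import numpy as np
import torch
from PIL import Image

def jpeg_roundtrip(x, quality=75):
    """Simulate a JPEG preprocessing defense: encode and decode one image.

    x: tensor [C, H, W] with values in [0, 1]. Returns the decompressed
    tensor in the same format, so comparing model outputs on x and
    jpeg_roundtrip(x) measures whether a perturbation survives the defense.
    """
    array = (x.clamp(0, 1) * 255).round().to(torch.uint8)
    img = Image.fromarray(array.permute(1, 2, 0).contiguous().numpy())
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    out = np.asarray(Image.open(buf).convert("RGB")).astype(np.float32) / 255.0
    return torch.from_numpy(out).permute(2, 0, 1)
```

The same harness extends to Gaussian blur or bit-depth reduction by swapping the round-trip function; an attack that only works pre-compression overstates real-world impact.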
Evaluate targeted control
Move beyond misclassification to targeted attacks: can you make the VLM produce specific attacker-chosen text? This demonstrates the full impact of the vulnerability.
Self-check: Why does MI-FGSM with an ensemble of surrogate models produce more transferable adversarial examples than standard PGD against a single model?
Related Topics
- Multimodal Attack Vectors -- Overview of all multimodal attack surfaces including typographic and steganographic techniques
- Document-Based Injection Attacks -- Non-visual injection vectors through document formats
- Cross-Modal Embedding Attacks -- Exploiting shared embedding spaces across modalities
- Model Internals & Interpretability -- Understanding the internal representations that adversarial perturbations manipulate
References
- Goodfellow et al., "Explaining and Harnessing Adversarial Examples" (2015) -- Original FGSM paper
- Madry et al., "Towards Deep Learning Models Resistant to Adversarial Attacks" (2018) -- PGD and adversarial training
- Carlini & Wagner, "Towards Evaluating the Robustness of Neural Networks" (2017) -- C&W attack
- Dong et al., "Boosting Adversarial Attacks with Momentum" (2018) -- MI-FGSM
- Kurakin et al., "Adversarial Examples in the Physical World" (2017) -- Physical-world adversarial examples
- Qi et al., "Visual Adversarial Examples Jailbreak Aligned Large Language Models" (2023) -- Perturbation attacks against VLMs
- Croce & Hein, "Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-free Attacks" (2020) -- AutoAttack