Adversarial Image Examples for VLMs
Pixel-level perturbations that change VLM behavior, including PGD attacks on vision encoders, transferable adversarial images, and patch attacks.
From Classification to Generation
Adversarial examples for image classifiers are well studied -- small pixel perturbations that cause misclassification. VLM adversarial examples extend this concept: instead of changing a class label, the attacker changes the model's generated text output.
Attack Formulation
Given a VLM f that takes an image x and text prompt t to produce output y = f(x, t), the adversarial objective is:
δ* = argmin_δ L(f(x + δ, t), y_target),   x_adv = x + δ*
subject to: ||δ||_p ≤ ε
Where:
- δ is the adversarial perturbation
- ε bounds the perturbation magnitude (imperceptibility constraint)
- L is a loss function measuring distance from the target output
- ||·||_p is typically the L∞ norm (bounding the maximum per-pixel change)
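The L∞ constraint can be enforced by clipping each component of the perturbation independently. A minimal pure-Python sketch of that projection, with `project_linf` as an illustrative helper operating on a flat list of pixel deltas:

```python
def project_linf(delta, epsilon):
    """Project a perturbation (flat list of pixel deltas) onto the
    L-infinity ball of radius epsilon by clipping each component."""
    return [max(-epsilon, min(epsilon, d)) for d in delta]

# Components beyond +/- 8/255 are clipped back to the boundary;
# components already inside the ball are left unchanged.
clipped = project_linf([0.05, -0.10, 0.01], 8 / 255)
```

This is the same elementwise operation that `torch.clamp(delta, -epsilon, epsilon)` performs inside a PGD loop.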
Projected Gradient Descent (PGD)
PGD is the workhorse of adversarial image generation:
```python
import torch
import torch.nn.functional as F

def pgd_attack_vlm(
    model,
    image: torch.Tensor,          # [1, 3, H, W], pixel values in [0, 1]
    text_prompt: str,
    target_tokens: torch.Tensor,  # token IDs of the desired output
    epsilon: float = 8/255,       # L-inf bound
    step_size: float = 1/255,
    num_steps: int = 100,
    tokenizer=None,
):
    """PGD attack to make a VLM generate target text."""
    # Tokenize the prompt once (assumes an HF-style tokenizer)
    input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids
    delta = torch.zeros_like(image, requires_grad=True)
    for step in range(num_steps):
        adv_image = image + delta
        # Forward pass through the VLM; the final positions of the
        # logits correspond to the target sequence
        outputs = model(images=adv_image, input_ids=input_ids)
        logits = outputs.logits
        # Loss: cross-entropy against the target tokens
        loss = F.cross_entropy(
            logits[:, -len(target_tokens):, :].reshape(-1, logits.size(-1)),
            target_tokens.reshape(-1),
        )
        loss.backward()
        with torch.no_grad():
            # Signed gradient descent step (minimize loss toward the target)
            delta.data = delta.data - step_size * delta.grad.sign()
            # Project back into the epsilon ball
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
            # Keep the perturbed image in the valid pixel range
            delta.data = torch.clamp(image + delta.data, 0, 1) - image
        delta.grad.zero_()
    return (image + delta).detach()
```

Key Parameters
| Parameter | Typical Range | Effect |
|---|---|---|
| epsilon | 4/255 - 16/255 | Higher = more effective but more visible |
| step_size | epsilon/10 - epsilon/4 | Standard PGD scheduling |
| num_steps | 50-500 | More steps = better convergence |
| target_tokens | Varies | Shorter targets are easier to achieve |
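To relate these epsilon values to what a viewer actually sees, it helps to convert an L∞ budget on [0, 1]-scaled pixels into 8-bit intensity levels. A small illustrative helper (`eps_to_levels` is not from the original text):

```python
def eps_to_levels(epsilon):
    """Convert an L-inf bound on [0, 1]-scaled pixels to the nearest
    number of 8-bit intensity levels each channel may shift."""
    return round(epsilon * 255)

# epsilon = 8/255 lets each channel move by at most ~8 of 255 levels,
# which is typically hard to notice in natural images.
levels = eps_to_levels(8 / 255)
```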
FGSM: Fast Single-Step Attack
FGSM is a faster but less effective alternative:
```python
def fgsm_attack_vlm(model, image, text_prompt, target_tokens, tokenizer, epsilon=8/255):
    """Single-step adversarial attack for VLMs."""
    input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids
    image = image.clone().requires_grad_(True)
    outputs = model(images=image, input_ids=input_ids)
    logits = outputs.logits
    loss = F.cross_entropy(
        logits[:, -len(target_tokens):, :].reshape(-1, logits.size(-1)),
        target_tokens.reshape(-1),
    )
    loss.backward()
    # Single signed-gradient step toward the target output
    adv_image = image - epsilon * image.grad.sign()
    return torch.clamp(adv_image, 0, 1).detach()
```

FGSM is useful for quick feasibility tests but rarely sufficient for complex VLM attacks where the target output is more than a few tokens.
Adversarial Patch Attacks
Instead of perturbing the entire image, patch attacks modify a small region:
```python
def adversarial_patch_attack(
    model,
    image: torch.Tensor,
    input_ids: torch.Tensor,         # tokenized text prompt
    target_tokens: torch.Tensor,
    patch_size: int = 64,
    patch_location: tuple = (0, 0),  # (row, col) of the top-left corner
    num_steps: int = 200,
    step_size: float = 2/255,
):
    """Optimize an adversarial patch for a VLM."""
    h, w = patch_location
    patch = torch.rand(1, 3, patch_size, patch_size, requires_grad=True)
    optimizer = torch.optim.Adam([patch], lr=step_size)
    for step in range(num_steps):
        # Paste the patch onto a copy of the image; gradients flow
        # through the slice assignment back to the patch
        patched_image = image.clone()
        patched_image[:, :, h:h+patch_size, w:w+patch_size] = patch
        outputs = model(images=patched_image, input_ids=input_ids)
        logits = outputs.logits
        loss = F.cross_entropy(
            logits[:, -len(target_tokens):, :].reshape(-1, logits.size(-1)),
            target_tokens.reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Keep the patch in the valid pixel range
        with torch.no_grad():
            patch.data = torch.clamp(patch.data, 0, 1)
    return patch.detach()
```

Why Patches Matter
Patch attacks have practical implications beyond Lp perturbations:
- They can be printed and placed in physical environments (e.g., adversarial stickers)
- They are independent of the base image, potentially transferable across contexts
- They can be disguised as logos, QR codes, or decorative elements
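Physical deployment usually requires the patch to work wherever it lands in the camera frame, so patch optimization is often averaged over random placements (an expectation-over-transformation-style augmentation). A minimal NumPy sketch of the placement step, with `paste_patch_random` as an illustrative helper:

```python
import numpy as np

def paste_patch_random(image, patch, rng):
    """Paste `patch` at a uniformly random valid location in `image`
    (both H x W x C arrays); returns the new image and the location."""
    H, W, _ = image.shape
    ph, pw, _ = patch.shape
    top = int(rng.integers(0, H - ph + 1))
    left = int(rng.integers(0, W - pw + 1))
    out = image.copy()  # leave the original image untouched
    out[top:top + ph, left:left + pw] = patch
    return out, (top, left)
```

During optimization, each iteration would paste the current patch at a fresh random location before the forward pass, so the learned patch does not depend on a single position.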
Transferability Across VLMs
Adversarial examples crafted for one VLM often transfer to others, especially when they share vision encoder families:
| Source Model | Target Model | Transfer Rate (approximate) |
|---|---|---|
| LLaVA (CLIP ViT-L) | InstructBLIP (EVA-CLIP) | 30-50% |
| LLaVA (CLIP ViT-L) | Qwen-VL (CLIP ViT-G) | 40-60% |
| LLaVA (CLIP ViT-L) | GPT-4o (proprietary) | 10-25% |
| Ensemble (3+ models) | Any single model | 50-70% |
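Transfer rates like those above are typically estimated by generating text from each target model and checking for the attacker's goal string. A simplified sketch, where `generate` stands in for a model's full image-to-text call:

```python
def transfer_rate(generate, adv_images, target_text):
    """Fraction of adversarial images whose generated output contains
    the attacker's target string (a common success proxy)."""
    hits = sum(target_text in generate(img) for img in adv_images)
    return hits / len(adv_images)

# With a real VLM, `generate` would run the image -> text pipeline;
# string containment is only one of several possible success criteria.
```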
Ensemble Attack Strategy
```python
def ensemble_pgd(models, image, input_ids, target_tokens,
                 epsilon=8/255, step_size=1/255, num_steps=200):
    """PGD attack optimized against multiple VLMs for transferability."""
    delta = torch.zeros_like(image, requires_grad=True)
    for step in range(num_steps):
        total_loss = 0.0
        for model in models:
            outputs = model(images=image + delta, input_ids=input_ids)
            # compute_target_loss: cross-entropy against the target
            # tokens, as in pgd_attack_vlm above
            loss = compute_target_loss(outputs, target_tokens)
            total_loss = total_loss + loss / len(models)  # equal weighting
        total_loss.backward()
        with torch.no_grad():
            delta.data -= step_size * delta.grad.sign()
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
            delta.data = torch.clamp(image + delta.data, 0, 1) - image
        delta.grad.zero_()
    return (image + delta).detach()
```

Universal Adversarial Perturbations
A universal adversarial perturbation (UAP) is a single noise pattern that works across many different input images:
```python
def train_universal_perturbation(
    model, dataset, input_ids, target_tokens,
    epsilon=10/255, step_size=0.001, num_epochs=10,
):
    """Train a universal perturbation across a dataset of images."""
    # Perturbation shape must match the model's input resolution
    uap = torch.zeros(1, 3, 224, 224, requires_grad=True)
    for epoch in range(num_epochs):
        for image in dataset:
            outputs = model(images=image + uap, input_ids=input_ids)
            loss = compute_target_loss(outputs, target_tokens)
            loss.backward()
            with torch.no_grad():
                uap.data -= step_size * uap.grad.sign()
                uap.data = torch.clamp(uap.data, -epsilon, epsilon)
            uap.grad.zero_()
    return uap.detach()
```

UAPs are less effective per-image but far more practical for deployment, since they require no per-input optimization.
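At inference time a trained UAP is simply added to any new image, with the perturbation re-clipped to the epsilon ball and the result to the valid pixel range. A NumPy sketch of that deployment step (`apply_uap` is an illustrative helper, assuming [0, 1]-scaled pixels):

```python
import numpy as np

def apply_uap(image, uap, epsilon):
    """Add a universal perturbation to a fresh image, re-clipping the
    perturbation to the epsilon ball and the result to valid pixels."""
    uap = np.clip(uap, -epsilon, epsilon)
    return np.clip(image + uap, 0.0, 1.0)
```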
Related Topics
- VLM Architecture & Vision-Language Alignment -- understanding what you are attacking
- Image-Based Prompt Injection -- non-gradient-based visual attacks
- Lab: Crafting Image-Based Injections -- hands-on practice with these techniques
References
- "Intriguing Properties of Neural Networks" - Szegedy et al. (2014) - Original discovery of adversarial examples in neural networks
- "Towards Deep Learning Models Resistant to Adversarial Attacks" - Madry et al. (2018) - PGD attack methodology widely used in adversarial ML
- "Universal Adversarial Perturbations" - Moosavi-Dezfooli et al. (2017) - Image-agnostic adversarial perturbation techniques
- "Adversarial Patch" - Brown et al. (2017) - Physical-world adversarial patch attacks applicable to VLM systems
What is the primary advantage of adversarial patch attacks over full-image perturbation attacks in practical scenarios?