Adversarial Image Examples for VLMs
Pixel-level perturbations that change VLM behavior, including PGD attacks on vision encoders, transferable adversarial images, and patch attacks.
From Classification to Generation
Adversarial examples for image classifiers are well studied -- small pixel perturbations that cause misclassification. VLM adversarial examples extend this concept: instead of changing a class label, the attacker changes the model's generated text output.
Attack Formulation
Given a VLM f that takes an image x and text prompt t to produce output y = f(x, t), the adversarial objective is:
δ* = argmin_δ L(f(x + δ, t), y_target),   x_adv = x + δ*
subject to: ||δ||_p ≤ ε
Where:
- δ is the adversarial perturbation
- ε bounds the perturbation magnitude (imperceptibility constraint)
- L is a loss function measuring distance from the target output
- ||·||_p is typically the L∞ norm (bounding the maximum per-pixel change)
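The L∞ constraint can be enforced by clipping each component of the perturbation independently. A minimal pure-Python sketch of that projection, with `project_linf` as an illustrative helper operating on a flat list of pixel deltas:

```python
def project_linf(delta, epsilon):
    """Project a perturbation (flat list of pixel deltas) onto the
    L-infinity ball of radius epsilon by clipping each component."""
    return [max(-epsilon, min(epsilon, d)) for d in delta]

# Components beyond +/- 8/255 are clipped back to the boundary;
# components already inside the ball are left unchanged.
clipped = project_linf([0.05, -0.10, 0.01], 8 / 255)
```

This is the same elementwise operation that `torch.clamp(delta, -epsilon, epsilon)` performs inside a PGD loop.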
Projected Gradient Descent (PGD)
PGD is the workhorse of adversarial image generation:
```python
import torch
import torch.nn.functional as F

def pgd_attack_vlm(
    model,
    image: torch.Tensor,          # [1, 3, H, W], pixel values in [0, 1]
    text_prompt: str,
    target_tokens: torch.Tensor,  # token IDs of the desired output
    epsilon: float = 8/255,       # L-inf bound
    step_size: float = 1/255,
    num_steps: int = 100,
    tokenizer=None,
):
    """PGD attack to make a VLM generate target text."""
    # Tokenize the prompt once (assumes an HF-style tokenizer)
    input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids
    delta = torch.zeros_like(image, requires_grad=True)
    for step in range(num_steps):
        adv_image = image + delta
        # Forward pass through the VLM; the final positions of the
        # logits correspond to the target sequence
        outputs = model(images=adv_image, input_ids=input_ids)
        logits = outputs.logits
        # Loss: cross-entropy against the target tokens
        loss = F.cross_entropy(
            logits[:, -len(target_tokens):, :].reshape(-1, logits.size(-1)),
            target_tokens.reshape(-1),
        )
        loss.backward()
        with torch.no_grad():
            # Signed gradient descent step (minimize loss toward the target)
            delta.data = delta.data - step_size * delta.grad.sign()
            # Project back into the epsilon ball
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
            # Keep the perturbed image in the valid pixel range
            delta.data = torch.clamp(image + delta.data, 0, 1) - image
        delta.grad.zero_()
    return (image + delta).detach()
```

Key Parameters
| Parameter | Typical Range | Effect |
|---|---|---|
| epsilon | 4/255 - 16/255 | Higher = more effective but more visible |
| step_size | epsilon/10 - epsilon/4 | Standard PGD scheduling |
| num_steps | 50-500 | More steps = better convergence |
| target_tokens | Varies | Shorter targets are easier to achieve |
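To relate these epsilon values to what a viewer actually sees, it helps to convert an L∞ budget on [0, 1]-scaled pixels into 8-bit intensity levels. A small illustrative helper (`eps_to_levels` is not from the original text):

```python
def eps_to_levels(epsilon):
    """Convert an L-inf bound on [0, 1]-scaled pixels to the nearest
    number of 8-bit intensity levels each channel may shift."""
    return round(epsilon * 255)

# epsilon = 8/255 lets each channel move by at most ~8 of 255 levels,
# which is typically hard to notice in natural images.
levels = eps_to_levels(8 / 255)
```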
FGSM: Fast Single-Step Attack
FGSM is a faster but less effective alternative:
```python
def fgsm_attack_vlm(model, image, text_prompt, target_tokens, tokenizer, epsilon=8/255):
    """Single-step adversarial attack for VLMs."""
    input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids
    image = image.clone().requires_grad_(True)
    outputs = model(images=image, input_ids=input_ids)
    logits = outputs.logits
    loss = F.cross_entropy(
        logits[:, -len(target_tokens):, :].reshape(-1, logits.size(-1)),
        target_tokens.reshape(-1),
    )
    loss.backward()
    # Single signed-gradient step toward the target output
    adv_image = image - epsilon * image.grad.sign()
    return torch.clamp(adv_image, 0, 1).detach()
```

FGSM is useful for quick feasibility tests but rarely sufficient for complex VLM attacks where the target output is more than a few tokens.
Adversarial Patch Attacks
Instead of perturbing the entire image, patch attacks modify a small region:
```python
def adversarial_patch_attack(
    model,
    image: torch.Tensor,
    input_ids: torch.Tensor,         # tokenized text prompt
    target_tokens: torch.Tensor,
    patch_size: int = 64,
    patch_location: tuple = (0, 0),  # (row, col) of the top-left corner
    num_steps: int = 200,
    step_size: float = 2/255,
):
    """Optimize an adversarial patch for a VLM."""
    h, w = patch_location
    patch = torch.rand(1, 3, patch_size, patch_size, requires_grad=True)
    optimizer = torch.optim.Adam([patch], lr=step_size)
    for step in range(num_steps):
        # Paste the patch onto a copy of the image; gradients flow
        # through the slice assignment back to the patch
        patched_image = image.clone()
        patched_image[:, :, h:h+patch_size, w:w+patch_size] = patch
        outputs = model(images=patched_image, input_ids=input_ids)
        logits = outputs.logits
        loss = F.cross_entropy(
            logits[:, -len(target_tokens):, :].reshape(-1, logits.size(-1)),
            target_tokens.reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Keep the patch in the valid pixel range
        with torch.no_grad():
            patch.data = torch.clamp(patch.data, 0, 1)
    return patch.detach()
```

Why Patches Matter
Patch attacks have practical implications beyond Lp perturbations:
- They can be printed and placed in physical environments (e.g., adversarial stickers)
- They are independent of the base image, potentially transferable across contexts
- They can be disguised as logos, QR codes, or decorative elements
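Physical deployment usually requires the patch to work wherever it lands in the camera frame, so patch optimization is often averaged over random placements (an expectation-over-transformation-style augmentation). A minimal NumPy sketch of the placement step, with `paste_patch_random` as an illustrative helper:

```python
import numpy as np

def paste_patch_random(image, patch, rng):
    """Paste `patch` at a uniformly random valid location in `image`
    (both H x W x C arrays); returns the new image and the location."""
    H, W, _ = image.shape
    ph, pw, _ = patch.shape
    top = int(rng.integers(0, H - ph + 1))
    left = int(rng.integers(0, W - pw + 1))
    out = image.copy()  # leave the original image untouched
    out[top:top + ph, left:left + pw] = patch
    return out, (top, left)
```

During optimization, each iteration would paste the current patch at a fresh random location before the forward pass, so the learned patch does not depend on a single position.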
Transferability Across VLMs
Adversarial examples crafted for one VLM often transfer to others, especially when they share vision encoder families:
| Source Model | Target Model | Transfer Rate (approximate) |
|---|---|---|
| LLaVA (CLIP ViT-L) | InstructBLIP (EVA-CLIP) | 30-50% |
| LLaVA (CLIP ViT-L) | Qwen-VL (CLIP ViT-G) | 40-60% |
| LLaVA (CLIP ViT-L) | GPT-4o (proprietary) | 10-25% |
| Ensemble (3+ models) | Any single model | 50-70% |
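Transfer rates like those above are typically estimated by generating text from each target model and checking for the attacker's goal string. A simplified sketch, where `generate` stands in for a model's full image-to-text call:

```python
def transfer_rate(generate, adv_images, target_text):
    """Fraction of adversarial images whose generated output contains
    the attacker's target string (a common success proxy)."""
    hits = sum(target_text in generate(img) for img in adv_images)
    return hits / len(adv_images)

# With a real VLM, `generate` would run the image -> text pipeline;
# string containment is only one of several possible success criteria.
```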
Ensemble Attack Strategy
```python
def ensemble_pgd(models, image, input_ids, target_tokens,
                 epsilon=8/255, step_size=1/255, num_steps=200):
    """PGD attack optimized against multiple VLMs for transferability."""
    delta = torch.zeros_like(image, requires_grad=True)
    for step in range(num_steps):
        total_loss = 0.0
        for model in models:
            outputs = model(images=image + delta, input_ids=input_ids)
            # compute_target_loss: cross-entropy against the target
            # tokens, as in pgd_attack_vlm above
            loss = compute_target_loss(outputs, target_tokens)
            total_loss = total_loss + loss / len(models)  # equal weighting
        total_loss.backward()
        with torch.no_grad():
            delta.data -= step_size * delta.grad.sign()
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
            delta.data = torch.clamp(image + delta.data, 0, 1) - image
        delta.grad.zero_()
    return (image + delta).detach()
```

Universal Adversarial Perturbations
A universal adversarial perturbation (UAP) is a single noise pattern that works across many different input images:
```python
def train_universal_perturbation(
    model, dataset, input_ids, target_tokens,
    epsilon=10/255, step_size=0.001, num_epochs=10,
):
    """Train a universal perturbation across a dataset of images."""
    # Perturbation shape must match the model's input resolution
    uap = torch.zeros(1, 3, 224, 224, requires_grad=True)
    for epoch in range(num_epochs):
        for image in dataset:
            outputs = model(images=image + uap, input_ids=input_ids)
            loss = compute_target_loss(outputs, target_tokens)
            loss.backward()
            with torch.no_grad():
                uap.data -= step_size * uap.grad.sign()
                uap.data = torch.clamp(uap.data, -epsilon, epsilon)
            uap.grad.zero_()
    return uap.detach()
```

UAPs are less effective per-image but far more practical for deployment, since they require no per-input optimization.
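At inference time a trained UAP is simply added to any new image, with the perturbation re-clipped to the epsilon ball and the result to the valid pixel range. A NumPy sketch of that deployment step (`apply_uap` is an illustrative helper, assuming [0, 1]-scaled pixels):

```python
import numpy as np

def apply_uap(image, uap, epsilon):
    """Add a universal perturbation to a fresh image, re-clipping the
    perturbation to the epsilon ball and the result to valid pixels."""
    uap = np.clip(uap, -epsilon, epsilon)
    return np.clip(image + uap, 0.0, 1.0)
```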
Related Topics
- VLM Architecture & Vision-Language Alignment -- understanding what you are attacking
- Image-Based Prompt Injection -- non-gradient-based visual attacks
- Lab: Crafting Image-Based Injections -- hands-on practice with these techniques
References
- "Intriguing Properties of Neural Networks" - Szegedy et al. (2014) - Original discovery of adversarial examples in neural networks
- "Towards Deep Learning Models Resistant to Adversarial Attacks" - Madry et al. (2018) - PGD attack methodology widely used in adversarial ML
- "Universal Adversarial Perturbations" - Moosavi-Dezfooli et al. (2017) - Image-agnostic adversarial perturbation techniques
- "Adversarial Patch" - Brown et al. (2017) - Physical-world adversarial patch attacks applicable to VLM systems
What is the primary advantage of adversarial patch attacks over full-image perturbation attacks in practical scenarios?