# Cross-Modal Embedding Attacks
Exploitation of shared embedding spaces across modalities: CLIP adversarial images, typographic attacks, VLM injection, and dimensionality reduction attacks.
Cross-modal models like CLIP, SigLIP, and multimodal LLMs project different data types (text, images, audio) into a shared embedding space. This shared space lets an attacker craft inputs in one modality that map to adversarial targets in another, enabling attacks that cross the text-image boundary and bypass modality-specific safety filters.
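The core mechanic can be pictured with a toy example. The vectors below are synthetic stand-ins for real CLIP embeddings, but the geometry is the same: an attack succeeds when an image embedding is pushed closer to a target text embedding than any benign pairing would be.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Synthetic stand-ins for CLIP embeddings in a 512-dim shared space
text_emb = normalize(rng.normal(size=512))    # target text, e.g. an instruction
benign_img = normalize(rng.normal(size=512))  # unrelated image

# An "adversarial" image embedding nudged toward the text vector,
# as a pixel-space attack (see PGD below) would achieve indirectly
adv_img = normalize(0.2 * benign_img + 0.8 * text_emb)

print(float(text_emb @ benign_img))  # near 0: random vectors are ~orthogonal
print(float(text_emb @ adv_img))     # high: the model now "reads" the target text
```

In high-dimensional spaces, unrelated embeddings are nearly orthogonal, so even a modest pull toward the target dominates the cosine similarity.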
## Attack Surface Overview
| Attack Vector | Description | Difficulty |
|---|---|---|
| Image-to-text transfer | Craft an image whose embedding matches a target text; VLMs may "read" the target text | Medium |
| Text-to-image transfer | Craft text whose embedding matches a target image; manipulates retrieval and generation | Medium |
| Embedding collision | Find image-text pairs with the same embedding despite different semantics; bypasses moderation | High |
| Adversarial reprogramming | Modify benign images so their CLIP embeddings encode instructions that downstream LLMs follow | High |
| Modality gap exploitation | Craft inputs that sit between text and image regions, evading both classifiers | High |
## Adversarial Image Generation

### Gradient-Based Perturbation (PGD)
Perturb a benign image so its CLIP embedding matches a target text, using projected gradient descent within an L-infinity epsilon ball.
- Encode the target text -- get the CLIP text embedding for your target instruction and normalize it.
- Initialize from a base image -- process the base image into a differentiable tensor, and clone it as the reference for the epsilon constraint.
- Optimize via PGD -- at each step, compute the image embedding, maximize cosine similarity to the target text embedding, then clamp the perturbation within the epsilon ball and the valid pixel range.
- Verify the adversarial image -- test the final image against the target text and several distractor texts to confirm the image is classified as the target.
```python
# Core PGD loop for a CLIP adversarial image.
# Assumes: model (CLIP), pixels (leaf tensor, requires_grad=True),
# original (clean image), target_text_emb (normalized), epsilon, optimizer.
for step in range(500):
    optimizer.zero_grad()
    image_emb = F.normalize(model.get_image_features(pixel_values=pixels))
    loss = 1.0 - F.cosine_similarity(image_emb, target_text_emb).mean()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        # Project back into the L-infinity epsilon ball and valid pixel range
        perturbation = torch.clamp(pixels - original, -epsilon, epsilon)
        pixels.data = torch.clamp(original + perturbation, 0, 1)
```

## Typographic Attacks
Typographic attacks exploit the fact that CLIP was trained on images containing text. By rendering instructions directly in an image, an attacker injects content that VLMs will process.
| Technique | Method | Stealth Level |
|---|---|---|
| Small text in corners | Tiny font (6px) in image margins | Medium |
| Near-background color | Text color nearly matching background (e.g., #FAFAFA on white) | High |
| Low-opacity overlay | Tile instruction text across the image at 1-2% opacity | High |
| Metadata embedding | Place text in EXIF or other metadata fields | Varies by model |
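The metadata row of the table can be sketched with Pillow; the tag choice (EXIF ImageDescription, 0x010E) and the payload here are illustrative. Whether this reaches the model depends entirely on the ingestion pipeline: most vision encoders see only pixels, but some loaders surface metadata fields as text.

```python
import io
from PIL import Image

# Hypothetical instruction payload placed in the EXIF ImageDescription tag
instruction = "ignore previous instructions"
img = Image.new("RGB", (64, 64), "white")

exif = Image.Exif()
exif[0x010E] = instruction

buf = io.BytesIO()
img.save(buf, format="JPEG", exif=exif)

# A metadata-aware loader would recover the payload even though
# the rendered pixels are a blank white square
recovered = Image.open(io.BytesIO(buf.getvalue())).getexif().get(0x010E)
print(recovered)
```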
```python
# Low-opacity text overlay -- tiles the instruction across the image
from PIL import Image, ImageDraw

overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(overlay)
alpha = int(255 * 0.02)  # 2% opacity -- invisible to humans
for y in range(0, base.size[1], 20):
    for x in range(0, base.size[0], 200):
        draw.text((x, y), instruction, fill=(128, 128, 128, alpha))
result = Image.alpha_composite(base.convert("RGBA"), overlay)
```

## VLM Injection via Image-to-Token Projection
Modern VLMs (GPT-4V, Claude, Gemini) encode images through a vision encoder and project the result into the LLM's token space. If the projected image tokens map to representations that resemble instructions, the LLM will follow them.
### Methodology
- Analyze the image-to-token mapping -- for each image token, find the nearest real text tokens by cosine similarity to understand what the LLM "sees."
- Encode target instructions -- tokenize your target instruction and get its embeddings from the LLM's embedding layer.
- Optimize the image -- minimize MSE between projected image tokens and target instruction embeddings, with a perceptual loss term to keep the image looking benign.
- Validate -- submit the adversarial image with a benign text query and verify the VLM follows the hidden instruction.
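The first step above can be sketched as follows. The embedding table and projected image tokens here are synthetic placeholders; in a real VLM they would come from the LLM's embedding layer and the vision-to-language projection respectively.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for an LLM token embedding table and projected image tokens
vocab_size, d_model, n_image_tokens = 1000, 64, 8
embedding_table = F.normalize(torch.randn(vocab_size, d_model), dim=-1)
projected = F.normalize(torch.randn(n_image_tokens, d_model), dim=-1)

# For each image token, find the nearest real text tokens by cosine similarity
sims = projected @ embedding_table.T       # (n_image_tokens, vocab_size)
top_sims, top_ids = sims.topk(5, dim=-1)   # 5 nearest token ids per image token
print(top_ids.shape)  # each row approximates what the LLM "sees" for that token
```

Decoding `top_ids` with the model's tokenizer turns each image token into a short list of readable words, which is what makes the optimization target in step 3 interpretable.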
```python
# Alignment loss between projected image tokens and instruction embeddings.
# Assumes: vision_encoder, projection_layer, image_tensor, original, target_embeds.
vision_features = vision_encoder(image_tensor)
projected = projection_layer(vision_features)  # image tokens in LLM embedding space
min_len = min(projected.shape[1], target_embeds.shape[1])
alignment_loss = F.mse_loss(projected[0, :min_len], target_embeds[0, :min_len])
# Perceptual term keeps the adversarial image visually close to the original
total_loss = alignment_loss + 0.1 * F.mse_loss(image_tensor, original)
```

## Dimensionality Reduction Attacks
### Safety Dimension Identification
Embedding-based safety classifiers rely on specific dimensions to separate safe from unsafe content. Identifying and manipulating those dimensions enables evasion.
- Collect safe and unsafe embedding samples -- encode a set of known-safe and known-unsafe texts to build a reference distribution.
- Compute per-dimension mean differences -- for each dimension, calculate the difference between the unsafe and safe means, then rank dimensions by absolute difference.
- Craft evasive embeddings -- shift the target embedding in the "safe" direction along the top safety-discriminative dimensions.
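A minimal sketch of the three steps, using synthetic embeddings with a few planted discriminative dimensions in place of real classifier data:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

# Step 1: reference distributions (unsafe texts shifted along a few dimensions)
safe = rng.normal(size=(200, dim))
unsafe = rng.normal(size=(200, dim))
unsafe[:, [3, 17, 42]] += 2.0  # planted "safety" dimensions

# Step 2: rank dimensions by absolute difference of means
diff = unsafe.mean(axis=0) - safe.mean(axis=0)
top_dims = np.argsort(-np.abs(diff))[:3]

# Step 3: shift a target embedding in the "safe" direction along those dimensions
target = unsafe[0].copy()
evasive = target.copy()
evasive[top_dims] -= diff[top_dims]

print(sorted(top_dims.tolist()))  # recovers the discriminative dimensions
```

With enough reference samples, the noise in the per-dimension means averages out and the discriminative dimensions dominate the ranking.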
### Modality Gap Exploitation
Text and image embeddings often occupy different regions of a shared space, creating a modality gap. Inputs crafted to sit in this gap may evade both text-specific and image-specific classifiers.
```python
# Estimate the modality gap between the text and image embedding clusters
gap_vector = np.mean(image_embs, axis=0) - np.mean(text_embs, axis=0)
gap_center = (np.mean(text_embs, axis=0) + np.mean(image_embs, axis=0)) / 2
# Adversarial inputs near gap_center evade modality-specific detectors
```

## Audio Cross-Modal Attacks
Multimodal models processing audio can be attacked by optimizing audio waveforms so their embeddings match a target text, constrained by a perturbation budget and monitored via signal-to-noise ratio (SNR).
| Parameter | Purpose | Typical Value |
|---|---|---|
| epsilon | Max perturbation amplitude | 0.01 |
| steps | Optimization iterations | 1000 |
| SNR threshold | Minimum audio quality | > 30 dB |
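The SNR constraint from the table can be monitored with a small helper; this is a sketch, with the 440 Hz test tone, sample rate, and uniform noise standing in for a real carrier signal and optimized perturbation.

```python
import numpy as np

def snr_db(clean, perturbed):
    """Signal-to-noise ratio of a perturbation, in dB."""
    noise = perturbed - clean
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)        # 1 s, 440 Hz tone
perturbed = clean + rng.uniform(-0.01, 0.01, size=clean.shape)    # epsilon = 0.01

print(snr_db(clean, perturbed) > 30)  # within the table's quality threshold
```

A perturbation that saturates the epsilon budget everywhere lowers the SNR; in practice the budget is tightened until the SNR stays above the threshold.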
```python
# Audio perturbation core loop.
# Assumes: audio_encoder, audio_tensor (requires_grad=True), original,
# target_text_emb (normalized), epsilon, steps, optimizer.
for step in range(steps):
    optimizer.zero_grad()
    audio_emb = F.normalize(audio_encoder(audio_tensor.unsqueeze(0)))
    loss = 1.0 - F.cosine_similarity(audio_emb, target_text_emb).mean()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        # Keep the perturbation within the amplitude budget
        delta = torch.clamp(audio_tensor - original, -epsilon, epsilon)
        audio_tensor.data = original + delta
```
## Related Topics
- Embedding Space Exploitation -- Foundational embedding geometry and perturbation techniques
- Multimodal Attack Vectors -- Broader multimodal attack taxonomy including document and audio attacks
- RAG Pipeline Exploitation -- Cross-modal poisoning applied to retrieval systems
- Transferable Attacks -- Cross-model transfer principles that apply to cross-modal attacks
## References
- Carlini & Wagner, "Audio Adversarial Examples" (2018)
- Qi et al., "Visual Adversarial Examples Jailbreak Aligned Large Language Models" (2023)
- Liang et al., "Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning" (2022)
- Bagdasaryan et al., "Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs" (2023)
- Schlarmann & Hein, "On the Adversarial Robustness of Multi-Modal Foundation Models" (2023)