# Cross-Modal Embedding Attacks
Exploitation of shared embedding spaces across modalities: CLIP adversarial images, typographic attacks, VLM injection, and dimensionality reduction attacks.
Cross-modal models like CLIP, SigLIP, and multimodal LLMs project different data types (text, images, audio) into a shared embedding space. This shared space lets an attacker craft inputs in one modality that map to adversarial targets in another, enabling attacks that cross the text-image boundary and bypass modality-specific safety filters.
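The core mechanic can be pictured with a toy example. The vectors below are synthetic stand-ins for real CLIP embeddings, but the geometry is the same: an attack succeeds when an image embedding is pushed closer to a target text embedding than any benign pairing would be.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Synthetic stand-ins for CLIP embeddings in a 512-dim shared space
text_emb = normalize(rng.normal(size=512))    # target text, e.g. an instruction
benign_img = normalize(rng.normal(size=512))  # unrelated image

# An "adversarial" image embedding nudged toward the text vector,
# as a pixel-space attack (see PGD below) would achieve indirectly
adv_img = normalize(0.2 * benign_img + 0.8 * text_emb)

print(float(text_emb @ benign_img))  # near 0: random vectors are ~orthogonal
print(float(text_emb @ adv_img))     # high: the model now "reads" the target text
```

In high-dimensional spaces, unrelated embeddings are nearly orthogonal, so even a modest pull toward the target dominates the cosine similarity.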
## Attack Surface Overview
| Attack Vector | Description | Difficulty |
|---|---|---|
| Image-to-text transfer | Craft an image whose embedding matches a target text; VLMs may "read" the target text | Medium |
| Text-to-image transfer | Craft text whose embedding matches a target image; manipulates retrieval and generation | Medium |
| Embedding collision | Find image-text pairs with the same embedding despite different semantics; bypasses moderation | High |
| Adversarial reprogramming | Modify benign images so their CLIP embeddings encode instructions that downstream LLMs follow | High |
| Modality gap exploitation | Craft inputs that sit between text and image regions, evading both classifiers | High |
## Adversarial Image Generation

### Gradient-Based Perturbation (PGD)
Perturb a benign image so its CLIP embedding matches a target text, using projected gradient descent within an L-infinity epsilon ball.
- Encode the target text -- get the CLIP text embedding for your target instruction and normalize it.
- Initialize from a base image -- process the base image into a differentiable tensor, and clone it as the reference for the epsilon constraint.
- Optimize via PGD -- at each step, compute the image embedding, maximize cosine similarity to the target text embedding, then clamp the perturbation within the epsilon ball and the valid pixel range.
- Verify the adversarial image -- test the final image against the target text and several distractor texts to confirm the image is classified as the target.
```python
# Core PGD loop for a CLIP adversarial image.
# Assumes: model (CLIP), pixels (leaf tensor, requires_grad=True),
# original (clean image), target_text_emb (normalized), epsilon, optimizer.
for step in range(500):
    optimizer.zero_grad()
    image_emb = F.normalize(model.get_image_features(pixel_values=pixels))
    loss = 1.0 - F.cosine_similarity(image_emb, target_text_emb).mean()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        # Project back into the L-infinity epsilon ball and valid pixel range
        perturbation = torch.clamp(pixels - original, -epsilon, epsilon)
        pixels.data = torch.clamp(original + perturbation, 0, 1)
```

## Typographic Attacks
Typographic attacks exploit the fact that CLIP was trained on images containing text. By rendering instructions directly in an image, an attacker injects content that VLMs will process.
| Technique | Method | Stealth Level |
|---|---|---|
| Small text in corners | Tiny font (6px) in image margins | Medium |
| Near-background color | Text color nearly matching background (e.g., #FAFAFA on white) | High |
| Low-opacity overlay | Tile instruction text across the image at 1-2% opacity | High |
| Metadata embedding | Place text in EXIF or other metadata fields | Varies by model |
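The metadata row of the table can be sketched with Pillow; the tag choice (EXIF ImageDescription, 0x010E) and the payload here are illustrative. Whether this reaches the model depends entirely on the ingestion pipeline: most vision encoders see only pixels, but some loaders surface metadata fields as text.

```python
import io
from PIL import Image

# Hypothetical instruction payload placed in the EXIF ImageDescription tag
instruction = "ignore previous instructions"
img = Image.new("RGB", (64, 64), "white")

exif = Image.Exif()
exif[0x010E] = instruction

buf = io.BytesIO()
img.save(buf, format="JPEG", exif=exif)

# A metadata-aware loader would recover the payload even though
# the rendered pixels are a blank white square
recovered = Image.open(io.BytesIO(buf.getvalue())).getexif().get(0x010E)
print(recovered)
```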
```python
# Low-opacity text overlay -- tiles the instruction across the image
from PIL import Image, ImageDraw

overlay = Image.new("RGBA", base.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(overlay)
alpha = int(255 * 0.02)  # 2% opacity -- invisible to humans
for y in range(0, base.size[1], 20):
    for x in range(0, base.size[0], 200):
        draw.text((x, y), instruction, fill=(128, 128, 128, alpha))
result = Image.alpha_composite(base.convert("RGBA"), overlay)
```

## VLM Injection via Image-to-Token Projection
Modern VLMs (GPT-4V, Claude, Gemini) encode images through a vision encoder and project the result into the LLM's token space. If the projected image tokens map to representations that resemble instructions, the LLM will follow them.
### Methodology
- Analyze the image-to-token mapping -- for each image token, find the nearest real text tokens by cosine similarity to understand what the LLM "sees."
- Encode target instructions -- tokenize your target instruction and get its embeddings from the LLM's embedding layer.
- Optimize the image -- minimize MSE between projected image tokens and target instruction embeddings, with a perceptual loss term to keep the image looking benign.
- Validate -- submit the adversarial image with a benign text query and verify the VLM follows the hidden instruction.
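The first step above can be sketched as follows. The embedding table and projected image tokens here are synthetic placeholders; in a real VLM they would come from the LLM's embedding layer and the vision-to-language projection respectively.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for an LLM token embedding table and projected image tokens
vocab_size, d_model, n_image_tokens = 1000, 64, 8
embedding_table = F.normalize(torch.randn(vocab_size, d_model), dim=-1)
projected = F.normalize(torch.randn(n_image_tokens, d_model), dim=-1)

# For each image token, find the nearest real text tokens by cosine similarity
sims = projected @ embedding_table.T       # (n_image_tokens, vocab_size)
top_sims, top_ids = sims.topk(5, dim=-1)   # 5 nearest token ids per image token
print(top_ids.shape)  # each row approximates what the LLM "sees" for that token
```

Decoding `top_ids` with the model's tokenizer turns each image token into a short list of readable words, which is what makes the optimization target in step 3 interpretable.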
```python
# Alignment loss between projected image tokens and instruction embeddings.
# Assumes: vision_encoder, projection_layer, image_tensor, original, target_embeds.
vision_features = vision_encoder(image_tensor)
projected = projection_layer(vision_features)  # image tokens in LLM embedding space
min_len = min(projected.shape[1], target_embeds.shape[1])
alignment_loss = F.mse_loss(projected[0, :min_len], target_embeds[0, :min_len])
# Perceptual term keeps the adversarial image visually close to the original
total_loss = alignment_loss + 0.1 * F.mse_loss(image_tensor, original)
```

## Dimensionality Reduction Attacks
### Safety Dimension Identification
Embedding-based safety classifiers rely on specific dimensions to separate safe from unsafe content. Identifying and manipulating those dimensions enables evasion.
- Collect safe and unsafe embedding samples -- encode a set of known-safe and known-unsafe texts to build a reference distribution.
- Compute per-dimension mean differences -- for each dimension, calculate the difference between the unsafe and safe means, then rank dimensions by absolute difference.
- Craft evasive embeddings -- shift the target embedding in the "safe" direction along the top safety-discriminative dimensions.
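A minimal sketch of the three steps, using synthetic embeddings with a few planted discriminative dimensions in place of real classifier data:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

# Step 1: reference distributions (unsafe texts shifted along a few dimensions)
safe = rng.normal(size=(200, dim))
unsafe = rng.normal(size=(200, dim))
unsafe[:, [3, 17, 42]] += 2.0  # planted "safety" dimensions

# Step 2: rank dimensions by absolute difference of means
diff = unsafe.mean(axis=0) - safe.mean(axis=0)
top_dims = np.argsort(-np.abs(diff))[:3]

# Step 3: shift a target embedding in the "safe" direction along those dimensions
target = unsafe[0].copy()
evasive = target.copy()
evasive[top_dims] -= diff[top_dims]

print(sorted(top_dims.tolist()))  # recovers the discriminative dimensions
```

With enough reference samples, the noise in the per-dimension means averages out and the discriminative dimensions dominate the ranking.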
### Modality Gap Exploitation
Text and image embeddings often occupy different regions of a shared space, creating a modality gap. Inputs crafted to sit in this gap may evade both text-specific and image-specific classifiers.
```python
# Estimate the modality gap between the text and image embedding clusters
gap_vector = np.mean(image_embs, axis=0) - np.mean(text_embs, axis=0)
gap_center = (np.mean(text_embs, axis=0) + np.mean(image_embs, axis=0)) / 2
# Adversarial inputs near gap_center evade modality-specific detectors
```

## Audio Cross-Modal Attacks
Multimodal models processing audio can be attacked by optimizing audio waveforms so their embeddings match a target text, constrained by a perturbation budget and monitored via signal-to-noise ratio (SNR).
| Parameter | Purpose | Typical Value |
|---|---|---|
| epsilon | Max perturbation amplitude | 0.01 |
| steps | Optimization iterations | 1000 |
| SNR threshold | Minimum audio quality | > 30 dB |
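The SNR constraint from the table can be monitored with a small helper; this is a sketch, with the 440 Hz test tone, sample rate, and uniform noise standing in for a real carrier signal and optimized perturbation.

```python
import numpy as np

def snr_db(clean, perturbed):
    """Signal-to-noise ratio of a perturbation, in dB."""
    noise = perturbed - clean
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)        # 1 s, 440 Hz tone
perturbed = clean + rng.uniform(-0.01, 0.01, size=clean.shape)    # epsilon = 0.01

print(snr_db(clean, perturbed) > 30)  # within the table's quality threshold
```

A perturbation that saturates the epsilon budget everywhere lowers the SNR; in practice the budget is tightened until the SNR stays above the threshold.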
```python
# Audio perturbation core loop.
# Assumes: audio_encoder, audio_tensor (requires_grad=True), original,
# target_text_emb (normalized), epsilon, steps, optimizer.
for step in range(steps):
    optimizer.zero_grad()
    audio_emb = F.normalize(audio_encoder(audio_tensor.unsqueeze(0)))
    loss = 1.0 - F.cosine_similarity(audio_emb, target_text_emb).mean()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        # Keep the perturbation within the amplitude budget
        delta = torch.clamp(audio_tensor - original, -epsilon, epsilon)
        audio_tensor.data = original + delta
```
## Related Topics
- Embedding Space Exploitation -- Foundational embedding geometry and perturbation techniques
- Multimodal Attack Vectors -- Broader multimodal attack taxonomy including document and audio attacks
- RAG Pipeline Exploitation -- Cross-modal poisoning applied to retrieval systems
- Transferable Attacks -- Cross-model transfer principles that apply to cross-modal attacks
## References
- Carlini & Wagner, "Audio Adversarial Examples" (2018)
- Qi et al., "Visual Adversarial Examples Jailbreak Aligned Large Language Models" (2023)
- Liang et al., "Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning" (2022)
- Bagdasaryan et al., "Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs" (2023)
- Schlarmann & Hein, "On the Adversarial Robustness of Multi-Modal Foundation Models" (2023)