Multimodal Embedding Attacks
Exploiting cross-modal embedding models like CLIP — adversarial image-text alignment manipulation, cross-modal injection, and attacks on multimodal retrieval systems.
Multimodal embedding models like CLIP (Contrastive Language-Image Pre-training), SigLIP, and their derivatives map different data types — text, images, audio, video — into a shared embedding space. A text description and a matching image will have similar embeddings, enabling cross-modal search, retrieval, and comparison. These models power image search, content moderation, recommendation systems, and increasingly, multimodal RAG pipelines.
The cross-modal nature of these models creates unique attack surfaces. An attacker can manipulate the relationship between modalities — making an image embed like a different text description, or making text embed like a different image — to subvert systems that rely on cross-modal similarity.
How Cross-Modal Embeddings Work
Shared Embedding Space
Cross-modal models are trained to produce similar embedding vectors for semantically related inputs across modalities. CLIP, for example, was trained on 400 million image-text pairs, learning to map matching images and text descriptions to nearby points in a shared 512- or 768-dimensional space.
The training objective is contrastive: for a batch of image-text pairs, the model learns to maximize the similarity between matching pairs and minimize the similarity between non-matching pairs. This produces an embedding space where cross-modal similarity is meaningful — you can compare a text query against a database of image embeddings to find matching images, or vice versa.
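To make the contrastive objective concrete, here is a dependency-free toy sketch of a symmetric InfoNCE-style loss over a two-pair batch. The vectors are illustrative stand-ins, not real CLIP embeddings, and the temperature value is just a common default, not CLIP's learned parameter.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    # Symmetric InfoNCE: each image should be most similar to its own
    # caption within the batch, and each caption to its own image.
    n = len(img_embs)
    sims = [[cosine(i, t) / temperature for t in txt_embs] for i in img_embs]

    def xent(row, target):
        m = max(row)  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in row]
        return -math.log(exps[target] / sum(exps))

    img_to_txt = sum(xent(sims[i], i) for i in range(n)) / n
    txt_to_img = sum(xent([sims[j][i] for j in range(n)], i) for i in range(n)) / n
    return (img_to_txt + txt_to_img) / 2

# toy batch: pair 0 points one way in the space, pair 1 the other
imgs = [[1.0, 0.1], [0.1, 1.0]]
txts = [[0.9, 0.2], [0.2, 0.9]]
loss_aligned = clip_contrastive_loss(imgs, txts)
loss_shuffled = clip_contrastive_loss(imgs, list(reversed(txts)))
```

Shuffling the captions raises the loss, which is exactly the signal the model trains against: correctly matched pairs sit lower on the objective than mismatched ones.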
Modal Gap
Despite training to align modalities, cross-modal models exhibit a "modal gap" — embeddings of images and embeddings of text occupy different regions of the embedding space, even when they are semantically equivalent. Within a modality, embeddings form tight clusters, but the clusters for different modalities do not perfectly overlap.
This modal gap has security implications. Attacks that operate within a single modality (finding adversarial images that embed close to a target image) are easier than attacks that cross modalities (finding adversarial images that embed close to a target text). However, the modal gap also means that cross-modal similarity scores have a different distribution than within-modal scores, and systems that apply a similarity threshold calibrated on one distribution to the other leave an exploitable margin.
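One simple way to quantify the gap is the distance between the centroids of each modality's unit-normalized embeddings. The sketch below uses made-up three-dimensional vectors; real measurements would use embeddings from an actual model over a paired dataset.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(vectors):
    k = len(vectors)
    return [sum(v[i] for v in vectors) / k for i in range(len(vectors[0]))]

def modal_gap(img_embs, txt_embs):
    # Euclidean distance between the per-modality centroids of
    # unit-normalized embeddings -- a common measure of the gap.
    ci = centroid([normalize(v) for v in img_embs])
    ct = centroid([normalize(v) for v in txt_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, ct)))

# toy data: each modality clusters tightly, but around different directions
img_embs = [[1.0, 0.2, 0.0], [0.9, 0.3, 0.1]]
txt_embs = [[0.2, 1.0, 0.0], [0.3, 0.9, 0.1]]
gap = modal_gap(img_embs, txt_embs)
```

A nonzero gap on semantically matched data is the signature described above: the modalities align in direction but never fully overlap.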
Attack Techniques
Attack 1: Adversarial Image-Text Misalignment
Craft images that visually depict one thing but embed close to a text description of something entirely different. The classic example is an image of a cat that embeds close to the text embedding of "dog" — the image looks like a cat to humans but is treated as dog-related by the embedding model.
This attack uses gradient-based optimization to add perturbations to an image that shift its embedding toward a target text embedding. The perturbations are typically imperceptible to humans (small pixel-level changes distributed across the image) but significantly change the model's interpretation.
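The optimization loop can be sketched with a PGD-style attack. Real attacks backpropagate through CLIP's vision tower; to stay self-contained, this toy uses a fixed linear "encoder" (whose gradient is analytic) and a dot-product similarity objective. All the numbers are invented for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def encode(W, x):
    # stand-in "image encoder": a fixed linear map (a real attack
    # would differentiate through the actual vision model)
    return [dot(row, x) for row in W]

def pgd_toward_text(W, image, target_text_emb, eps=0.1, steps=20, alpha=0.02):
    # PGD loop: step in the sign of the similarity gradient, then
    # project back into the eps-ball around the original pixels so
    # the perturbation stays imperceptibly small
    x = list(image)
    for _ in range(steps):
        # d/dx of dot(encode(x), t) is W^T t for a linear encoder
        grad = [sum(W[j][i] * target_text_emb[j] for j in range(len(W)))
                for i in range(len(x))]
        x = [xi + alpha * (1.0 if g >= 0 else -1.0) for xi, g in zip(x, grad)]
        x = [min(max(xi, oi - eps), oi + eps) for xi, oi in zip(x, image)]
    return x

W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # toy 3-"pixel" -> 2-dim encoder
image = [0.2, 0.8, 0.5]                   # the benign "cat" image
target = [1.0, -1.0]                      # embedding of the text "dog"
adv = pgd_toward_text(W, image, target)
before = dot(encode(W, image), target)
after = dot(encode(W, adv), target)
```

The perturbed image stays within the epsilon-ball of the original (visually unchanged) while its similarity to the target text embedding strictly increases.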
The practical impact depends on the application. In an image search system, adversarial misalignment can cause irrelevant or misleading images to appear in search results. In a content moderation system, it can cause harmful images to be classified as benign. In a multimodal RAG system, it can cause wrong images to be retrieved as context for an LLM.
Attack 2: Typographic Attacks
CLIP and similar models are known to be vulnerable to typographic attacks — images that contain text that influences the model's interpretation. An image of an apple with the text "iPod" overlaid may embed closer to "iPod" than to "apple" because the model learned strong associations between text in images and semantic concepts.
Typographic attacks are particularly effective because they do not require adversarial perturbations — the text is visible in the image. They exploit the model's learned association between visual text and semantic meaning, which is a fundamental feature of models trained on internet data where images frequently contain text.
For content moderation, typographic attacks can cause harmful images to be reclassified by overlaying benign text. For search systems, they can cause images to appear in unrelated searches. For multimodal RAG, they can inject misleading context through images that contain text instructions.
Attack 3: Cross-Modal Prompt Injection
In multimodal RAG systems, images are embedded alongside text documents. An attacker can create images that embed close to specific text queries and contain hidden or visible text with prompt injection payloads.
The attack chain works as follows. The attacker creates an image that contains a prompt injection payload in text (visible or hidden through steganographic techniques). The image is crafted to embed close to a target query. When a user submits the target query, the multimodal retrieval system retrieves the attacker's image. The LLM processes the image, reads the embedded text (via its vision capabilities), and follows the injection instructions.
This is a powerful attack because the injection payload travels through the embedding and retrieval pipeline as an image — a modality that content filters may not analyze for text-based injection patterns.
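The retrieval leg of the chain can be simulated with a tiny nearest-neighbor index. The file names, query, and embeddings below are all hypothetical; the point is that an image crafted to sit near the query embedding outranks the legitimate document and is handed to the LLM as context.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# multimodal index: (item id, embedding in a shared CLIP-style space, modality)
index = [
    ("report.txt", [0.8, 0.2, 0.1], "text"),
    ("chart.png",  [0.1, 0.9, 0.2], "image"),
    ("attack.png", [0.95, 0.1, 0.05], "image"),  # crafted to sit near the query
]

def retrieve(query_emb, k=1):
    return sorted(index, key=lambda item: -cosine(query_emb, item[1]))[:k]

query = [1.0, 0.1, 0.0]  # embedding of the targeted user query
top = retrieve(query)
# the adversarial image wins retrieval; any text payload hidden inside it
# now reaches the LLM through its vision input
```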
Attack 4: Embedding Space Poisoning
In systems where users can contribute images (user-generated content, document uploads, knowledge base contributions), an attacker can poison the embedding space by uploading many images with adversarially manipulated embeddings.
The goal is to saturate specific regions of the embedding space with attacker-controlled content. When a legitimate query maps to that region, the attacker's images dominate the retrieval results. This is the multimodal equivalent of RAG poisoning in text-based systems.
The challenge for the attacker is creating images that pass visual moderation (they must look benign to human reviewers) while embedding in the target region. Adversarial perturbation techniques achieve this by making small, imperceptible changes to benign-looking images that shift their embeddings to the target region.
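The saturation effect is easy to demonstrate on a toy corpus: a handful of poisoned embeddings packed near the target region crowds every legitimate item out of the top-k. The item names and angles are illustrative only.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

target_region = [1.0, 0.0]  # where the targeted queries land

# legitimate images sit at moderate angles from the target region
legit = [("legit%d" % i, [math.cos(a), math.sin(a)])
         for i, a in enumerate([0.4, 0.8, 1.2])]
# attacker uploads several benign-looking images perturbed into the region
poison = [("poison%d" % i, [math.cos(a), math.sin(a)])
          for i, a in enumerate([0.02, 0.05, 0.08, 0.10])]
corpus = legit + poison

def top_k(query, k=3):
    ranked = sorted(corpus, key=lambda item: -cosine(query, item[1]))
    return [name for name, emb in ranked[:k]]

hits = top_k(target_region)
attacker_share = sum(h.startswith("poison") for h in hits) / len(hits)
```

With only four poisoned items, the entire top-3 for the target query is attacker-controlled, which is the multimodal analogue of text RAG poisoning.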
Attack 5: Cross-Modal Transfer Attacks
Adversarial perturbations discovered in one modality can sometimes transfer through the shared embedding space to influence another. An adversarial text string that embeds in a specific region of the shared space can be used to understand what visual features correspond to that region, and vice versa.
This cross-modal transfer enables attacks where the attacker only has access to one modality (text) but wants to influence the other (image retrieval). By crafting text inputs that map to the embedding region where target images reside, the attacker can influence image retrieval through text manipulation alone.
Applications Under Attack
Image Search Systems
Image search systems that use cross-modal embeddings (search for images using text queries) are vulnerable to result manipulation. Adversarial images can be injected into the image corpus to appear in search results for unrelated queries. The impact ranges from nuisance (irrelevant results) to severe (manipulating image search in medical, legal, or safety-critical applications).
Content Moderation
Content moderation systems that use CLIP embeddings for classification (is this image harmful?) are vulnerable to adversarial misalignment. An attacker can craft images that visually contain harmful content but embed in the "safe" region of the embedding space, bypassing automated moderation. Conversely, an attacker can cause benign images to be flagged by shifting their embeddings toward harmful content regions.
Multimodal RAG
Multimodal RAG systems retrieve both text and images as context for LLM generation. All text-based RAG poisoning techniques apply, plus the additional attack surface of image-based injection. Images can carry prompt injection payloads through embedded text, steganographic content, or adversarially crafted visual features that influence the LLM's interpretation.
Recommendation Systems
Recommendation systems that use cross-modal embeddings to match users with content are vulnerable to embedding manipulation. By crafting content that embeds in specific regions of the space, attackers can influence which content is recommended to which users.
Defense Strategies
Multi-Model Ensemble
Use multiple embedding models with different architectures for critical decisions. An adversarial input crafted for CLIP may not fool SigLIP or BLIP-2. If multiple models agree on the classification or retrieval result, the confidence is higher. If they disagree, flag the input for manual review.
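A minimal sketch of the ensemble gate, assuming each model exposes a moderation decision. The classifier stubs below are hypothetical stand-ins for CLIP, SigLIP, and BLIP-2 heads; the adversarial image fools only the model it was optimized against, so the disagreement routes it to review.

```python
def ensemble_moderation(image, classifiers, agreement=1.0):
    # classify with several embedding models; escalate to human
    # review whenever they fall short of the required agreement
    votes = [classify(image) for classify in classifiers.values()]
    top = max(set(votes), key=votes.count)
    if votes.count(top) / len(votes) < agreement:
        return "review", votes
    return top, votes

# hypothetical per-model moderation heads with different architectures
classifiers = {
    "clip":   lambda img: "safe" if img == "adv.png" else "unsafe",
    "siglip": lambda img: "unsafe",
    "blip2":  lambda img: "unsafe",
}
decision, votes = ensemble_moderation("adv.png", classifiers)
```

Requiring unanimity (`agreement=1.0`) maximizes adversarial robustness at the cost of more manual reviews; lowering the threshold trades the two off.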
Perturbation Detection
Adversarial image perturbations have statistical properties that differ from natural image noise. Detectors trained to distinguish adversarial perturbations from natural variations can flag potentially manipulated images before they enter the embedding pipeline.
Text-in-Image Detection
For applications where typographic attacks are a concern, implement explicit text detection in images using OCR. If detected text influences the image's classification in unexpected ways, flag the image for review. This breaks the typographic attack by handling visual text through a dedicated pipeline rather than through the embedding model.
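One way to operationalize this check, sketched below under the assumption that a separate OCR step has extracted the in-image text and a visual classifier has assigned a label: flag the image when its embedding sits closer to the text written on it than to what it visually depicts. All embeddings here are toy two-dimensional stand-ins.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def typographic_flag(image_emb, visual_label_emb, ocr_text_emb, margin=0.0):
    # The typographic-attack signature: the OCR'd text (embedded with the
    # model's text encoder) pulls harder on the image embedding than the
    # label assigned by an independent visual classifier does.
    return cosine(image_emb, ocr_text_emb) > cosine(image_emb, visual_label_emb) + margin

# toy space: a photo of an apple with "iPod" written on it drifts toward "iPod"
apple_text = [1.0, 0.0]
ipod_text = [0.0, 1.0]
apple_with_ipod_sticker = [0.4, 0.9]
flag = typographic_flag(apple_with_ipod_sticker, apple_text, ipod_text)
```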
Cross-Modal Consistency Checking
When both modalities are available (an image with a caption, or a document with embedded images), check that the embeddings of both modalities are consistent. An image whose embedding is distant from its caption's embedding may have been adversarially manipulated.
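The consistency check itself is a one-line similarity threshold; the subtlety is calibration. A sketch, with invented embeddings:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def consistent(image_emb, caption_emb, min_sim=0.2):
    # min_sim should be calibrated on known-good image-caption pairs,
    # because the modal gap keeps even matching cross-modal
    # similarities well below 1.0
    return cosine(image_emb, caption_emb) >= min_sim

ok = consistent([0.9, 0.3], [0.8, 0.4])        # caption matches the image
suspect = consistent([-0.9, 0.2], [0.8, 0.4])  # embedding far from caption
```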
Embedding Space Monitoring
Monitor the embedding space for anomalous patterns. Clusters of new embeddings in unexpected regions, sudden changes in embedding distributions, or embeddings that fall exactly on the boundary of classification regions are all potential indicators of adversarial manipulation.
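A minimal monitoring sketch, using a z-score on each new embedding's distance to the historical centroid. Production systems would track per-cluster densities and distribution drift rather than a single centroid; the vectors here are illustrative.

```python
import math

def monitor_new_embeddings(history, new_batch, z_threshold=3.0):
    # flag new embeddings whose distance to the historical centroid is
    # anomalous relative to the distances seen so far
    dim = len(history[0])
    centroid = [sum(v[i] for v in history) / len(history) for i in range(dim)]

    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, centroid)))

    dists = [dist(v) for v in history]
    mu = sum(dists) / len(dists)
    sd = math.sqrt(sum((d - mu) ** 2 for d in dists) / len(dists)) or 1e-9
    return [v for v in new_batch if (dist(v) - mu) / sd > z_threshold]

# historical embeddings cluster near [1, 0]; the second new embedding
# lands in an unexpected region and is flagged
history = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [1.0, 0.1]]
flags = monitor_new_embeddings(history, [[1.0, 0.05], [5.0, 5.0]])
```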
Cross-modal embedding models are powerful tools, but their shared embedding space creates unique attack surfaces. Any system that relies on cross-modal similarity for security-relevant decisions — moderation, retrieval, classification — should be assessed for the attack techniques described in this page.