Cross-Modal Attack Strategies
Overview of attack strategies that exploit the boundaries between input modalities in multimodal AI systems, including vision-language, audio-text, and document processing pipelines.
Cross-modal attacks exploit the seams between input modalities in multimodal AI systems. Where single-modality attacks target the model's processing of text, images, or audio individually, cross-modal attacks target the translation, fusion, and interpretation that happens when information crosses from one modality to another.
Multimodal AI Architecture and Attack Surface
Image Input ──→ Vision Encoder ──→ Projection Layer ──┐
├──→ LLM ──→ Output
Text Input ──→ Tokenizer ──→ Text Embeddings ────────┘ ↑
│
Audio Input ──→ Audio Encoder ──→ Audio Projection ───────────┘
│
Document ──→ OCR/Parser ──→ Text Extraction ──────────────────┘
Attack surfaces:
1. Vision encoder interpretation (adversarial images)
2. Projection layer alignment (modality bridging)
3. OCR/parser trust (document injection)
4. Cross-modal context confusion (information leakage)
5. Modality priority conflicts (which input "wins")
Cross-Modal Attack Taxonomy
| Attack Category | Source Modality | Target Effect | Example |
|---|---|---|---|
| Visual prompt injection | Image | Override text instructions | Text rendered in image overrides system prompt |
| Modality bridging | Image/Audio | Bypass text safety filters | Harmful instruction in image bypasses text-only filter |
| Cross-modal jailbreak | Image + Text | Combined jailbreak | Image provides context that makes text jailbreak succeed |
| Information leakage | Text | Exfiltrate via image/audio description | Model reveals system prompt when describing an image |
| Document injection | PDF/Document | Inject via OCR pipeline | Hidden text in PDF parsed by OCR, sent to LLM |
| Modality confusion | Mixed | Misattribution of content source | Model cannot distinguish user text from OCR-extracted text |
Trust Boundary Analysis
The critical security property that cross-modal attacks violate is modality trust equivalence -- the assumption that content from all modalities should be treated with the same trust level.
Trust Levels by Modality
| Modality | Typical Trust Level | Why This Is Dangerous |
|---|---|---|
| System prompt (text) | Highest -- developer-controlled | Correct assumption |
| User text input | Low -- untrusted | Usually filtered |
| Image content | Medium -- "data, not instructions" | Wrong assumption: images can contain instructions |
| OCR-extracted text | Medium-High -- treated as "document content" | Wrong assumption: documents can contain injections |
| Audio transcript | Medium -- treated as user speech | Depends on transcription fidelity |
| Tool/API output | Medium-High -- treated as "system data" | Can be attacker-influenced |
Assessment Methodology
Modality Inventory
Enumerate all input modalities the target system accepts: text, images, audio, video, documents (PDF, DOCX, XLSX), structured data (CSV, JSON). For each, map the processing pipeline.
Trust Boundary Mapping
For each modality, determine: (a) how content is extracted/preprocessed, (b) what trust level the LLM assigns to it, (c) whether safety filters apply before or after modality conversion.
Cross-Modal Injection Testing
For each modality pair, test whether instructions in one modality can influence behavior in another. Start with the highest-impact pairs: image-to-text, document-to-text, audio-to-text.
Filter Bypass Verification
Verify whether content that would be blocked in the text modality passes through when encoded in a different modality. Test the same payloads in text form (should be blocked) and in image/audio/document form.
Information Leakage Probing
Test whether the model leaks information from one modality context when processing another. Example: does describing a user-uploaded image cause the model to reveal system prompt content?
Chained Attack Development
Combine cross-modal techniques into multi-step attack chains that exploit multiple trust boundaries in sequence.
Attack Complexity and Skill Requirements
| Attack Type | Skill Level | Tools Needed | Success Rate (Typical) |
|---|---|---|---|
| Text-in-image injection | Intermediate | Image editor | 60-80% on VLMs without image-input filtering |
| Adversarial perturbation images | Expert | PyTorch, optimization toolkit | 40-70% (model-specific) |
| Document OCR injection | Intermediate | PDF editor, font manipulation | 70-90% on unfiltered pipelines |
| Audio injection via transcription | Advanced | Audio editing, TTS | 30-50% (transcription quality dependent) |
| Multi-modal jailbreak chains | Expert | Multiple tools | 20-40% but high impact |
Section Overview
This section covers cross-modal attack strategies in depth:
- Modality-Bridging Injection Attacks -- Techniques for encoding payloads in one modality to bypass defenses in another
- Multimodal Jailbreaking Techniques -- Combined multi-modal approaches to bypass safety alignment
- Cross-Modal Information Leakage -- Extracting sensitive information through modality boundary violations
- Document & PDF Processing Attacks -- Exploiting document parsing and OCR pipelines
- Lab: Multi-Modal Attack Chain -- Hands-on exercises combining multiple cross-modal techniques
For single-modality visual attacks, see Image-Based Prompt Injection and VLM Architecture & Alignment.
Related Topics
- Vision-Language Model Attacks - VLM-specific attack techniques that extend into cross-modal scenarios
- Audio Model Attack Surface - Audio-specific attacks that combine with visual and text modalities
- Prompt Injection Fundamentals - Foundational injection techniques underlying cross-modal attacks
- Defense Landscape Overview - Defensive strategies for multimodal systems
- Video Model Attacks - Temporal dimension attacks that span multiple modalities
References
- "Visual Adversarial Examples Jailbreak Aligned Large Language Models" - Qi et al. (2023) - Demonstrates how adversarial images can bypass LLM safety alignment in multimodal models
- "(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs" - Bagdasaryan et al. (2023) - Systematic study of cross-modal injection vectors in multimodal AI systems
- "On the Robustness of Multi-Modal LLMs to Image and Text Perturbations" - Wang et al. (2024) - Evaluation of multimodal model robustness across modality boundaries
- "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" - Greshake et al. (2023) - Foundational work on indirect injection through non-text modalities
What is the fundamental vulnerability that cross-modal attacks exploit?