Multimodal Security
Security assessment of multimodal AI systems processing images, audio, video, and cross-modal inputs, covering vision-language models, speech systems, video analysis, and cross-modal attack techniques.
Modern AI systems increasingly process multiple types of input simultaneously. Vision-language models (VLMs) analyze images alongside text. Speech-to-text systems convert audio into language model inputs. Video understanding systems process temporal sequences of frames. Document processing combines OCR, layout analysis, and text extraction. Each additional modality adds an input channel that can carry adversarial content, and the interactions between modalities create attack surfaces that are qualitatively different from those in text-only systems.
The security implications of multimodal processing are profound. Text-based defenses -- input filters, blocklists, semantic classifiers -- operate on text and typically ignore other modalities entirely. An attacker who embeds instructions in an image, audio clip, or video frame bypasses the entire text-focused defensive stack. The model processes these non-text inputs with the same language understanding capabilities it applies to text, but without the defensive scrutiny. This asymmetry between where defenses are deployed and where attacks can originate is the fundamental vulnerability in multimodal systems.
How Multimodal Processing Creates Vulnerabilities
The core vulnerability in multimodal systems is the convergence of multiple input channels into a shared representation space. When a VLM processes an image, it converts visual content into the same type of token representations that text produces. This means text embedded in images -- whether visible to humans or hidden through adversarial perturbations -- is processed by the language model as if it were direct text input, but without passing through text-focused input filters.
Typographic attacks exploit this by placing text instructions directly in images. The model's OCR capabilities read this text and incorporate it into its reasoning. A seemingly innocent image of a document, whiteboard, or screen capture can contain injected instructions that override the system prompt. These attacks are trivial to execute, require no technical sophistication, and work reliably against current VLMs.
Adversarial perturbations are more sophisticated. Rather than placing visible text in images, these attacks modify pixel values in ways imperceptible to humans but meaningful to the model. A photograph that looks completely normal to a human reviewer can carry an embedded instruction that the model follows. Generating effective perturbations requires access to the model's visual encoder (or a transferable surrogate), but the resulting attacks are nearly impossible to detect through human review.
Audio attacks exploit speech recognition pipelines. Adversarial audio can embed commands that speech-to-text systems transcribe but human listeners cannot perceive. Voice cloning can impersonate authorized users in voice-authenticated systems. These attacks are particularly concerning for voice-controlled AI agents that take actions based on spoken commands.
Video attacks add the temporal dimension. Frame injection embeds adversarial content into specific frames of a video that the model processes but a human viewer would need to pause to notice. Temporal manipulation exploits how models sample and process video sequences, potentially causing them to focus on attacker-controlled frames while ignoring legitimate content.
Cross-Modal Attack Chains
The most powerful multimodal attacks chain vulnerabilities across modalities. A document containing both text and images can use the image channel to inject instructions that override the text content. A video with an audio track can combine visual and auditory adversarial signals. These cross-modal attacks are harder to defend against because they require coordinated detection across all input channels simultaneously.
Cross-modal attacks also exploit information leakage between modalities. When a model processes an image and generates text about it, the text output can reveal information about the image's content in ways that bypass output filters designed for direct questions. This type of indirect information extraction is a growing concern for systems that process sensitive visual content.
What You'll Learn in This Section
- Vision-Language Models -- VLM architecture and alignment, image injection techniques, OCR and typographic attacks, adversarial image generation, and VLM-specific jailbreaks
- Audio & Speech Models -- Speech recognition vulnerabilities, adversarial audio generation, voice cloning risks, and practical audio attack techniques
- Video & Temporal Models -- Video understanding vulnerabilities, temporal manipulation, video frame injection, and attacks against video processing pipelines
- Cross-Modal Attacks -- Document-based attacks, multimodal jailbreaks, modality bridging techniques, information leakage across modalities, text-to-image attacks, and multimodal defense evaluation
Prerequisites
This section builds on several foundational topics:
- Prompt injection fundamentals from the Prompt Injection section -- multimodal attacks extend injection to non-text channels
- Embeddings knowledge from Embeddings & Vector Systems -- understanding how visual and text embeddings share representation spaces
- Basic image processing -- Familiarity with image formats, pixel manipulation, and basic computer vision concepts
- Python tooling -- NumPy, PIL/Pillow, and basic ML libraries for generating adversarial examples