What is Audio & Speech Models?

Overview of audio model security, including attacks on Whisper, speech-to-text systems, voice assistants, and the audio processing pipeline.

What is Cross-Modal Attacks?

Overview of attack strategies that exploit the boundaries between input modalities in multimodal AI systems, including vision-language, audio-text, and document processing pipelines.

What is Video & Temporal Models?

Video understanding model security, frame-level vs temporal attacks, how video models process sequences, and the complete attack surface overview.

What is Vision-Language Models?

Comprehensive overview of the VLM attack surface, how vision encoders connect to language models, and why multimodal systems create new injection vectors.

What is Image-Based Prompt Injection Techniques?

Techniques for embedding adversarial prompts in images consumed by vision-language models.

What is Adversarial Image Perturbation for VLMs?

Generating adversarial perturbations that cause vision-language models to misinterpret or follow injected instructions.

What is Audio-Based Injection Attacks?

Attacking speech-to-text and audio-language models through adversarial audio crafting.

What is Document Parsing Exploitation?

Exploiting PDF, DOCX, and other document parsers in multimodal AI systems for injection and data extraction.

What is Video Frame Injection?

Injecting adversarial content into video frames processed by video-understanding AI models.

What is OCR Adversarial Attacks?

Crafting images that cause OCR systems to extract adversarial text for downstream injection.

Multimodal Security

beginner5 min readUpdated 2026-03-15

Security assessment of multimodal AI systems processing images, audio, video, and cross-modal inputs, covering vision-language models, speech systems, video analysis, and cross-modal attack techniques.

multimodal vision audio video cross-modal vlm adversarial

Modern AI systems increasingly process multiple types of input simultaneously. Vision-language models (VLMs) analyze images alongside text. Speech-to-text systems convert audio into language model inputs. Video understanding systems process temporal sequences of frames. Document processing combines OCR, layout analysis, and text extraction. Each additional modality adds an input channel that can carry adversarial content, and the interactions between modalities create attack surfaces that are qualitatively different from those in text-only systems.

The security implications of multimodal processing are profound. Text-based defenses -- input filters, blocklists, semantic classifiers -- operate on text and typically ignore other modalities entirely. An attacker who embeds instructions in an image, audio clip, or video frame bypasses the entire text-focused defensive stack. The model processes these non-text inputs with the same language understanding capabilities it applies to text, but without the defensive scrutiny. This asymmetry between where defenses are deployed and where attacks can originate is the fundamental vulnerability in multimodal systems.

How Multimodal Processing Creates Vulnerabilities

The core vulnerability in multimodal systems is the convergence of multiple input channels into a shared representation space. When a VLM processes an image, it converts visual content into the same type of token representations that text produces. This means text embedded in images -- whether visible to humans or hidden through adversarial perturbations -- is processed by the language model as if it were direct text input, but without passing through text-focused input filters.

Typographic attacks exploit this by placing text instructions directly in images. The model's OCR capabilities read this text and incorporate it into its reasoning. A seemingly innocent image of a document, whiteboard, or screen capture can contain injected instructions that override the system prompt. These attacks are trivial to execute, require no technical sophistication, and work reliably against current VLMs.

Adversarial perturbations are more sophisticated. Rather than placing visible text in images, these attacks modify pixel values in ways imperceptible to humans but meaningful to the model. A photograph that looks completely normal to a human reviewer can carry an embedded instruction that the model follows. Generating effective perturbations requires access to the model's visual encoder (or a transferable surrogate), but the resulting attacks are nearly impossible to detect through human review.

Audio attacks exploit speech recognition pipelines. Adversarial audio can embed commands that speech-to-text systems transcribe but human listeners cannot perceive. Voice cloning can impersonate authorized users in voice-authenticated systems. These attacks are particularly concerning for voice-controlled AI agents that take actions based on spoken commands.

Video attacks add the temporal dimension. Frame injection embeds adversarial content into specific frames of a video that the model processes but a human viewer would need to pause to notice. Temporal manipulation exploits how models sample and process video sequences, potentially causing them to focus on attacker-controlled frames while ignoring legitimate content.

The most powerful multimodal attacks chain vulnerabilities across modalities. A document containing both text and images can use the image channel to inject instructions that override the text content. A video with an audio track can combine visual and auditory adversarial signals. These cross-modal attacks are harder to defend against because they require coordinated detection across all input channels simultaneously.

Cross-modal attacks also exploit information leakage between modalities. When a model processes an image and generates text about it, the text output can reveal information about the image's content in ways that bypass output filters designed for direct questions. This type of indirect information extraction is a growing concern for systems that process sensitive visual content.

What You'll Learn in This Section

Vision-Language Models -- VLM architecture and alignment, image injection techniques, OCR and typographic attacks, adversarial image generation, and VLM-specific jailbreaks
Audio & Speech Models -- Speech recognition vulnerabilities, adversarial audio generation, voice cloning risks, and practical audio attack techniques
Video & Temporal Models -- Video understanding vulnerabilities, temporal manipulation, video frame injection, and attacks against video processing pipelines
Cross-Modal Attacks -- Document-based attacks, multimodal jailbreaks, modality bridging techniques, information leakage across modalities, text-to-image attacks, and multimodal defense evaluation

Prerequisites

This section builds on several foundational topics:

Prompt injection fundamentals from the Prompt Injection section -- multimodal attacks extend injection to non-text channels
Embeddings knowledge from Embeddings & Vector Systems -- understanding how visual and text embeddings share representation spaces
Basic image processing -- Familiarity with image formats, pixel manipulation, and basic computer vision concepts
Python tooling -- NumPy, PIL/Pillow, and basic ML libraries for generating adversarial examples

Multimodal Security

How Multimodal Processing Creates Vulnerabilities

What You'll Learn in This Section

Prerequisites

Learning Path

Multimodal Security

How Multimodal Processing Creates Vulnerabilities

What You'll Learn in This Section

Prerequisites

Learning Path

Multimodal Security

Learning Path

Related articles

Multimodal Security

Learning Path

Related articles