Audio Model Attack Surface
Overview of audio model security, including attacks on Whisper, speech-to-text systems, voice assistants, and the audio processing pipeline.
Audio AI Systems Under Attack
Audio-capable AI systems are deployed across consumer devices, enterprise tools, and critical infrastructure. Voice assistants process billions of commands daily. Speech-to-text systems handle sensitive conversations. Audio understanding models classify and react to environmental sounds. Each represents an attack surface.
The Audio Processing Pipeline
Most audio AI systems follow a common pipeline:
┌──────────┐ ┌──────────────┐ ┌─────────────┐ ┌──────────┐
│ Audio │ │ Feature │ │ Model │ │ Output │
│ Capture │───▶│ Extraction │───▶│ Inference │───▶│ Action │
│ (mic) │ │ (MFCC/mel) │ │ (ASR/NLU) │ │ │
└──────────┘ └──────────────┘ └─────────────┘ └──────────┘
│ │ │ │
Physical Signal Proc. Model-level Semantic
attacks attacks attacks attacks
Layer 1: Audio Capture
The microphone and analog-to-digital converter introduce the first attack surface. Ultrasonic frequencies above human hearing (~20kHz) can be captured by microphones and potentially interpreted by models.
Layer 2: Feature Extraction
Audio signals are converted to spectral features -- typically MFCCs or mel spectrograms. This transformation is lossy and non-invertible, which both limits and enables certain attacks.
Layer 3: Model Inference
The core model (Whisper, wav2vec2, or an end-to-end system) processes features to produce transcriptions, classifications, or embeddings. This is where adversarial perturbation attacks operate.
Layer 4: Output and Action
The model's output feeds into downstream systems -- a virtual assistant executing commands, a transcription service, or an LLM processing speech-to-text input. Attacks at this layer exploit the semantic gap between what was said and what the system understood.
Attack Taxonomy
| Category | Target | Example | Threat Level |
|---|---|---|---|
| Adversarial audio | ASR model | Perturbation that transcribes as injected text | High |
| Hidden voice commands | Voice assistant | Ultrasonic or obfuscated commands | High |
| Voice cloning | Speaker verification | Synthetic voice bypassing authentication | Critical |
| Audio prompt injection | LLM via speech-to-text | Injected instructions in audio input | High |
| Denial of service | Any audio model | Noise patterns that cause crashes or infinite loops | Medium |
| Eavesdropping via model | Model side-channels | Extracting information from model behavior | Medium |
Key Audio AI Systems
Whisper (OpenAI)
Whisper is the dominant open-source ASR model. Its architecture (encoder-decoder transformer on mel spectrograms) is well understood and extensively studied for adversarial vulnerabilities.
import whisper
# Standard Whisper pipeline
model = whisper.load_model("base")
result = model.transcribe("audio.wav")
print(result["text"])
# The attack surface: what if audio.wav contains adversarial perturbations
# that cause 'result["text"]' to contain injected instructions?Voice Assistants (Siri, Alexa, Google Assistant)
Voice assistants combine ASR with natural language understanding (NLU) and action execution. The pipeline from speech to action means a successful audio attack can trigger real-world actions -- making purchases, unlocking doors, or sending messages.
Audio-Capable LLMs
Models like GPT-4o and Gemini can directly process audio input, bypassing the traditional ASR pipeline. This creates new attack vectors where adversarial audio can directly influence the language model's reasoning.
Audio vs. Visual Attacks: Key Differences
| Dimension | Audio Attacks | Visual Attacks |
|---|---|---|
| Delivery | Can be over-the-air (physical) | Typically requires digital access |
| Persistence | Transient (sound fades) | Persistent (image stays) |
| Imperceptibility | Harder -- humans are sensitive to audio anomalies | Easier -- small pixel changes are invisible |
| Bandwidth | Lower (1D signal, limited frequency range) | Higher (2D, 3 channels, millions of pixels) |
| Environmental factors | Affected by noise, distance, reverb | Affected by lighting, resolution, compression |
| Real-world deployment | Easier (just play the audio) | Harder (need to control visual input) |
Real-World Attack Scenarios
Scenario 1: Meeting Transcription Poisoning
An attacker joins a video call and plays inaudible adversarial audio through their microphone. The meeting transcription AI produces a transcript containing injected text that was never spoken.
Scenario 2: Voice Assistant Hijacking
A YouTube video or advertisement contains hidden voice commands. When played on a device near a voice assistant, it triggers actions without the user's knowledge.
Scenario 3: Voice Authentication Bypass
An attacker uses a cloned voice to authenticate to a banking system's voice verification, gaining access to another user's account.
Scenario 4: Audio-to-LLM Prompt Injection
In a system where voice input is transcribed and fed to an LLM, the attacker crafts audio that transcribes as a prompt injection payload, hijacking the LLM's behavior.
Section Roadmap
| Page | Focus |
|---|---|
| Speech Recognition Attacks | Attacking ASR systems and hidden voice commands |
| Adversarial Audio Examples | Crafting adversarial perturbations for audio models |
| Voice Cloning & Deepfake Audio | Voice cloning for authentication bypass |
| Lab: Audio Adversarial Examples | Hands-on crafting of adversarial audio |
Related Topics
- Vision-Language Model Attacks -- parallel attack concepts in the visual domain
- Cross-Modal Attack Strategies -- attacks bridging audio and other modalities
- Modality-Bridging Injection Attacks -- audio-to-text injection chains
References
- "Carlini & Wagner: Audio Adversarial Examples" - Carlini & Wagner (2018) - Foundational work on targeted adversarial audio attacks against speech recognition
- "DolphinAttack: Inaudible Voice Commands" - Zhang et al. (2017) - Ultrasonic voice command injection exploiting microphone nonlinearity
- "SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models" - Ghosh et al. (2024) - Analysis of audio attack surfaces in modern multimodal LLMs
- "Robust Audio Adversarial Example for a Physical Attack" - Yakura & Sakuma (2019) - Over-the-air adversarial audio attack methodology
What unique property of audio attacks makes them particularly dangerous for deployed AI systems compared to visual attacks?