What is Speech Recognition Attacks?

Attacking automatic speech recognition systems including adversarial audio that transcribes differently than heard, hidden voice commands, and background audio injection.

What is Adversarial Audio?

Techniques for crafting adversarial audio perturbations including psychoacoustic hiding, frequency domain attacks, and over-the-air adversarial audio.

What is Voice Cloning?

Voice cloning for social engineering against AI systems, voice authentication bypass, speaker verification attacks, and detection techniques.

What is Lab: Audio Adversarial?

Hands-on lab creating adversarial audio examples using Python audio processing, targeting Whisper transcription with injected text.

What is Audio Modality Attacks?

Comprehensive attack taxonomy for audio-enabled LLMs: adversarial audio generation, voice-based prompt injection, cross-modal split attacks, and ultrasonic perturbations.

Audio Model Attack Surface

advanced7 min readUpdated 2026-03-13

Overview of audio model security, including attacks on Whisper, speech-to-text systems, voice assistants, and the audio processing pipeline.

audio speech multimodal attack-surface

Audio AI Systems Under Attack

Audio-capable AI systems are deployed across consumer devices, enterprise tools, and critical infrastructure. Voice assistants process billions of commands daily. Speech-to-text systems handle sensitive conversations. Audio understanding models classify and react to environmental sounds. Each represents an attack surface.

The Audio Processing Pipeline

Most audio AI systems follow a common pipeline:

┌──────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────┐
│  Audio    │    │  Feature     │    │   Model     │    │  Output  │
│  Capture  │───▶│  Extraction  │───▶│  Inference  │───▶│  Action  │
│  (mic)    │    │  (MFCC/mel)  │    │  (ASR/NLU)  │    │          │
└──────────┘    └──────────────┘    └─────────────┘    └──────────┘
     │                │                    │                 │
  Physical        Signal Proc.         Model-level       Semantic
  attacks         attacks              attacks            attacks

Layer 1: Audio Capture

The microphone and analog-to-digital converter introduce the first attack surface. Ultrasonic frequencies above human hearing (~20kHz) can be captured by microphones and potentially interpreted by models.

Layer 2: Feature Extraction

Audio signals are converted to spectral features -- typically MFCCs or mel spectrograms. This transformation is lossy and non-invertible, which both limits and enables certain attacks.

Layer 3: Model Inference

The core model (Whisper, wav2vec2, or an end-to-end system) processes features to produce transcriptions, classifications, or embeddings. This is where adversarial perturbation attacks operate.

Layer 4: Output and Action

The model's output feeds into downstream systems -- a virtual assistant executing commands, a transcription service, or an LLM processing speech-to-text input. Attacks at this layer exploit the semantic gap between what was said and what the system understood.

Attack Taxonomy

Category	Target	Example	Threat Level
Adversarial audio	ASR model	Perturbation that transcribes as injected text	High
Hidden voice commands	Voice assistant	Ultrasonic or obfuscated commands	High
Voice cloning	Speaker verification	Synthetic voice bypassing authentication	Critical
Audio prompt injection	LLM via speech-to-text	Injected instructions in audio input	High
Denial of service	Any audio model	Noise patterns that cause crashes or infinite loops	Medium
Eavesdropping via model	Model side-channels	Extracting information from model behavior	Medium

Key Audio AI Systems

Whisper (OpenAI)

Whisper is the dominant open-source ASR model. Its architecture (encoder-decoder transformer on mel spectrograms) is well understood and extensively studied for adversarial vulnerabilities.

import whisper
 
# Standard Whisper pipeline
model = whisper.load_model("base")
result = model.transcribe("audio.wav")
print(result["text"])
 
# The attack surface: what if audio.wav contains adversarial perturbations
# that cause 'result["text"]' to contain injected instructions?

Voice Assistants (Siri, Alexa, Google Assistant)

Voice assistants combine ASR with natural language understanding (NLU) and action execution. The pipeline from speech to action means a successful audio attack can trigger real-world actions -- making purchases, unlocking doors, or sending messages.

Audio-Capable LLMs

Models like GPT-4o and Gemini can directly process audio input, bypassing the traditional ASR pipeline. This creates new attack vectors where adversarial audio can directly influence the language model's reasoning.

Audio vs. Visual Attacks: Key Differences

Dimension	Audio Attacks	Visual Attacks
Delivery	Can be over-the-air (physical)	Typically requires digital access
Persistence	Transient (sound fades)	Persistent (image stays)
Imperceptibility	Harder -- humans are sensitive to audio anomalies	Easier -- small pixel changes are invisible
Bandwidth	Lower (1D signal, limited frequency range)	Higher (2D, 3 channels, millions of pixels)
Environmental factors	Affected by noise, distance, reverb	Affected by lighting, resolution, compression
Real-world deployment	Easier (just play the audio)	Harder (need to control visual input)

Page	Focus
Speech Recognition Attacks	Attacking ASR systems and hidden voice commands
Adversarial Audio Examples	Crafting adversarial perturbations for audio models
Voice Cloning & Deepfake Audio	Voice cloning for authentication bypass
Lab: Audio Adversarial Examples	Hands-on crafting of adversarial audio

Vision-Language Model Attacks -- parallel attack concepts in the visual domain
Cross-Modal Attack Strategies -- attacks bridging audio and other modalities
Modality-Bridging Injection Attacks -- audio-to-text injection chains

References

"Carlini & Wagner: Audio Adversarial Examples" - Carlini & Wagner (2018) - Foundational work on targeted adversarial audio attacks against speech recognition
"DolphinAttack: Inaudible Voice Commands" - Zhang et al. (2017) - Ultrasonic voice command injection exploiting microphone nonlinearity
"SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models" - Ghosh et al. (2024) - Analysis of audio attack surfaces in modern multimodal LLMs
"Robust Audio Adversarial Example for a Physical Attack" - Yakura & Sakuma (2019) - Over-the-air adversarial audio attack methodology

Knowledge Check

What unique property of audio attacks makes them particularly dangerous for deployed AI systems compared to visual attacks?

Audio Model Attack Surface

Learning Path

Related articles

Audio Model Attack Surface

Learning Path

Related articles