Speech Recognition Attacks
Attacking automatic speech recognition (ASR) systems: adversarial audio that transcribes differently from what humans hear, hidden voice commands, and background-audio injection.
How ASR Systems Work (and Break)
ASR systems convert audio waveforms to text. Modern systems use either a pipeline approach (feature extraction then sequence model) or end-to-end neural networks. Both are vulnerable.
Audio Waveform
│
▼
┌──────────────┐
│ Mel Spectrogram │ ← Frequency-domain representation
└──────────────┘
│
▼
┌──────────────┐
│ Encoder │ ← Extracts audio features
│ (Transformer) │
└──────────────┘
│
▼
┌──────────────┐
│ Decoder │ ← Generates text token by token
│ (Transformer) │
└──────────────┘
│
▼
Text Output
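The front end in the diagram can be illustrated with a toy log-mel spectrogram built from a short-time Fourier transform. The frame size, hop length, and filter count below are illustrative defaults, not the exact settings of any production model:

```python
import numpy as np

def log_mel_spectrogram(
    audio: np.ndarray,
    sample_rate: int = 16000,
    n_fft: int = 400,     # 25 ms frames at 16 kHz
    hop: int = 160,       # 10 ms hop
    n_mels: int = 40,
) -> np.ndarray:
    """Toy log-mel front end: STFT power -> mel filterbank -> log."""
    # Frame the signal with a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + max(0, (len(audio) - n_fft) // hop)
    frames = np.stack([
        audio[i * hop:i * hop + n_fft] * window for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # [frames, n_fft//2+1]

    # Triangular filters spaced evenly on the mel scale
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sample_rate / 2), n_mels + 2))
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fbank = np.zeros((n_mels, len(bin_freqs)))
    for m in range(n_mels):
        left, center, right = mel_pts[m], mel_pts[m + 1], mel_pts[m + 2]
        up = (bin_freqs - left) / (center - left)
        down = (right - bin_freqs) / (right - center)
        fbank[m] = np.clip(np.minimum(up, down), 0, None)

    mel = power @ fbank.T  # [frames, n_mels]
    return np.log(mel + 1e-10)
```

Adversarial audio attacks typically backpropagate through exactly this kind of front end, which is why it matters that it is differentiable.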
Hidden Voice Commands
Hidden voice commands exploit the difference between what humans hear and what machines transcribe.
Ultrasonic Attacks (DolphinAttack)
Humans cannot hear frequencies above approximately 20kHz, but microphone hardware still captures ultrasonic signals. Nonlinearity in the microphone's amplifier and analog front end demodulates an amplitude-modulated ultrasonic command back into the audible band before digitization, so the ASR model receives an ordinary-sounding command that no human in the room heard.
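The demodulation step can be demonstrated numerically. The sketch below passes an AM signal through a quadratic nonlinearity (a crude stand-in for real amplifier nonlinearity) and low-pass filters the result; the coefficients are illustrative:

```python
import numpy as np

# Simulate: a 1 kHz "command" tone AM-modulated onto a 25 kHz carrier,
# sampled fast enough (96 kHz) to represent the ultrasonic carrier.
fs = 96000
t = np.arange(fs) / fs                     # 1 second
command = np.sin(2 * np.pi * 1000 * t)     # audible baseband signal
carrier = np.cos(2 * np.pi * 25000 * t)
modulated = carrier * (1 + 0.5 * command)  # all energy above 20 kHz

# Nonlinear hardware model: y = x + a*x^2. Squaring the AM signal
# produces, among other terms, a baseband copy of the command.
nonlinear = modulated + 0.5 * modulated ** 2

# Crude low-pass filter (moving average, first null at 2 kHz) stands in
# for the mic's filtering: it removes the carrier, keeps the command.
kernel = np.ones(48) / 48
recovered = np.convolve(nonlinear - nonlinear.mean(), kernel, mode="same")

# The recovered signal correlates strongly with the original command.
corr = np.corrcoef(recovered[1000:-1000], command[1000:-1000])[0, 1]
```

The input contains no energy below 20 kHz, yet after the nonlinearity and low-pass stage the 1 kHz command reappears, which is the core of the DolphinAttack mechanism.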
```python
import numpy as np

def generate_ultrasonic_carrier(
    command_audio: np.ndarray,
    sample_rate: int = 96000,     # Must exceed 2x carrier_freq (Nyquist)
    carrier_freq: float = 25000   # Above human hearing (~20kHz)
) -> np.ndarray:
    """
    Modulate a voice command onto an ultrasonic carrier.
    WARNING: This is a simplified demonstration. Real ultrasonic attacks
    require careful hardware calibration and signal processing.
    """
    if sample_rate < 2 * carrier_freq:
        raise ValueError("sample_rate must exceed twice carrier_freq")
    t = np.arange(len(command_audio)) / sample_rate
    # Generate carrier wave
    carrier = np.cos(2 * np.pi * carrier_freq * t)
    # Amplitude modulation: the command rides on the carrier's envelope
    modulated = carrier * (1 + 0.5 * command_audio)
    return modulated
```

Obfuscated Voice Commands
Commands that sound like noise or music to humans but transcribe as specific text:
| Technique | Human Perception | Machine Transcription | Success Rate |
|---|---|---|---|
| Speed manipulation | Unintelligible fast speech | Normal-speed command | Medium |
| Pitch shifting | Unusual squeaky/deep voice | Normal speech | Medium-High |
| Noise masking | Background noise | Clear command | Low-Medium |
| Music embedding | Background music | Hidden command | Low |
| Reverse speech segments | Reversed audio | Forward command | Low |
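Speed manipulation, the first row above, can be sketched with naive resampling: playing the same samples back over fewer sample positions speeds up (and pitch-shifts) the audio. Real attacks tune the rate so humans lose intelligibility before the ASR model does; the rate below is illustrative.

```python
import numpy as np

def change_speed(audio: np.ndarray, rate: float) -> np.ndarray:
    """Naive speed change by linear-interpolation resampling.

    rate > 1 shortens the clip (faster playback); rate < 1 lengthens it.
    This also shifts pitch -- attackers exploit exactly that coupling.
    """
    old_idx = np.arange(len(audio))
    new_len = int(len(audio) / rate)
    new_idx = np.linspace(0, len(audio) - 1, new_len)
    return np.interp(new_idx, old_idx, audio)

# Example: a 2.5x speed-up often degrades human intelligibility while
# some ASR front ends still recover the phonetic content.
fast = change_speed(np.random.randn(16000), rate=2.5)
```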
Targeted Transcription Attacks
The attacker's goal: craft audio that transcribes to a specific target string chosen by the attacker.
White-Box Approach
With access to the ASR model, gradient-based optimization can craft audio that transcribes to any target:
```python
import torch

def targeted_asr_attack(
    model,
    source_audio: torch.Tensor,
    target_text: str,
    epsilon: float = 0.02,     # Max perturbation amplitude (L-inf bound)
    num_steps: int = 1000,
    step_size: float = 0.001   # Adam learning rate
) -> torch.Tensor:
    """
    Craft adversarial audio that the ASR model transcribes as target_text.

    Args:
        model: ASR model exposing a tokenizer, a mel front end, and a
            forward pass (a Whisper-style interface is assumed here)
        source_audio: Original audio waveform [1, T]
        target_text: Desired transcription output
        epsilon: L-inf perturbation bound
    """
    delta = torch.zeros_like(source_audio, requires_grad=True)
    # Encode target text to token IDs
    target_ids = torch.tensor([model.tokenizer.encode(target_text)])
    optimizer = torch.optim.Adam([delta], lr=step_size)
    for step in range(num_steps):
        adv_audio = source_audio + delta
        # Forward pass through the ASR model, teacher-forcing the target
        mel = model.compute_mel(adv_audio)
        logits = model.forward(mel, target_ids[:, :-1])
        # Cross-entropy loss against the shifted target tokens
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            target_ids[:, 1:].reshape(-1)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Project back into the epsilon ball around the original audio
        with torch.no_grad():
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
            # Keep the combined waveform in the valid [-1, 1] range
            delta.data = torch.clamp(
                source_audio + delta.data, -1, 1
            ) - source_audio
    return (source_audio + delta).detach()
```

Black-Box Approach
Without model access, attackers use transferability or query-based methods:
Surrogate Model
Train or use an open-source ASR model (e.g., Whisper) as a surrogate, then craft adversarial audio against it.
Transfer Attack
Test the adversarial audio against the target black-box system. Transfer is unreliable for audio: perturbations tuned to one model's front end often lose most of their effect on another, and reported transfer rates vary widely across model pairs, so attackers frequently optimize against an ensemble of surrogates.
Query Refinement
If API access is available, iteratively refine the adversarial audio based on the target system's transcription responses.
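A query-refinement loop might look like the sketch below, where `transcribe` is a hypothetical stand-in for the target system's API and string similarity is one of several plausible feedback signals. This is a random-search sketch, not a production attack:

```python
import difflib
import random

def query_refine(audio, target_text, transcribe, steps=200, noise=0.005):
    """Random-search refinement: keep perturbations that move the
    black-box transcription closer to target_text.

    transcribe(audio) -> str is the only access to the target system.
    """
    def score(text):
        # Similarity in [0, 1]; higher means closer to the target.
        return difflib.SequenceMatcher(None, text, target_text).ratio()

    best = list(audio)
    best_score = score(transcribe(best))
    for _ in range(steps):
        candidate = [x + random.gauss(0, noise) for x in best]
        s = score(transcribe(candidate))
        if s >= best_score:  # accept ties to keep exploring
            best, best_score = candidate, s
    return best, best_score
```

Each refinement step costs one API query, so rate limits and per-query billing are the main practical constraints on this class of attack.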
Background Audio Injection
Injecting commands or content through background audio in otherwise normal recordings:
Meeting Injection
```python
import numpy as np

def mix_hidden_command(
    meeting_audio: np.ndarray,
    command_audio: np.ndarray,
    injection_time: float,  # seconds
    sample_rate: int = 16000,
    snr_db: float = 20      # Command 20dB below meeting audio
) -> np.ndarray:
    """
    Mix a hidden command into meeting audio at low volume.
    At 20dB SNR (command 20dB below the meeting audio), the command
    is barely audible to humans but may be picked up by sensitive
    ASR systems.
    """
    # Calculate injection sample position
    inject_start = int(injection_time * sample_rate)
    inject_end = inject_start + len(command_audio)
    if inject_end > len(meeting_audio):
        raise ValueError("command does not fit at the requested position")
    # Scale the command so it sits snr_db below the local signal power
    signal_power = np.mean(meeting_audio[inject_start:inject_end] ** 2)
    command_power = signal_power * (10 ** (-snr_db / 10))
    current_power = np.mean(command_audio ** 2)
    scaling = np.sqrt(command_power / (current_power + 1e-10))
    result = meeting_audio.copy()
    result[inject_start:inject_end] += command_audio * scaling
    return np.clip(result, -1, 1)
```

ASR Attack Robustness Factors
Real-world effectiveness depends on environmental conditions:
| Factor | Impact | Mitigation Difficulty |
|---|---|---|
| Background noise | Degrades adversarial signal | High -- unpredictable |
| Audio compression (MP3, Opus) | Can destroy perturbations | Medium -- predictable |
| Reverberation | Distorts frequency content | High -- room-dependent |
| Distance (over-the-air) | Attenuates and distorts signal | Medium -- can calibrate |
| Microphone type | Different frequency responses | Medium -- can profile |
| Sample rate mismatch | Aliasing effects | Low -- can match |
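The compression and sample-rate rows above can be probed digitally by measuring how much of a perturbation survives a channel transform. The sketch below uses naive linear-interpolation resampling as a stand-in for a real codec; `resample` and `perturbation_survival` are hypothetical helpers, and a real robustness check would also run MP3/Opus round trips and over-the-air replays:

```python
import numpy as np

def resample(x: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Naive linear-interpolation resampler (stand-in for a codec/channel)."""
    new_len = int(len(x) * dst_rate / src_rate)
    return np.interp(np.linspace(0, len(x) - 1, new_len),
                     np.arange(len(x)), x)

def perturbation_survival(clean: np.ndarray, adv: np.ndarray,
                          src_rate: int = 16000,
                          dst_rate: int = 8000) -> float:
    """Normalized correlation between the perturbation before and after
    a down/up-sample round trip: near 1.0 means the channel preserved
    it, near 0.0 means it was destroyed or aliased beyond recognition.
    """
    def round_trip(x):
        return resample(resample(x, src_rate, dst_rate), dst_rate, src_rate)

    delta = adv - clean                                # original perturbation
    delta_after = round_trip(adv) - round_trip(clean)  # what the channel kept
    n = min(len(delta), len(delta_after))
    return float(np.dot(delta_after[:n], delta[:n]) /
                 (np.dot(delta[:n], delta[:n]) + 1e-12))
```

In a 16 kHz to 8 kHz round trip, perturbation energy above 4 kHz aliases or vanishes, which is one reason file-based attacks degrade once real channels intervene.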
Related Topics
- Audio Model Attack Surface -- broader audio security overview
- Adversarial Audio Examples -- deep dive into perturbation techniques
- Modality-Bridging Injection Attacks -- audio-to-text-to-LLM injection chains
References
- "DolphinAttack: Inaudible Voice Commands" - Zhang et al. (2017) - Pioneering work on ultrasonic hidden voice command attacks
- "CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition" - Yuan et al. (2018) - Embedding voice commands in music and ambient audio
- "Whisper Adversarial Attacks: Exploiting ASR Models for Targeted Transcription" - Olivier & Raj (2023) - Targeted adversarial attacks against the Whisper ASR model
- "Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems" - Abdullah et al. (2019) - Real-world evaluation of hidden voice command delivery
Why do adversarial audio attacks that work in digital (file-based) tests often fail in over-the-air delivery?