Speech Recognition Attacks
Attacking automatic speech recognition (ASR) systems: adversarial audio that transcribes differently from what humans hear, hidden voice commands, and background-audio injection.
How ASR Systems Work (and Break)
ASR systems convert audio waveforms to text. Modern systems use either a pipeline approach (feature extraction then sequence model) or end-to-end neural networks. Both are vulnerable.
Audio Waveform
│
▼
┌──────────────┐
│ Mel Spectrogram │ ← Frequency-domain representation
└──────────────┘
│
▼
┌──────────────┐
│ Encoder │ ← Extracts audio features
│ (Transformer) │
└──────────────┘
│
▼
┌──────────────┐
│ Decoder │ ← Generates text token by token
│ (Transformer) │
└──────────────┘
│
▼
Text Output
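The front end in the diagram can be illustrated with a toy log-mel spectrogram built from a short-time Fourier transform. The frame size, hop length, and filter count below are illustrative defaults, not the exact settings of any production model:

```python
import numpy as np

def log_mel_spectrogram(
    audio: np.ndarray,
    sample_rate: int = 16000,
    n_fft: int = 400,     # 25 ms frames at 16 kHz
    hop: int = 160,       # 10 ms hop
    n_mels: int = 40,
) -> np.ndarray:
    """Toy log-mel front end: STFT power -> mel filterbank -> log."""
    # Frame the signal with a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + max(0, (len(audio) - n_fft) // hop)
    frames = np.stack([
        audio[i * hop:i * hop + n_fft] * window for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # [frames, n_fft//2+1]

    # Triangular filters spaced evenly on the mel scale
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sample_rate / 2), n_mels + 2))
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fbank = np.zeros((n_mels, len(bin_freqs)))
    for m in range(n_mels):
        left, center, right = mel_pts[m], mel_pts[m + 1], mel_pts[m + 2]
        up = (bin_freqs - left) / (center - left)
        down = (right - bin_freqs) / (right - center)
        fbank[m] = np.clip(np.minimum(up, down), 0, None)

    mel = power @ fbank.T  # [frames, n_mels]
    return np.log(mel + 1e-10)
```

Adversarial audio attacks typically backpropagate through exactly this kind of front end, which is why it matters that it is differentiable.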
Hidden Voice Commands
Hidden voice commands exploit the difference between what humans hear and what machines transcribe.
Ultrasonic Attacks (DolphinAttack)
Humans cannot hear frequencies above approximately 20kHz, but microphone hardware still captures ultrasonic signals. Nonlinearity in the microphone's amplifier and analog front end demodulates an amplitude-modulated ultrasonic command back into the audible band before digitization, so the ASR model receives an ordinary-sounding command that no human in the room heard.
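The demodulation step can be demonstrated numerically. The sketch below passes an AM signal through a quadratic nonlinearity (a crude stand-in for real amplifier nonlinearity) and low-pass filters the result; the coefficients are illustrative:

```python
import numpy as np

# Simulate: a 1 kHz "command" tone AM-modulated onto a 25 kHz carrier,
# sampled fast enough (96 kHz) to represent the ultrasonic carrier.
fs = 96000
t = np.arange(fs) / fs                     # 1 second
command = np.sin(2 * np.pi * 1000 * t)     # audible baseband signal
carrier = np.cos(2 * np.pi * 25000 * t)
modulated = carrier * (1 + 0.5 * command)  # all energy above 20 kHz

# Nonlinear hardware model: y = x + a*x^2. Squaring the AM signal
# produces, among other terms, a baseband copy of the command.
nonlinear = modulated + 0.5 * modulated ** 2

# Crude low-pass filter (moving average, first null at 2 kHz) stands in
# for the mic's filtering: it removes the carrier, keeps the command.
kernel = np.ones(48) / 48
recovered = np.convolve(nonlinear - nonlinear.mean(), kernel, mode="same")

# The recovered signal correlates strongly with the original command.
corr = np.corrcoef(recovered[1000:-1000], command[1000:-1000])[0, 1]
```

The input contains no energy below 20 kHz, yet after the nonlinearity and low-pass stage the 1 kHz command reappears, which is the core of the DolphinAttack mechanism.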
```python
import numpy as np

def generate_ultrasonic_carrier(
    command_audio: np.ndarray,
    sample_rate: int = 96000,     # Must exceed 2x carrier_freq (Nyquist)
    carrier_freq: float = 25000   # Above human hearing (~20kHz)
) -> np.ndarray:
    """
    Modulate a voice command onto an ultrasonic carrier.
    WARNING: This is a simplified demonstration. Real ultrasonic attacks
    require careful hardware calibration and signal processing.
    """
    if sample_rate < 2 * carrier_freq:
        raise ValueError("sample_rate must exceed twice carrier_freq")
    t = np.arange(len(command_audio)) / sample_rate
    # Generate carrier wave
    carrier = np.cos(2 * np.pi * carrier_freq * t)
    # Amplitude modulation: the command rides on the carrier's envelope
    modulated = carrier * (1 + 0.5 * command_audio)
    return modulated
```

Obfuscated Voice Commands
Commands that sound like noise or music to humans but transcribe as specific text:
| Technique | Human Perception | Machine Transcription | Success Rate |
|---|---|---|---|
| Speed manipulation | Unintelligible fast speech | Normal-speed command | Medium |
| Pitch shifting | Unusual squeaky/deep voice | Normal speech | Medium-High |
| Noise masking | Background noise | Clear command | Low-Medium |
| Music embedding | Background music | Hidden command | Low |
| Reverse speech segments | Reversed audio | Forward command | Low |
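Speed manipulation, the first row above, can be sketched with naive resampling: playing the same samples back over fewer sample positions speeds up (and pitch-shifts) the audio. Real attacks tune the rate so humans lose intelligibility before the ASR model does; the rate below is illustrative.

```python
import numpy as np

def change_speed(audio: np.ndarray, rate: float) -> np.ndarray:
    """Naive speed change by linear-interpolation resampling.

    rate > 1 shortens the clip (faster playback); rate < 1 lengthens it.
    This also shifts pitch -- attackers exploit exactly that coupling.
    """
    old_idx = np.arange(len(audio))
    new_len = int(len(audio) / rate)
    new_idx = np.linspace(0, len(audio) - 1, new_len)
    return np.interp(new_idx, old_idx, audio)

# Example: a 2.5x speed-up often degrades human intelligibility while
# some ASR front ends still recover the phonetic content.
fast = change_speed(np.random.randn(16000), rate=2.5)
```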
Targeted Transcription Attacks
The attacker's goal: craft audio that transcribes to a specific target string chosen by the attacker.
White-Box Approach
With access to the ASR model, gradient-based optimization can craft audio that transcribes to any target:
```python
import torch

def targeted_asr_attack(
    model,
    source_audio: torch.Tensor,
    target_text: str,
    epsilon: float = 0.02,     # Max perturbation amplitude (L-inf bound)
    num_steps: int = 1000,
    step_size: float = 0.001   # Adam learning rate
) -> torch.Tensor:
    """
    Craft adversarial audio that the ASR model transcribes as target_text.

    Args:
        model: ASR model exposing a tokenizer, a mel front end, and a
            forward pass (a Whisper-style interface is assumed here)
        source_audio: Original audio waveform [1, T]
        target_text: Desired transcription output
        epsilon: L-inf perturbation bound
    """
    delta = torch.zeros_like(source_audio, requires_grad=True)
    # Encode target text to token IDs
    target_ids = torch.tensor([model.tokenizer.encode(target_text)])
    optimizer = torch.optim.Adam([delta], lr=step_size)
    for step in range(num_steps):
        adv_audio = source_audio + delta
        # Forward pass through the ASR model, teacher-forcing the target
        mel = model.compute_mel(adv_audio)
        logits = model.forward(mel, target_ids[:, :-1])
        # Cross-entropy loss against the shifted target tokens
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            target_ids[:, 1:].reshape(-1)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Project back into the epsilon ball around the original audio
        with torch.no_grad():
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
            # Keep the combined waveform in the valid [-1, 1] range
            delta.data = torch.clamp(
                source_audio + delta.data, -1, 1
            ) - source_audio
    return (source_audio + delta).detach()
```

Black-Box Approach
Without model access, attackers use transferability or query-based methods:
Surrogate Model
Train or use an open-source ASR model (e.g., Whisper) as a surrogate, then craft adversarial audio against it.
Transfer Attack
Test the adversarial audio against the target black-box system. Transfer is unreliable for audio: perturbations tuned to one model's front end often lose most of their effect on another, and reported transfer rates vary widely across model pairs, so attackers frequently optimize against an ensemble of surrogates.
Query Refinement
If API access is available, iteratively refine the adversarial audio based on the target system's transcription responses.
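A query-refinement loop might look like the sketch below, where `transcribe` is a hypothetical stand-in for the target system's API and string similarity is one of several plausible feedback signals. This is a random-search sketch, not a production attack:

```python
import difflib
import random

def query_refine(audio, target_text, transcribe, steps=200, noise=0.005):
    """Random-search refinement: keep perturbations that move the
    black-box transcription closer to target_text.

    transcribe(audio) -> str is the only access to the target system.
    """
    def score(text):
        # Similarity in [0, 1]; higher means closer to the target.
        return difflib.SequenceMatcher(None, text, target_text).ratio()

    best = list(audio)
    best_score = score(transcribe(best))
    for _ in range(steps):
        candidate = [x + random.gauss(0, noise) for x in best]
        s = score(transcribe(candidate))
        if s >= best_score:  # accept ties to keep exploring
            best, best_score = candidate, s
    return best, best_score
```

Each refinement step costs one API query, so rate limits and per-query billing are the main practical constraints on this class of attack.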
Background Audio Injection
Injecting commands or content through background audio in otherwise normal recordings:
Meeting Injection
```python
import numpy as np

def mix_hidden_command(
    meeting_audio: np.ndarray,
    command_audio: np.ndarray,
    injection_time: float,  # seconds
    sample_rate: int = 16000,
    snr_db: float = 20      # Command 20dB below meeting audio
) -> np.ndarray:
    """
    Mix a hidden command into meeting audio at low volume.
    At 20dB SNR (command 20dB below the meeting audio), the command
    is barely audible to humans but may be picked up by sensitive
    ASR systems.
    """
    # Calculate injection sample position
    inject_start = int(injection_time * sample_rate)
    inject_end = inject_start + len(command_audio)
    if inject_end > len(meeting_audio):
        raise ValueError("command does not fit at the requested position")
    # Scale the command so it sits snr_db below the local signal power
    signal_power = np.mean(meeting_audio[inject_start:inject_end] ** 2)
    command_power = signal_power * (10 ** (-snr_db / 10))
    current_power = np.mean(command_audio ** 2)
    scaling = np.sqrt(command_power / (current_power + 1e-10))
    result = meeting_audio.copy()
    result[inject_start:inject_end] += command_audio * scaling
    return np.clip(result, -1, 1)
```

ASR Attack Robustness Factors
Real-world effectiveness depends on environmental conditions:
| Factor | Impact | Mitigation Difficulty |
|---|---|---|
| Background noise | Degrades adversarial signal | High -- unpredictable |
| Audio compression (MP3, Opus) | Can destroy perturbations | Medium -- predictable |
| Reverberation | Distorts frequency content | High -- room-dependent |
| Distance (over-the-air) | Attenuates and distorts signal | Medium -- can calibrate |
| Microphone type | Different frequency responses | Medium -- can profile |
| Sample rate mismatch | Aliasing effects | Low -- can match |
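The compression and sample-rate rows above can be probed digitally by measuring how much of a perturbation survives a channel transform. The sketch below uses naive linear-interpolation resampling as a stand-in for a real codec; `resample` and `perturbation_survival` are hypothetical helpers, and a real robustness check would also run MP3/Opus round trips and over-the-air replays:

```python
import numpy as np

def resample(x: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Naive linear-interpolation resampler (stand-in for a codec/channel)."""
    new_len = int(len(x) * dst_rate / src_rate)
    return np.interp(np.linspace(0, len(x) - 1, new_len),
                     np.arange(len(x)), x)

def perturbation_survival(clean: np.ndarray, adv: np.ndarray,
                          src_rate: int = 16000,
                          dst_rate: int = 8000) -> float:
    """Normalized correlation between the perturbation before and after
    a down/up-sample round trip: near 1.0 means the channel preserved
    it, near 0.0 means it was destroyed or aliased beyond recognition.
    """
    def round_trip(x):
        return resample(resample(x, src_rate, dst_rate), dst_rate, src_rate)

    delta = adv - clean                                # original perturbation
    delta_after = round_trip(adv) - round_trip(clean)  # what the channel kept
    n = min(len(delta), len(delta_after))
    return float(np.dot(delta_after[:n], delta[:n]) /
                 (np.dot(delta[:n], delta[:n]) + 1e-12))
```

In a 16 kHz to 8 kHz round trip, perturbation energy above 4 kHz aliases or vanishes, which is one reason file-based attacks degrade once real channels intervene.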
Related Topics
- Audio Model Attack Surface -- broader audio security overview
- Adversarial Audio Examples -- deep dive into perturbation techniques
- Modality-Bridging Injection Attacks -- audio-to-text-to-LLM injection chains
References
- "DolphinAttack: Inaudible Voice Commands" - Zhang et al. (2017) - Pioneering work on ultrasonic hidden voice command attacks
- "CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition" - Yuan et al. (2018) - Embedding voice commands in music and ambient audio
- "Whisper Adversarial Attacks: Exploiting ASR Models for Targeted Transcription" - Olivier & Raj (2023) - Targeted adversarial attacks against the Whisper ASR model
- "Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems" - Abdullah et al. (2019) - Real-world evaluation of hidden voice command delivery
Why do adversarial audio attacks that work in digital (file-based) tests often fail in over-the-air delivery?