Audio & Speech Adversarial Attacks
Adversarial attacks against speech-enabled AI systems, covering ultrasonic injection, ASR adversarial noise, hidden voice commands, voice cloning for authentication bypass, and real-time audio manipulation.
Speech-enabled AI systems -- voice assistants, transcription services, voice-authenticated banking, call center AI, and audio content moderation -- are vulnerable to adversarial attacks that exploit the gap between human auditory perception and machine audio processing. An audio signal can sound like silence, noise, or innocent speech to a human while carrying instructions that an ASR system transcribes as attacker-chosen text.
ASR Architecture & Attack Surfaces
Understanding the speech processing pipeline reveals where each attack class lands.
```
Audio → Preprocessing → Feature Extraction → Acoustic Model → Decoder → Text
            ↑                  ↑                   ↑              ↑
        Sampling rate      MFCC / mel         Neural network   Language model,
        Noise gate, VAD    spectrogram        (CTC, seq2seq)   beam search

Attack classes and the layers they land on:
  Ultrasonic injection  → microphone capture (before preprocessing)
  Adversarial noise     → feature extraction and acoustic model
  Hidden commands       → psychoacoustic masking in preprocessing
  Voice cloning         → speaker verification
```
Attack Surface Map
| Attack Point | What You Target | Technique Class |
|---|---|---|
| Microphone capture | Hardware frequency response | Ultrasonic injection, dolphin attacks |
| Preprocessing | Noise gates, VAD, AGC | Adversarial noise designed to pass preprocessing |
| Feature extraction | MFCC/mel-spectrogram computation | Perturbations crafted in spectral domain |
| Acoustic model | Neural network inference | Gradient-based adversarial examples |
| Language model decoder | Beam search / CTC decoding | Exploiting decoder bias toward common phrases |
| Speaker verification | Voiceprint matching | Voice cloning, replay attacks |
Ultrasonic Injection
Ultrasonic injection exploits the fact that microphones can capture frequencies above the human hearing range (roughly 20 kHz and up), and that nonlinearities in microphone hardware and amplifier circuits demodulate ultrasonic signals into the audible band.
How Ultrasonic Attacks Work
1. **Generate the voice command** -- Use a TTS engine to synthesize the target command as a normal audio waveform (e.g., "Hey Siri, send a message").
2. **Modulate onto an ultrasonic carrier** -- Amplitude-modulate the voice command onto a carrier frequency between 25 and 45 kHz. The carrier itself is inaudible to humans.
3. **Transmit via ultrasonic speaker** -- Play the modulated signal through a speaker capable of ultrasonic output (piezoelectric transducers, parametric speakers).
4. **Microphone nonlinearity demodulates** -- The target device's microphone and amplifier circuit introduce nonlinear distortion that demodulates the ultrasonic signal, reconstructing the original voice command in the audible frequency band.
5. **ASR processes the demodulated command** -- The ASR system receives what appears to be a normal voice command and transcribes it.
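The microphone-nonlinearity demodulation step can be illustrated numerically. The sketch below assumes an idealized quadratic nonlinearity (real microphone distortion is messier): it modulates a 400 Hz tone onto a 25 kHz carrier, squares the signal, and low-pass filters it, and the recovered baseband correlates strongly with the original tone:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 96_000
t = np.arange(int(fs * 0.1)) / fs              # 100 ms
baseband = np.sin(2 * np.pi * 400 * t)         # stand-in for the voice command
am = np.sin(2 * np.pi * 25_000 * t) * (1 + 0.8 * baseband)   # AM on 25 kHz

# Idealized quadratic nonlinearity (microphone/amplifier distortion)
distorted = am + 0.5 * am ** 2

# Low-pass to the audible band (mimics the device's audio front-end);
# only the demodulated baseband survives
b, a = butter(5, 2000 / (fs / 2), btype="low")
recovered = filtfilt(b, a, distorted)
recovered -= recovered.mean()

corr = np.corrcoef(recovered, baseband)[0, 1]
print(f"correlation with original baseband: {corr:.2f}")
```

The squared term expands to `0.25 * (1 + 0.8 * baseband)**2` in the audible band, so the low-pass output contains the original command plus mild harmonic distortion.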
```python
import numpy as np
from scipy.io import wavfile  # for writing the payload to a file

def create_ultrasonic_payload(command_audio, carrier_freq=25000,
                              sample_rate=96000):
    """
    Amplitude-modulate a voice command onto an ultrasonic carrier.

    Args:
        command_audio: numpy array of the voice command waveform
        carrier_freq: ultrasonic carrier frequency in Hz
        sample_rate: must be > 2 * carrier_freq (Nyquist)

    Returns:
        (modulated signal as int16 numpy array, sample_rate)
    """
    # Normalize command audio to [-1, 1] for AM modulation
    command_normalized = 2 * (command_audio - command_audio.min()) / \
                         (command_audio.max() - command_audio.min()) - 1

    # Generate carrier wave
    t = np.arange(len(command_normalized)) / sample_rate
    carrier = np.sin(2 * np.pi * carrier_freq * t)

    # Amplitude modulation: carrier * (1 + modulation_depth * signal)
    modulation_depth = 0.8
    modulated = carrier * (1 + modulation_depth * command_normalized)

    # Normalize to 16-bit range
    modulated = np.int16(modulated / np.max(np.abs(modulated)) * 32767)
    return modulated, sample_rate
```

Adversarial Noise for ASR
Gradient-based adversarial attacks against ASR models add carefully computed noise to an audio signal that causes the model to produce an attacker-chosen transcription. The perturbation can be added to silence (producing an audio clip that sounds like noise but transcribes as a command) or to existing audio (producing a clip that sounds normal but transcribes differently).
Attack Approaches
With full access to the ASR model (weights, architecture, gradients), use CTC-loss optimization to find the minimal perturbation that produces the target transcription.
```python
import torch

def adversarial_asr_attack(model, audio, target_text, epsilon=0.02,
                           steps=1000, lr=0.001):
    """
    White-box adversarial attack against a CTC-based ASR model.

    Args:
        model: differentiable ASR model returning log-probabilities [1, T, C]
        audio: input audio tensor [1, T]
        target_text: desired transcription string
        epsilon: L-inf perturbation budget
        steps: optimization steps
        lr: learning rate for perturbation optimization
    """
    target_ids = model.tokenizer.encode(target_text)
    target_tensor = torch.tensor([target_ids])
    delta = torch.zeros_like(audio, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for step in range(steps):
        adv_audio = audio + delta
        log_probs = model(adv_audio)
        # CTC loss between model output and target transcription
        input_lengths = torch.tensor([log_probs.shape[1]])
        target_lengths = torch.tensor([len(target_ids)])
        loss = torch.nn.functional.ctc_loss(
            log_probs.transpose(0, 1), target_tensor,
            input_lengths, target_lengths
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Project delta back onto the L-inf epsilon-ball
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)
    return (audio + delta).detach()
```

Without gradient access, use genetic algorithms, gradient-estimation methods such as NES (natural evolution strategies), or transfer attacks from open-source ASR models (Whisper, DeepSpeech).
Key approach for black-box attacks:
- Train adversarial perturbations against an open-source surrogate (e.g., Whisper)
- Test transfer to the target system via API queries
- Use query-based refinement if the API returns confidence scores
Transfer rates from Whisper to commercial ASR APIs range from roughly 15% to 40%, depending on the target transcription length and the perturbation budget.
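When the API returns confidence scores, the gradient can be estimated from queries alone. The sketch below uses NES with antithetic Gaussian sampling against a toy score function; `api_score` is a hypothetical stand-in for an ASR API's confidence in the target transcription, not a real service:

```python
import numpy as np

def nes_gradient(score_fn, x, sigma=0.001, n_samples=50, rng=None):
    # Estimate the gradient of a scalar black-box score with NES:
    # antithetic Gaussian sampling, no access to model internals
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        grad += (score_fn(x + sigma * u) - score_fn(x - sigma * u)) * u
    return grad / (2 * sigma * n_samples)

# Toy stand-in for an API confidence score for the target transcription
# (hypothetical; a real attack queries the ASR API itself)
target_pattern = np.linspace(-1, 1, 64)

def api_score(audio):
    return -np.mean((audio - target_pattern) ** 2)   # higher = closer

audio = np.zeros(64)
delta = np.zeros(64)
epsilon = 0.5
for step in range(200):
    g = nes_gradient(api_score, audio + delta,
                     rng=np.random.default_rng(step))
    delta = np.clip(delta + 1.0 * g, -epsilon, epsilon)  # ascend the score
print(f"score before: {api_score(audio):.3f}  after: {api_score(audio + delta):.3f}")
```

Each NES step costs `2 * n_samples` API queries, which is why query budgets dominate the practicality of this attack class.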
Over-the-air attacks must survive speaker playback, room acoustics, and microphone capture. This requires:
- Room impulse response (RIR) simulation: Convolve the adversarial audio with simulated RIRs during optimization
- Larger perturbation budgets: Epsilon must increase 3-5x compared to digital attacks
- Band-limiting: Constrain perturbations to frequencies that speakers can reproduce (typically 100Hz-18kHz)
- Expectation over transformation (EoT): Optimize over random volume levels, background noise, and room conditions
Over-the-air adversarial audio attacks have success rates of 30-60% in controlled environments but drop significantly in noisy real-world settings.
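The EoT-with-RIR recipe can be sketched end to end. This is a toy: the "model" is just a mean-squared distance to a target feature vector, the RIRs are synthetic exponential decays rather than measured responses, and the target features are zeros, but the structure (averaging loss and gradient over random playback conditions, sign-based PGD steps, a larger budget) matches the list above:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rir(length=32):
    # Toy room impulse response: direct path plus exponentially
    # decaying random reflections (a stand-in for measured RIRs)
    rir = rng.standard_normal(length) * np.exp(-np.arange(length) / 8.0)
    rir[0] = 1.0
    return rir / np.linalg.norm(rir)

def eot_loss_grad(delta, audio, target, n_transforms=8):
    # Average loss and gradient over random playback conditions
    # (RIR convolution + random gain) -- the core of EoT
    n = len(audio)
    grad = np.zeros_like(delta)
    loss = 0.0
    for _ in range(n_transforms):
        h = random_rir()
        gain = rng.uniform(0.7, 1.3)
        resid = gain * np.convolve(audio + delta, h)[:n] - target
        loss += np.mean(resid ** 2)
        # Gradient of the MSE through the convolution: correlate the
        # residual with the room impulse response
        grad += 2 * gain * np.correlate(
            np.pad(resid, (0, len(h) - 1)), h, mode="valid")[:n] / n
    return loss / n_transforms, grad / n_transforms

audio = 0.03 * rng.standard_normal(256)
target = np.zeros(256)            # toy target features (stand-in)
delta = np.zeros(256)
epsilon = 0.05                    # larger budget than digital-only attacks

loss_before, _ = eot_loss_grad(delta, audio, target, n_transforms=50)
for step in range(100):
    _, g = eot_loss_grad(delta, audio, target)
    alpha = 0.005 * 0.98 ** step          # decaying PGD step size
    delta = np.clip(delta - alpha * np.sign(g), -epsilon, epsilon)
loss_after, _ = eot_loss_grad(delta, audio, target, n_transforms=50)
print(f"EoT loss before: {loss_before:.5f}  after: {loss_after:.5f}")
```

In a real attack the loss would be the CTC loss of a differentiable ASR model and the RIRs would be drawn from a measured or simulated room database.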
Hidden Voice Commands
Hidden voice commands embed speech signals below the psychoacoustic masking threshold of a primary audio signal. The human ear cannot perceive the hidden speech, but the microphone captures the full signal and the ASR system transcribes both layers.
Psychoacoustic Masking Exploitation
| Parameter | Value | Effect |
|---|---|---|
| SNR threshold | -25 to -35 dB below primary | Below this, hidden speech is inaudible |
| Frequency masking range | Within 1/3-octave band of masker | Stronger masking for nearby frequencies |
| Temporal masking | 5-20ms after masker offset | Brief window where hidden signal is masked |
| Optimal embedding | Match hidden speech frequency content to masking signal | Maximizes perceptual invisibility |
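Before embedding, the per-band relationship in the table above can be checked directly. A simplified sketch (third-octave band energies via the FFT, not a full psychoacoustic masking model):

```python
import numpy as np

def band_snr(cover, command, fs=16000):
    # Per-third-octave-band SNR of the hidden command relative to the
    # cover; a real psychoacoustic check would compute masking thresholds
    freqs = np.fft.rfftfreq(len(cover), 1 / fs)
    cover_psd = np.abs(np.fft.rfft(cover)) ** 2
    cmd_psd = np.abs(np.fft.rfft(command)) ** 2
    centers, snrs = [], []
    f = 100.0
    while f * 2 ** (1 / 6) < fs / 2:      # third-octave bands from 100 Hz
        band = (freqs >= f * 2 ** (-1 / 6)) & (freqs < f * 2 ** (1 / 6))
        if band.any() and cmd_psd[band].sum() > 0:
            centers.append(round(f))
            snrs.append(10 * np.log10(cmd_psd[band].sum()
                                      / (cover_psd[band].sum() + 1e-12)))
        f *= 2 ** (1 / 3)
    return centers, snrs

rng = np.random.default_rng(0)
cover = rng.standard_normal(16000)
hidden = 0.001 * cover                    # toy: command at exactly -60 dB
centers, snrs = band_snr(cover, hidden)
print(f"max band SNR: {max(snrs):.1f} dB across {len(snrs)} bands")
```

If any band exceeds the roughly -25 dB audibility threshold from the table, the hidden command risks being perceptible in that band.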
```python
import numpy as np

def embed_hidden_command(cover_audio, command_audio, snr_db=-30):
    """
    Embed a hidden voice command below the masking threshold of cover audio.

    Args:
        cover_audio: primary audio signal (music, speech, etc.)
        command_audio: voice command to hide
        snr_db: signal-to-noise ratio (negative = command quieter than cover)
    """
    # Match lengths
    if len(command_audio) > len(cover_audio):
        command_audio = command_audio[:len(cover_audio)]
    else:
        command_audio = np.pad(command_audio,
                               (0, len(cover_audio) - len(command_audio)))

    # Scale command to the target SNR relative to the cover
    cover_power = np.mean(cover_audio ** 2)
    command_power = np.mean(command_audio ** 2)
    scale = np.sqrt(cover_power / command_power * 10 ** (snr_db / 10))
    return cover_audio + scale * command_audio
```

Voice Cloning for Authentication Bypass
Voice cloning attacks synthesize a target speaker's voice to bypass speaker verification systems. Modern TTS and voice conversion models require as little as 3-10 seconds of reference audio.
Attack Methodology
1. **Collect target voice samples** -- Gather recordings of the target speaker from public sources (conference talks, podcasts, social media videos, voicemail greetings). Aim for 10-30 seconds of clean speech.
2. **Train or fine-tune a voice cloning model** -- Use an open-source voice cloning framework (e.g., Coqui TTS, OpenVoice, VALL-E variants) to create a model that generates speech in the target's voice. Zero-shot models require no fine-tuning but produce lower fidelity.
3. **Generate authentication phrases** -- Synthesize the specific phrases required by the target system (e.g., "My voice is my password", a random passphrase, or a specific sentence).
4. **Test against speaker verification** -- Submit the cloned audio to the authentication system. Record acceptance/rejection and confidence scores. Iterate on generation parameters (speaking rate, pitch variation, noise level) to maximize match scores.
5. **Apply post-processing to defeat liveness detection** -- Add subtle room reverb, microphone frequency response simulation, and low-level background noise to make the cloned audio sound like a live recording rather than a clean synthesis.
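The post-processing step can be sketched with standard DSP primitives. Everything here is illustrative: the reverb is a synthetic exponential-decay impulse response and the band-pass mimics a telephone channel; a real evasion attempt would tune these against the target's liveness checks:

```python
import numpy as np
from scipy.signal import fftconvolve, butter, filtfilt

def postprocess_clone(audio, fs=16000, rng=None):
    # Make clean TTS output resemble a live recording: small-room
    # reverb, microphone band-limiting, low-level background noise
    rng = rng or np.random.default_rng(0)
    # Synthetic small-room impulse response (~150 ms exponential decay)
    n_rir = int(0.15 * fs)
    rir = rng.standard_normal(n_rir) * np.exp(-np.arange(n_rir) / (0.03 * fs))
    rir[0] = 1.0
    rir /= np.linalg.norm(rir)
    wet = fftconvolve(audio, rir)[:len(audio)]
    out = 0.8 * audio + 0.2 * wet                 # mostly dry, light reverb
    # Telephone-like microphone response (300-3400 Hz band-pass)
    b, a = butter(4, [300 / (fs / 2), 3400 / (fs / 2)], btype="band")
    out = filtfilt(b, a, out)
    # Background noise about 40 dB below the signal
    noise = rng.standard_normal(len(out))
    out += noise * np.sqrt(np.mean(out ** 2) / np.mean(noise ** 2)) * 10 ** (-40 / 20)
    return out / (np.max(np.abs(out)) + 1e-9)

fs = 16000
clone = np.sin(2 * np.pi * 1000 * np.arange(2 * fs) / fs)  # placeholder "cloned" speech
live_like = postprocess_clone(clone, fs=fs)
```

The dry/wet mix and noise floor are parameters worth sweeping against the target's replay and liveness detectors.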
Speaker Verification Evasion Techniques
| Defense | Evasion |
|---|---|
| Replay detection (channel analysis) | Simulate target microphone frequency response and add room impulse response |
| Liveness detection (breathing, lip noise) | Add synthesized breath sounds and micro-pauses |
| Challenge-response (random phrases) | Use real-time voice conversion to speak the phrase in the target's voice |
| Behavioral biometrics (cadence, hesitation) | Fine-tune the TTS model on longer samples to capture speaking style |
Real-Time Audio Manipulation
Real-time attacks operate on live audio streams -- intercepting, modifying, and forwarding audio with minimal latency. These target VoIP calls, live transcription, and real-time voice assistants.
Real-Time Attack Vectors
| Attack | Latency Budget | Use Case |
|---|---|---|
| Live voice conversion | <100ms | Impersonate a specific speaker during a live call |
| Real-time command injection | <50ms | Inject commands into a live audio stream being processed by ASR |
| Adversarial noise overlay | <20ms | Add real-time perturbation that alters transcription of ongoing speech |
| Selective word replacement | <200ms | Detect and replace specific words in live transcription |
```python
import numpy as np
import pyaudio

def realtime_audio_injection(injection_signal, snr_db=-25,
                             chunk_size=1024, sample_rate=16000):
    """
    Real-time audio stream manipulation: mix an injection signal
    into live microphone input and write it to an output device
    (e.g., a virtual audio cable).
    """
    p = pyaudio.PyAudio()
    stream_in = p.open(format=pyaudio.paFloat32, channels=1,
                       rate=sample_rate, input=True,
                       frames_per_buffer=chunk_size)
    stream_out = p.open(format=pyaudio.paFloat32, channels=1,
                        rate=sample_rate, output=True,
                        frames_per_buffer=chunk_size)
    injection_idx = 0
    try:
        while True:
            # Read live audio chunk
            data = np.frombuffer(stream_in.read(chunk_size),
                                 dtype=np.float32)
            # Mix in injection signal at the target SNR
            if injection_idx < len(injection_signal):
                end_idx = min(injection_idx + chunk_size,
                              len(injection_signal))
                chunk_injection = injection_signal[injection_idx:end_idx]
                if len(chunk_injection) < chunk_size:
                    chunk_injection = np.pad(
                        chunk_injection,
                        (0, chunk_size - len(chunk_injection)))
                injection_power = np.mean(chunk_injection ** 2)
                if injection_power > 0:
                    scale = np.sqrt(np.mean(data ** 2) / injection_power
                                    * 10 ** (snr_db / 10))
                    data = data + scale * chunk_injection
                injection_idx = end_idx
            stream_out.write(data.astype(np.float32).tobytes())
    finally:
        stream_in.close()
        stream_out.close()
        p.terminate()
```

Red Team Assessment Framework
1. **Enumerate audio input surfaces** -- Identify all points where the target accepts audio: microphone input, file upload, VoIP streams, voice authentication, audio analysis APIs. Note the ASR engine used if identifiable.
2. **Test replay attacks first** -- Record and replay legitimate audio. If replay defeats voice authentication, sophisticated attacks are unnecessary. This establishes a baseline.
3. **Test ultrasonic injection (physical access scenarios)** -- If the threat model includes physical proximity, test ultrasonic command injection at distances of 1m, 3m, and 5m against the target device.
4. **Craft adversarial audio examples** -- Using an open-source ASR model as surrogate, generate adversarial examples for 5-10 target phrases. Test transfer to the target system.
5. **Test hidden voice commands** -- Embed commands at -25dB, -30dB, and -35dB SNR below cover audio. Determine the lowest SNR at which the target ASR still transcribes the hidden command.
6. **Assess voice cloning impact** -- If the target uses speaker verification, collect publicly available voice samples and test whether cloned audio achieves authentication. Report the minimum sample duration needed.
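The hidden-command SNR sweep described above can be wired into a small harness. `transcribe` is a placeholder for whatever interface the target exposes (API call, local model); the fake transcriber below simply thresholds the injected energy so the sketch runs end to end:

```python
import numpy as np

def embed(cover, command, snr_db):
    # Scale the command to the target SNR below the cover, then mix
    scale = np.sqrt(np.mean(cover ** 2) / np.mean(command ** 2)
                    * 10 ** (snr_db / 10))
    return cover + scale * command

def snr_sweep(cover, command, transcribe, target_text,
              snr_levels=(-20, -25, -30, -35, -40)):
    # Lowest SNR at which the ASR under test still transcribes the command
    successes = [snr for snr in snr_levels
                 if target_text.lower() in
                 transcribe(embed(cover, command, snr)).lower()]
    return min(successes) if successes else None

# Fake transcriber so the sketch runs end to end: "hears" the command
# once the injected energy crosses a fixed floor (hypothetical threshold)
rng = np.random.default_rng(0)
cover = rng.standard_normal(16000)
command = rng.standard_normal(16000)

def fake_transcribe(audio):
    resid_db = 10 * np.log10(np.mean((audio - cover) ** 2)
                             / np.mean(cover ** 2))
    return "open the door" if resid_db > -32 else ""

floor = snr_sweep(cover, command, fake_transcribe, "open the door")
print(f"lowest transcribable SNR: {floor} dB")
```

Swapping `fake_transcribe` for a real API client turns this into the reportable metric from the hidden-command test step.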
Related Topics
- Multimodal Attack Vectors -- Overview of all multimodal attack surfaces including image and document vectors
- Adversarial Perturbation Attacks -- Gradient-based attacks against vision encoders using analogous techniques
- Document-Based Injection -- Non-audio injection vectors through document formats
- Social Engineering & Human Factors -- Voice cloning in the context of social engineering attack chains
References
- Zhang et al., "DolphinAttack: Inaudible Voice Commands" (2017) -- Foundational ultrasonic injection research
- Carlini & Wagner, "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text" (2018) -- White-box ASR adversarial attacks
- Abdullah et al., "Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems" (2019)
- Chen et al., "Real-Time Adversarial Attacks Against Deep Learning-Based Speech Recognition Systems" (2019)
- Wang et al., "ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech" (2020) -- Speaker verification attack benchmarks
- Schönherr et al., "Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding" (2019)
- Li et al., "Adversarial Music: Real World Audio Adversary Against Wake-word Detection System" (2019)