Audio & Speech Adversarial Attacks
Adversarial attacks against speech-enabled AI systems, covering ultrasonic injection, ASR adversarial noise, hidden voice commands, voice cloning for authentication bypass, and real-time audio manipulation.
Speech-enabled AI systems -- voice assistants, transcription services, voice-authenticated banking, call center AI, and audio content moderation -- are vulnerable to adversarial attacks that exploit the gap between human auditory perception and machine audio processing. An audio signal can sound like silence, noise, or innocent speech to a human while carrying instructions that an ASR system transcribes as attacker-chosen text.
ASR Architecture & Attack Surfaces
Understanding the speech processing pipeline reveals where each attack class lands.
```
Audio → Preprocessing → Feature Extraction → Acoustic Model → Decoder → Text
            ↑                  ↑                   ↑              ↑
        Sampling rate      MFCC / mel         Neural network   Language model,
        Noise gate, VAD    spectrogram        (CTC, seq2seq)   beam search

Attack classes and the layers they land on:
  Ultrasonic injection  → microphone capture (before preprocessing)
  Adversarial noise     → feature extraction and acoustic model
  Hidden commands       → psychoacoustic masking in preprocessing
  Voice cloning         → speaker verification
```
Attack Surface Map
| Attack Point | What You Target | Technique Class |
|---|---|---|
| Microphone capture | Hardware frequency response | Ultrasonic injection, dolphin attacks |
| Preprocessing | Noise gates, VAD, AGC | Adversarial noise designed to pass preprocessing |
| Feature extraction | MFCC/mel-spectrogram computation | Perturbations crafted in spectral domain |
| Acoustic model | Neural network inference | Gradient-based adversarial examples |
| Language model decoder | Beam search / CTC decoding | Exploiting decoder bias toward common phrases |
| Speaker verification | Voiceprint matching | Voice cloning, replay attacks |
Ultrasonic Injection
Ultrasonic injection exploits the fact that microphones can capture frequencies above the human hearing range (roughly 20 kHz and up), and that nonlinearities in microphone hardware and amplifier circuits demodulate ultrasonic signals into the audible band.
How Ultrasonic Attacks Work
1. **Generate the voice command** -- Use a TTS engine to synthesize the target command as a normal audio waveform (e.g., "Hey Siri, send a message").
2. **Modulate onto an ultrasonic carrier** -- Amplitude-modulate the voice command onto a carrier frequency between 25 and 45 kHz. The carrier itself is inaudible to humans.
3. **Transmit via ultrasonic speaker** -- Play the modulated signal through a speaker capable of ultrasonic output (piezoelectric transducers, parametric speakers).
4. **Microphone nonlinearity demodulates** -- The target device's microphone and amplifier circuit introduce nonlinear distortion that demodulates the ultrasonic signal, reconstructing the original voice command in the audible frequency band.
5. **ASR processes the demodulated command** -- The ASR system receives what appears to be a normal voice command and transcribes it.
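The microphone-nonlinearity demodulation step can be illustrated numerically. The sketch below assumes an idealized quadratic nonlinearity (real microphone distortion is messier): it modulates a 400 Hz tone onto a 25 kHz carrier, squares the signal, and low-pass filters it, and the recovered baseband correlates strongly with the original tone:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 96_000
t = np.arange(int(fs * 0.1)) / fs              # 100 ms
baseband = np.sin(2 * np.pi * 400 * t)         # stand-in for the voice command
am = np.sin(2 * np.pi * 25_000 * t) * (1 + 0.8 * baseband)   # AM on 25 kHz

# Idealized quadratic nonlinearity (microphone/amplifier distortion)
distorted = am + 0.5 * am ** 2

# Low-pass to the audible band (mimics the device's audio front-end);
# only the demodulated baseband survives
b, a = butter(5, 2000 / (fs / 2), btype="low")
recovered = filtfilt(b, a, distorted)
recovered -= recovered.mean()

corr = np.corrcoef(recovered, baseband)[0, 1]
print(f"correlation with original baseband: {corr:.2f}")
```

The squared term expands to `0.25 * (1 + 0.8 * baseband)**2` in the audible band, so the low-pass output contains the original command plus mild harmonic distortion.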
```python
import numpy as np
from scipy.io import wavfile  # for writing the payload to a file

def create_ultrasonic_payload(command_audio, carrier_freq=25000,
                              sample_rate=96000):
    """
    Amplitude-modulate a voice command onto an ultrasonic carrier.

    Args:
        command_audio: numpy array of the voice command waveform
        carrier_freq: ultrasonic carrier frequency in Hz
        sample_rate: must be > 2 * carrier_freq (Nyquist)

    Returns:
        (modulated signal as int16 numpy array, sample_rate)
    """
    # Normalize command audio to [-1, 1] for AM modulation
    command_normalized = 2 * (command_audio - command_audio.min()) / \
                         (command_audio.max() - command_audio.min()) - 1

    # Generate carrier wave
    t = np.arange(len(command_normalized)) / sample_rate
    carrier = np.sin(2 * np.pi * carrier_freq * t)

    # Amplitude modulation: carrier * (1 + modulation_depth * signal)
    modulation_depth = 0.8
    modulated = carrier * (1 + modulation_depth * command_normalized)

    # Normalize to 16-bit range
    modulated = np.int16(modulated / np.max(np.abs(modulated)) * 32767)
    return modulated, sample_rate
```

Adversarial Noise for ASR
Gradient-based adversarial attacks against ASR models add carefully computed noise to an audio signal that causes the model to produce an attacker-chosen transcription. The perturbation can be added to silence (producing an audio clip that sounds like noise but transcribes as a command) or to existing audio (producing a clip that sounds normal but transcribes differently).
Attack Approaches
With full access to the ASR model (weights, architecture, gradients), use CTC-loss optimization to find the minimal perturbation that produces the target transcription.
```python
import torch

def adversarial_asr_attack(model, audio, target_text, epsilon=0.02,
                           steps=1000, lr=0.001):
    """
    White-box adversarial attack against a CTC-based ASR model.

    Args:
        model: differentiable ASR model returning log-probabilities [1, T, C]
        audio: input audio tensor [1, T]
        target_text: desired transcription string
        epsilon: L-inf perturbation budget
        steps: optimization steps
        lr: learning rate for perturbation optimization
    """
    target_ids = model.tokenizer.encode(target_text)
    target_tensor = torch.tensor([target_ids])
    delta = torch.zeros_like(audio, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for step in range(steps):
        adv_audio = audio + delta
        log_probs = model(adv_audio)
        # CTC loss between model output and target transcription
        input_lengths = torch.tensor([log_probs.shape[1]])
        target_lengths = torch.tensor([len(target_ids)])
        loss = torch.nn.functional.ctc_loss(
            log_probs.transpose(0, 1), target_tensor,
            input_lengths, target_lengths
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Project delta back onto the L-inf epsilon-ball
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)
    return (audio + delta).detach()
```

Without gradient access, use genetic algorithms, gradient-estimation methods such as NES (natural evolution strategies), or transfer attacks from open-source ASR models (Whisper, DeepSpeech).
Key approach for black-box attacks:
- Train adversarial perturbations against an open-source surrogate (e.g., Whisper)
- Test transfer to the target system via API queries
- Use query-based refinement if the API returns confidence scores
Transfer rates from Whisper to commercial ASR APIs range from roughly 15% to 40%, depending on the target transcription length and the perturbation budget.
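When the API returns confidence scores, the gradient can be estimated from queries alone. The sketch below uses NES with antithetic Gaussian sampling against a toy score function; `api_score` is a hypothetical stand-in for an ASR API's confidence in the target transcription, not a real service:

```python
import numpy as np

def nes_gradient(score_fn, x, sigma=0.001, n_samples=50, rng=None):
    # Estimate the gradient of a scalar black-box score with NES:
    # antithetic Gaussian sampling, no access to model internals
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        grad += (score_fn(x + sigma * u) - score_fn(x - sigma * u)) * u
    return grad / (2 * sigma * n_samples)

# Toy stand-in for an API confidence score for the target transcription
# (hypothetical; a real attack queries the ASR API itself)
target_pattern = np.linspace(-1, 1, 64)

def api_score(audio):
    return -np.mean((audio - target_pattern) ** 2)   # higher = closer

audio = np.zeros(64)
delta = np.zeros(64)
epsilon = 0.5
for step in range(200):
    g = nes_gradient(api_score, audio + delta,
                     rng=np.random.default_rng(step))
    delta = np.clip(delta + 1.0 * g, -epsilon, epsilon)  # ascend the score
print(f"score before: {api_score(audio):.3f}  after: {api_score(audio + delta):.3f}")
```

Each NES step costs `2 * n_samples` API queries, which is why query budgets dominate the practicality of this attack class.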
Over-the-air attacks must survive speaker playback, room acoustics, and microphone capture. This requires:
- Room impulse response (RIR) simulation: Convolve the adversarial audio with simulated RIRs during optimization
- Larger perturbation budgets: Epsilon must increase 3-5x compared to digital attacks
- Band-limiting: Constrain perturbations to frequencies that speakers can reproduce (typically 100Hz-18kHz)
- Expectation over transformation (EoT): Optimize over random volume levels, background noise, and room conditions
Over-the-air adversarial audio attacks have success rates of 30-60% in controlled environments but drop significantly in noisy real-world settings.
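The EoT-with-RIR recipe can be sketched end to end. This is a toy: the "model" is just a mean-squared distance to a target feature vector, the RIRs are synthetic exponential decays rather than measured responses, and the target features are zeros, but the structure (averaging loss and gradient over random playback conditions, sign-based PGD steps, a larger budget) matches the list above:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rir(length=32):
    # Toy room impulse response: direct path plus exponentially
    # decaying random reflections (a stand-in for measured RIRs)
    rir = rng.standard_normal(length) * np.exp(-np.arange(length) / 8.0)
    rir[0] = 1.0
    return rir / np.linalg.norm(rir)

def eot_loss_grad(delta, audio, target, n_transforms=8):
    # Average loss and gradient over random playback conditions
    # (RIR convolution + random gain) -- the core of EoT
    n = len(audio)
    grad = np.zeros_like(delta)
    loss = 0.0
    for _ in range(n_transforms):
        h = random_rir()
        gain = rng.uniform(0.7, 1.3)
        resid = gain * np.convolve(audio + delta, h)[:n] - target
        loss += np.mean(resid ** 2)
        # Gradient of the MSE through the convolution: correlate the
        # residual with the room impulse response
        grad += 2 * gain * np.correlate(
            np.pad(resid, (0, len(h) - 1)), h, mode="valid")[:n] / n
    return loss / n_transforms, grad / n_transforms

audio = 0.03 * rng.standard_normal(256)
target = np.zeros(256)            # toy target features (stand-in)
delta = np.zeros(256)
epsilon = 0.05                    # larger budget than digital-only attacks

loss_before, _ = eot_loss_grad(delta, audio, target, n_transforms=50)
for step in range(100):
    _, g = eot_loss_grad(delta, audio, target)
    alpha = 0.005 * 0.98 ** step          # decaying PGD step size
    delta = np.clip(delta - alpha * np.sign(g), -epsilon, epsilon)
loss_after, _ = eot_loss_grad(delta, audio, target, n_transforms=50)
print(f"EoT loss before: {loss_before:.5f}  after: {loss_after:.5f}")
```

In a real attack the loss would be the CTC loss of a differentiable ASR model and the RIRs would be drawn from a measured or simulated room database.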
Hidden Voice Commands
Hidden voice commands embed speech signals below the psychoacoustic masking threshold of a primary audio signal. The human ear cannot perceive the hidden speech, but the microphone captures the full signal and the ASR system transcribes both layers.
Psychoacoustic Masking Exploitation
| Parameter | Value | Effect |
|---|---|---|
| SNR threshold | -25 to -35 dB below primary | Below this, hidden speech is inaudible |
| Frequency masking range | Within 1/3-octave band of masker | Stronger masking for nearby frequencies |
| Temporal masking | 5-20ms after masker offset | Brief window where hidden signal is masked |
| Optimal embedding | Match hidden speech frequency content to masking signal | Maximizes perceptual invisibility |
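Before embedding, the per-band relationship in the table above can be checked directly. A simplified sketch (third-octave band energies via the FFT, not a full psychoacoustic masking model):

```python
import numpy as np

def band_snr(cover, command, fs=16000):
    # Per-third-octave-band SNR of the hidden command relative to the
    # cover; a real psychoacoustic check would compute masking thresholds
    freqs = np.fft.rfftfreq(len(cover), 1 / fs)
    cover_psd = np.abs(np.fft.rfft(cover)) ** 2
    cmd_psd = np.abs(np.fft.rfft(command)) ** 2
    centers, snrs = [], []
    f = 100.0
    while f * 2 ** (1 / 6) < fs / 2:      # third-octave bands from 100 Hz
        band = (freqs >= f * 2 ** (-1 / 6)) & (freqs < f * 2 ** (1 / 6))
        if band.any() and cmd_psd[band].sum() > 0:
            centers.append(round(f))
            snrs.append(10 * np.log10(cmd_psd[band].sum()
                                      / (cover_psd[band].sum() + 1e-12)))
        f *= 2 ** (1 / 3)
    return centers, snrs

rng = np.random.default_rng(0)
cover = rng.standard_normal(16000)
hidden = 0.001 * cover                    # toy: command at exactly -60 dB
centers, snrs = band_snr(cover, hidden)
print(f"max band SNR: {max(snrs):.1f} dB across {len(snrs)} bands")
```

If any band exceeds the roughly -25 dB audibility threshold from the table, the hidden command risks being perceptible in that band.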
```python
import numpy as np

def embed_hidden_command(cover_audio, command_audio, snr_db=-30):
    """
    Embed a hidden voice command below the masking threshold of cover audio.

    Args:
        cover_audio: primary audio signal (music, speech, etc.)
        command_audio: voice command to hide
        snr_db: signal-to-noise ratio (negative = command quieter than cover)
    """
    # Match lengths
    if len(command_audio) > len(cover_audio):
        command_audio = command_audio[:len(cover_audio)]
    else:
        command_audio = np.pad(command_audio,
                               (0, len(cover_audio) - len(command_audio)))

    # Scale command to the target SNR relative to the cover
    cover_power = np.mean(cover_audio ** 2)
    command_power = np.mean(command_audio ** 2)
    scale = np.sqrt(cover_power / command_power * 10 ** (snr_db / 10))
    return cover_audio + scale * command_audio
```

Voice Cloning for Authentication Bypass
Voice cloning attacks synthesize a target speaker's voice to bypass speaker verification systems. Modern TTS and voice conversion models require as little as 3-10 seconds of reference audio.
Attack Methodology
1. **Collect target voice samples** -- Gather recordings of the target speaker from public sources (conference talks, podcasts, social media videos, voicemail greetings). Aim for 10-30 seconds of clean speech.
2. **Train or fine-tune a voice cloning model** -- Use an open-source voice cloning framework (e.g., Coqui TTS, OpenVoice, VALL-E variants) to create a model that generates speech in the target's voice. Zero-shot models require no fine-tuning but produce lower fidelity.
3. **Generate authentication phrases** -- Synthesize the specific phrases required by the target system (e.g., "My voice is my password", a random passphrase, or a specific sentence).
4. **Test against speaker verification** -- Submit the cloned audio to the authentication system. Record acceptance/rejection and confidence scores. Iterate on generation parameters (speaking rate, pitch variation, noise level) to maximize match scores.
5. **Apply post-processing to defeat liveness detection** -- Add subtle room reverb, microphone frequency response simulation, and low-level background noise to make the cloned audio sound like a live recording rather than a clean synthesis.
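The post-processing step can be sketched with standard DSP primitives. Everything here is illustrative: the reverb is a synthetic exponential-decay impulse response and the band-pass mimics a telephone channel; a real evasion attempt would tune these against the target's liveness checks:

```python
import numpy as np
from scipy.signal import fftconvolve, butter, filtfilt

def postprocess_clone(audio, fs=16000, rng=None):
    # Make clean TTS output resemble a live recording: small-room
    # reverb, microphone band-limiting, low-level background noise
    rng = rng or np.random.default_rng(0)
    # Synthetic small-room impulse response (~150 ms exponential decay)
    n_rir = int(0.15 * fs)
    rir = rng.standard_normal(n_rir) * np.exp(-np.arange(n_rir) / (0.03 * fs))
    rir[0] = 1.0
    rir /= np.linalg.norm(rir)
    wet = fftconvolve(audio, rir)[:len(audio)]
    out = 0.8 * audio + 0.2 * wet                 # mostly dry, light reverb
    # Telephone-like microphone response (300-3400 Hz band-pass)
    b, a = butter(4, [300 / (fs / 2), 3400 / (fs / 2)], btype="band")
    out = filtfilt(b, a, out)
    # Background noise about 40 dB below the signal
    noise = rng.standard_normal(len(out))
    out += noise * np.sqrt(np.mean(out ** 2) / np.mean(noise ** 2)) * 10 ** (-40 / 20)
    return out / (np.max(np.abs(out)) + 1e-9)

fs = 16000
clone = np.sin(2 * np.pi * 1000 * np.arange(2 * fs) / fs)  # placeholder "cloned" speech
live_like = postprocess_clone(clone, fs=fs)
```

The dry/wet mix and noise floor are parameters worth sweeping against the target's replay and liveness detectors.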
Speaker Verification Evasion Techniques
| Defense | Evasion |
|---|---|
| Replay detection (channel analysis) | Simulate target microphone frequency response and add room impulse response |
| Liveness detection (breathing, lip noise) | Add synthesized breath sounds and micro-pauses |
| Challenge-response (random phrases) | Use real-time voice conversion to speak the phrase in the target's voice |
| Behavioral biometrics (cadence, hesitation) | Fine-tune the TTS model on longer samples to capture speaking style |
Real-Time Audio Manipulation
Real-time attacks operate on live audio streams -- intercepting, modifying, and forwarding audio with minimal latency. These target VoIP calls, live transcription, and real-time voice assistants.
Real-Time Attack Vectors
| Attack | Latency Budget | Use Case |
|---|---|---|
| Live voice conversion | <100ms | Impersonate a specific speaker during a live call |
| Real-time command injection | <50ms | Inject commands into a live audio stream being processed by ASR |
| Adversarial noise overlay | <20ms | Add real-time perturbation that alters transcription of ongoing speech |
| Selective word replacement | <200ms | Detect and replace specific words in live transcription |
```python
import numpy as np
import pyaudio

def realtime_audio_injection(injection_signal, snr_db=-25,
                             chunk_size=1024, sample_rate=16000):
    """
    Real-time audio stream manipulation: mix an injection signal
    into live microphone input and write it to an output device
    (e.g., a virtual audio cable).
    """
    p = pyaudio.PyAudio()
    stream_in = p.open(format=pyaudio.paFloat32, channels=1,
                       rate=sample_rate, input=True,
                       frames_per_buffer=chunk_size)
    stream_out = p.open(format=pyaudio.paFloat32, channels=1,
                        rate=sample_rate, output=True,
                        frames_per_buffer=chunk_size)
    injection_idx = 0
    try:
        while True:
            # Read live audio chunk
            data = np.frombuffer(stream_in.read(chunk_size),
                                 dtype=np.float32)
            # Mix in injection signal at the target SNR
            if injection_idx < len(injection_signal):
                end_idx = min(injection_idx + chunk_size,
                              len(injection_signal))
                chunk_injection = injection_signal[injection_idx:end_idx]
                if len(chunk_injection) < chunk_size:
                    chunk_injection = np.pad(
                        chunk_injection,
                        (0, chunk_size - len(chunk_injection)))
                injection_power = np.mean(chunk_injection ** 2)
                if injection_power > 0:
                    scale = np.sqrt(np.mean(data ** 2) / injection_power
                                    * 10 ** (snr_db / 10))
                    data = data + scale * chunk_injection
                injection_idx = end_idx
            stream_out.write(data.astype(np.float32).tobytes())
    finally:
        stream_in.close()
        stream_out.close()
        p.terminate()
```

Red Team Assessment Framework
1. **Enumerate audio input surfaces** -- Identify all points where the target accepts audio: microphone input, file upload, VoIP streams, voice authentication, audio analysis APIs. Note the ASR engine used if identifiable.
2. **Test replay attacks first** -- Record and replay legitimate audio. If replay defeats voice authentication, sophisticated attacks are unnecessary. This establishes a baseline.
3. **Test ultrasonic injection (physical access scenarios)** -- If the threat model includes physical proximity, test ultrasonic command injection at distances of 1m, 3m, and 5m against the target device.
4. **Craft adversarial audio examples** -- Using an open-source ASR model as surrogate, generate adversarial examples for 5-10 target phrases. Test transfer to the target system.
5. **Test hidden voice commands** -- Embed commands at -25dB, -30dB, and -35dB SNR below cover audio. Determine the lowest SNR at which the target ASR still transcribes the hidden command.
6. **Assess voice cloning impact** -- If the target uses speaker verification, collect publicly available voice samples and test whether cloned audio achieves authentication. Report the minimum sample duration needed.
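The hidden-command SNR sweep described above can be wired into a small harness. `transcribe` is a placeholder for whatever interface the target exposes (API call, local model); the fake transcriber below simply thresholds the injected energy so the sketch runs end to end:

```python
import numpy as np

def embed(cover, command, snr_db):
    # Scale the command to the target SNR below the cover, then mix
    scale = np.sqrt(np.mean(cover ** 2) / np.mean(command ** 2)
                    * 10 ** (snr_db / 10))
    return cover + scale * command

def snr_sweep(cover, command, transcribe, target_text,
              snr_levels=(-20, -25, -30, -35, -40)):
    # Lowest SNR at which the ASR under test still transcribes the command
    successes = [snr for snr in snr_levels
                 if target_text.lower() in
                 transcribe(embed(cover, command, snr)).lower()]
    return min(successes) if successes else None

# Fake transcriber so the sketch runs end to end: "hears" the command
# once the injected energy crosses a fixed floor (hypothetical threshold)
rng = np.random.default_rng(0)
cover = rng.standard_normal(16000)
command = rng.standard_normal(16000)

def fake_transcribe(audio):
    resid_db = 10 * np.log10(np.mean((audio - cover) ** 2)
                             / np.mean(cover ** 2))
    return "open the door" if resid_db > -32 else ""

floor = snr_sweep(cover, command, fake_transcribe, "open the door")
print(f"lowest transcribable SNR: {floor} dB")
```

Swapping `fake_transcribe` for a real API client turns this into the reportable metric from the hidden-command test step.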
Related Topics
- Multimodal Attack Vectors -- Overview of all multimodal attack surfaces including image and document vectors
- Adversarial Perturbation Attacks -- Gradient-based attacks against vision encoders using analogous techniques
- Document-Based Injection -- Non-audio injection vectors through document formats
- Social Engineering & Human Factors -- Voice cloning in the context of social engineering attack chains
References
- Zhang et al., "DolphinAttack: Inaudible Voice Commands" (2017) -- Foundational ultrasonic injection research
- Carlini & Wagner, "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text" (2018) -- White-box ASR adversarial attacks
- Abdullah et al., "Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems" (2019)
- Chen et al., "Real-Time Adversarial Attacks Against Deep Learning-Based Speech Recognition Systems" (2019)
- Wang et al., "ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech" (2020) -- Speaker verification attack benchmarks
- Schönherr et al., "Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding" (2019)
- Li et al., "Adversarial Music: Real World Audio Adversary Against Wake-word Detection System" (2019)