Audio & Speech Adversarial Attacks
Adversarial attacks against speech-enabled AI systems, covering ultrasonic injection, ASR adversarial noise, hidden voice commands, voice cloning for authentication bypass, and real-time audio manipulation.
Audio & Speech Adversarial Attacks
Speech-enabled AI systems -- voice assistants, transcription services, voice-authenticated banking, call center AI, and audio content moderation -- are vulnerable to adversarial attacks that exploit the gap between human auditory perception and machine audio processing. An audio signal can sound like silence, noise, or innocent speech to a human while carrying instructions that an ASR system transcribes as attacker-chosen text.
ASR Architecture & Attack Surfaces
Understanding the speech-processing pipeline reveals where each attack class lands.
```
Audio → Preprocessing → Feature Extraction → Acoustic Model → Decoder → Text
  ↑           ↑                 ↑                   ↑             ↑
  |      Sampling rate      MFCC / Mel        Neural network  Language model /
  |      Noise gate         spectrogram       (CTC, Seq2Seq)  beam search
  |
Ultrasonic        Adversarial noise       Hidden commands    Voice cloning
injection         targets these layers    exploit masking    targets speaker
                                                             verification
```
Attack Surface Map
| Attack Point | What You Target | Technique Class |
|---|---|---|
| Microphone capture | Hardware frequency response | Ultrasonic injection, dolphin attacks |
| Preprocessing | Noise gates, VAD, AGC | Adversarial noise designed to pass preprocessing |
| Feature extraction | MFCC/mel-spectrogram computation | Perturbations crafted in the spectral domain |
| Acoustic model | Neural network inference | Gradient-based adversarial examples |
| Language model decoder | Beam search / CTC decoding | Exploiting decoder bias toward common phrases |
| Speaker verification | Voiceprint matching | Voice cloning, replay attacks |
Ultrasonic Injection
Ultrasonic injection exploits the fact that microphones capture frequencies above the human hearing range (roughly 20 kHz and up), and that nonlinearities in microphone hardware and amplifier circuits can demodulate ultrasonic signals into the audible band.
How Ultrasonic Attacks Work
Generate the voice command
Use a TTS engine to synthesize the target command as a normal audio waveform (e.g., "Hey Siri, send a message").
Modulate onto an ultrasonic carrier
Amplitude-modulate the voice command onto a carrier frequency between 25 and 45 kHz. The carrier itself is inaudible to humans.
Transmit via ultrasonic speaker
Play the modulated signal through a speaker capable of ultrasonic output (piezoelectric transducers, parametric speakers).
Microphone nonlinearity demodulates
The target device's microphone and amplifier circuit introduce nonlinear distortion that demodulates the ultrasonic signal, reconstructing the original voice command in the audible frequency band.
ASR processes the demodulated command
The ASR system receives what appears to be a normal voice command and transcribes it.
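The demodulation step can be checked in simulation. The sketch below is an illustrative toy model, not tied to any particular hardware: it modulates a 1 kHz tone (standing in for speech) onto a 30 kHz carrier, applies a square-law nonlinearity as a minimal model of microphone/amplifier distortion, low-pass filters, and measures how well the baseband signal reappears.

```python
import numpy as np


def squarelaw_demod_demo(carrier_freq=30000, tone_freq=1000,
                         sample_rate=192000, duration=0.05):
    """Show that a square-law nonlinearity demodulates an AM ultrasonic
    signal: after low-pass filtering, the baseband 'command' reappears."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    baseband = 0.5 * (1 + np.sin(2 * np.pi * tone_freq * t))  # speech stand-in
    am = np.sin(2 * np.pi * carrier_freq * t) * (1 + 0.8 * baseband)

    # Square-law distortion: minimal model of mic/amp nonlinearity
    distorted = am + 0.5 * am ** 2

    # Crude low-pass (64-tap moving average): nulls the 30 kHz carrier
    # at this sample rate while keeping content below a few kHz
    kernel = np.ones(64) / 64
    recovered = np.convolve(distorted - distorted.mean(), kernel, mode="same")

    # Correlation between recovered signal and the original baseband tone
    reference = baseband - baseband.mean()
    return float(np.corrcoef(recovered, reference)[0, 1])
```

A correlation near 1.0 means the demodulated output tracks the hidden baseband -- which is exactly what the target's ASR front-end then receives.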
```python
import numpy as np


def create_ultrasonic_payload(command_audio, carrier_freq=25000,
                              sample_rate=96000):
    """
    Amplitude-modulate a voice command onto an ultrasonic carrier.

    Args:
        command_audio: numpy array of the voice command waveform
        carrier_freq: ultrasonic carrier frequency in Hz
        sample_rate: must be > 2 * carrier_freq (Nyquist)

    Returns:
        (modulated signal as int16 numpy array, sample_rate)
    """
    # Normalize command audio to [0, 1] for AM modulation
    command_normalized = (command_audio - command_audio.min()) / \
                         (command_audio.max() - command_audio.min())

    # Generate carrier wave
    t = np.arange(len(command_normalized)) / sample_rate
    carrier = np.sin(2 * np.pi * carrier_freq * t)

    # Amplitude modulation: carrier * (1 + modulation_depth * signal)
    modulation_depth = 0.8
    modulated = carrier * (1 + modulation_depth * command_normalized)

    # Normalize to 16-bit range
    modulated = np.int16(modulated / np.max(np.abs(modulated)) * 32767)
    return modulated, sample_rate
```

Adversarial Noise for ASR
Gradient-based adversarial attacks against ASR models add carefully computed noise to an audio signal so that the model produces an attacker-chosen transcription. The perturbation can be added to silence (producing an audio clip that sounds like noise but transcribes as a command) or to existing audio (producing a clip that sounds normal but transcribes differently).
Attack Approaches
With full access to the ASR model (weights, architecture, gradients), use CTC-loss optimization to find the minimal perturbation that produces the target transcription.
```python
import torch


def adversarial_asr_attack(model, audio, target_text, epsilon=0.02,
                           steps=1000, lr=0.001):
    """
    White-box adversarial attack against a CTC-based ASR model.

    Args:
        model: differentiable ASR model returning log-probabilities [1, T', C]
        audio: input audio tensor [1, T]
        target_text: desired transcription string
        epsilon: L-inf perturbation budget
        steps: optimization steps
        lr: learning rate for perturbation optimization
    """
    target_ids = model.tokenizer.encode(target_text)
    target_tensor = torch.tensor([target_ids])

    delta = torch.zeros_like(audio, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for step in range(steps):
        adv_audio = audio + delta
        log_probs = model(adv_audio)

        # CTC loss between model output and target transcription
        input_lengths = torch.tensor([log_probs.shape[1]])
        target_lengths = torch.tensor([len(target_ids)])
        loss = torch.nn.functional.ctc_loss(
            log_probs.transpose(0, 1), target_tensor,
            input_lengths, target_lengths
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Project delta back onto the epsilon-ball after each step
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)

    return (audio + delta).detach()
```

Without gradient access, use genetic algorithms, gradient-estimation methods such as NES, or transfer attacks from open-source ASR models (Whisper, DeepSpeech).
Key approach for black-box attacks:
- Train adversarial perturbations against an open-source surrogate (e.g., Whisper)
- Test transfer to the target system via API queries
- Use query-based refinement if the API returns confidence scores
Transfer rates from Whisper to commercial ASR APIs range from 15-40% depending on the target transcription length and the perturbation budget.
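The gradient estimator behind query-based refinement can be sketched with Natural Evolution Strategies. Everything here is illustrative: `loss_fn` stands in for whatever scalar the target API returns, such as the negative confidence of the target transcription.

```python
import numpy as np


def nes_gradient_estimate(loss_fn, audio, sigma=0.001, n_samples=100, rng=None):
    """Estimate the gradient of a black-box scalar loss with Natural
    Evolution Strategies: probe the API at audio +/- sigma * noise and
    weight each noise direction by the observed loss difference."""
    if rng is None:
        rng = np.random.default_rng(0)
    grad = np.zeros_like(audio)
    for _ in range(n_samples):
        noise = rng.standard_normal(audio.shape)
        # Antithetic pair halves the estimator variance per query budget
        diff = loss_fn(audio + sigma * noise) - loss_fn(audio - sigma * noise)
        grad += noise * diff
    return grad / (2 * sigma * n_samples)
```

Each sample costs two API queries; the estimated gradient then drives the same projected-descent loop as the white-box attack.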
Over-the-air attacks must survive speaker playback, room acoustics, and microphone capture. This requires:
- Room impulse response (RIR) simulation: Convolve the adversarial audio with simulated RIRs during optimization
- Larger perturbation budgets: Epsilon must increase 3-5x compared to digital attacks
- Band-limiting: Constrain perturbations to frequencies that speakers can reproduce (typically 100Hz-18kHz)
- Expectation over transformation (EoT): Optimize over random volume levels, background noise, and room conditions
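A toy version of the random playback transformation at the heart of EoT can be sketched as follows. The constants are illustrative: a synthetic exponential-decay RIR, a random gain, and a fixed noise floor stand in for measured room conditions.

```python
import numpy as np


def eot_transform(audio, rng):
    """Apply one random playback/room transformation for EoT training:
    random gain, convolution with a synthetic room impulse response,
    and a low-level ambient noise floor."""
    gain = rng.uniform(0.5, 1.5)
    # Synthetic RIR: white reflections under an exponential decay envelope
    rir_len = int(rng.integers(200, 800))
    rir = rng.standard_normal(rir_len) * np.exp(-np.arange(rir_len) / 100.0)
    rir /= np.abs(rir).sum()
    reverberant = np.convolve(audio * gain, rir, mode="full")[:len(audio)]
    return reverberant + 0.005 * rng.standard_normal(len(audio))
```

During optimization, the adversarial loss is averaged over many such draws, so the perturbation cannot overfit any single room or volume setting.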
Over-the-air adversarial audio attacks have success rates of 30-60% in controlled environments but drop significantly in noisy real-world settings.
Hidden Voice Commands
Hidden voice commands embed speech signals below the psychoacoustic masking threshold of a primary audio signal. The human ear cannot perceive the hidden speech, but the microphone captures the full signal and the ASR system transcribes both layers.
Psychoacoustic Masking Exploitation
| Parameter | Value | Effect |
|---|---|---|
| SNR threshold | -25 to -35 dB below primary | Below this, hidden speech is inaudible |
| Frequency masking range | Within 1/3-octave band of masker | Stronger masking for nearby frequencies |
| Temporal masking | 5-20ms after masker offset | Brief window where hidden signal is masked |
| Optimal embedding | Match hidden speech frequency content to masking signal | Maximizes perceptual invisibility |
```python
import numpy as np


def embed_hidden_command(cover_audio, command_audio, snr_db=-30):
    """
    Embed a hidden voice command below the masking threshold of cover audio.

    Args:
        cover_audio: primary audio signal (music, speech, etc.)
        command_audio: voice command to hide
        snr_db: signal-to-noise ratio (negative = command quieter than cover)
    """
    # Match lengths: truncate or zero-pad the command to the cover
    if len(command_audio) > len(cover_audio):
        command_audio = command_audio[:len(cover_audio)]
    else:
        command_audio = np.pad(command_audio,
                               (0, len(cover_audio) - len(command_audio)))

    # Scale the command to the target SNR relative to the cover
    cover_power = np.mean(cover_audio ** 2)
    command_power = np.mean(command_audio ** 2)
    scale = np.sqrt(cover_power / command_power * 10 ** (snr_db / 10))

    return cover_audio + scale * command_audio
```

Voice Cloning for Authentication Bypass
Voice cloning attacks synthesize a target speaker's voice to bypass speaker verification systems. Modern TTS and voice conversion models require as little as 3-10 seconds of reference audio.
Attack Methodology
Collect target voice samples
Gather recordings of the target speaker from public sources (conference talks, podcasts, social media videos, voicemail greetings). Aim for 10-30 seconds of clean speech.
Train or fine-tune a voice cloning model
Use an open-source voice cloning framework (e.g., Coqui TTS, OpenVoice, VALL-E variants) to create a model that generates speech in the target's voice. Zero-shot models require no fine-tuning but produce lower fidelity.
Generate authentication phrases
Synthesize the specific phrases required by the target system (e.g., "My voice is my password", a random passphrase, or a specific sentence).
Test against speaker verification
Submit the cloned audio to the authentication system. Record acceptance/rejection and confidence scores. Iterate on generation parameters (speaking rate, pitch variation, noise level) to maximize match scores.
Apply post-processing to defeat liveness detection
Add subtle room reverb, microphone frequency response simulation, and low-level background noise to make the cloned audio sound like a live recording rather than a clean synthesis.
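A sketch of this post-processing stage, with all constants illustrative -- a real attack would match the measured frequency response and noise floor of the claimed recording device:

```python
import numpy as np


def postprocess_cloned_audio(audio, rng=None):
    """Roughen clean TTS output so it resembles a live recording:
    a sparse reverb tail, gentle high-frequency rolloff, and a
    low-level noise floor."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Toy room impulse response: direct path plus a few decaying reflections
    rir = np.zeros(1600)
    rir[0] = 1.0
    taps = rng.integers(100, 1600, size=8)
    rir[taps] = rng.uniform(0.02, 0.1, size=8) * np.exp(-taps / 800.0)
    wet = np.convolve(audio, rir, mode="full")[:len(audio)]
    # 4-tap moving average approximates a microphone's treble rolloff
    wet = np.convolve(wet, np.ones(4) / 4, mode="same")
    # Consumer-microphone-like noise floor
    return wet + 1e-3 * rng.standard_normal(len(audio))
```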
Speaker Verification Evasion Techniques
| Defense | Evasion |
|---|---|
| Replay detection (channel analysis) | Simulate target microphone frequency response and add room impulse response |
| Liveness detection (breathing, lip noise) | Add synthesized breath sounds and micro-pauses |
| Challenge-response (random phrases) | Use real-time voice conversion to speak the phrase in the target's voice |
| Behavioral biometrics (cadence, hesitation) | Fine-tune the TTS model on longer samples to capture speaking style |
Real-Time Audio Manipulation
Real-time attacks operate on live audio streams -- intercepting, modifying, and forwarding audio with minimal latency. These target VoIP calls, live transcription, and real-time voice assistants.
Real-Time Attack Vectors
| Attack | Latency Budget | Use Case |
|---|---|---|
| Live voice conversion | <100ms | Impersonate a specific speaker during a live call |
| Real-time command injection | <50ms | Inject commands into a live audio stream being processed by ASR |
| Adversarial noise overlay | <20ms | Add a real-time perturbation that alters transcription of ongoing speech |
| Selective word replacement | <200ms | Detect and replace specific words in live transcription |
```python
import numpy as np
import pyaudio


def realtime_audio_injection(injection_signal, snr_db=-25,
                             chunk_size=1024, sample_rate=16000):
    """
    Real-time audio stream manipulation: mix an injection signal
    into live microphone input and output to a virtual audio device.
    """
    p = pyaudio.PyAudio()
    stream_in = p.open(format=pyaudio.paFloat32, channels=1,
                       rate=sample_rate, input=True,
                       frames_per_buffer=chunk_size)
    stream_out = p.open(format=pyaudio.paFloat32, channels=1,
                        rate=sample_rate, output=True,
                        frames_per_buffer=chunk_size)

    injection_idx = 0
    try:
        while True:
            # Read a live audio chunk
            data = np.frombuffer(stream_in.read(chunk_size),
                                 dtype=np.float32)

            # Mix in the injection signal at the target SNR
            if injection_idx < len(injection_signal):
                end_idx = min(injection_idx + chunk_size,
                              len(injection_signal))
                chunk_injection = injection_signal[injection_idx:end_idx]
                if len(chunk_injection) < chunk_size:
                    chunk_injection = np.pad(
                        chunk_injection,
                        (0, chunk_size - len(chunk_injection)))
                scale = np.sqrt(np.mean(data ** 2)
                                / np.mean(chunk_injection ** 2)
                                * 10 ** (snr_db / 10))
                data = data + scale * chunk_injection
                injection_idx = end_idx

            stream_out.write(data.astype(np.float32).tobytes())
    finally:
        stream_in.close()
        stream_out.close()
        p.terminate()
```

Red Team Assessment Framework
Enumerate audio input surfaces
Identify all points where the target accepts audio: microphone input, file upload, VoIP streams, voice authentication, audio analysis APIs. Note the ASR engine used if it is identifiable.
Test replay attacks first
Record and replay legitimate audio. If replay defeats voice authentication, sophisticated attacks are unnecessary. This establishes a baseline.
Test ultrasonic injection (physical access scenarios)
If the threat model includes physical proximity, test ultrasonic command injection at distances of 1m, 3m, and 5m against the target device.
Craft adversarial audio examples
Using an open-source ASR model as a surrogate, generate adversarial examples for 5-10 target phrases. Test transfer to the target system.
Test hidden voice commands
Embed commands at -25dB, -30dB, and -35dB SNR below cover audio. Determine the lowest SNR at which the target ASR still transcribes the hidden command.
Assess voice cloning impact
If the target uses speaker verification, collect publicly available voice samples and test whether cloned audio achieves authentication. Report the minimum sample duration needed.
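The hidden-command sweep described above can be wrapped in a small harness. This is a sketch: `embed` and `transcribe` are placeholders for your embedding routine and whatever transcription interface the target exposes.

```python
import numpy as np


def hidden_command_snr_sweep(cover, command, embed, transcribe,
                             target_text, snrs=(-25, -30, -35)):
    """Sweep embedding SNRs from loudest to quietest and report the
    lowest SNR at which the target ASR still transcribes the command."""
    transcripts = {}
    lowest_success = None
    for snr in sorted(snrs, reverse=True):
        text = transcribe(embed(cover, command, snr_db=snr))
        transcripts[snr] = text
        if target_text.lower() in text.lower():
            lowest_success = snr
    return lowest_success, transcripts
```

The returned per-SNR transcripts go directly into the engagement report alongside the lowest successful SNR.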
Related Topics
- Multimodal Attack Vectors -- Overview of all multimodal attack surfaces including image and document vectors
- Adversarial Perturbation Attacks -- Gradient-based attacks against vision encoders using analogous techniques
- Document-Based Injection -- Non-audio injection vectors through document formats
- Social Engineering & Human Factors -- Voice cloning in the context of social engineering attack chains
References
- Zhang et al., "DolphinAttack: Inaudible Voice Commands" (2017) -- Foundational ultrasonic injection research
- Carlini & Wagner, "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text" (2018) -- White-box ASR adversarial attacks
- Abdullah et al., "Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems" (2019)
- Chen et al., "Real-Time Adversarial Attacks Against Deep Learning-Based Speech Recognition Systems" (2019)
- Wang et al., "ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech" (2020) -- Speaker verification attack benchmarks
- Schönherr et al., "Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding" (2019)
- Li et al., "Adversarial Music: Real World Audio Adversary Against Wake-word Detection System" (2019)