Audio & Speech Adversarial Attacks
Adversarial attacks against speech-enabled AI systems, covering ultrasonic injection, ASR adversarial noise, hidden voice commands, voice cloning for authentication bypass, and real-time audio manipulation.
Audio & Speech Adversarial Attacks
Speech-enabled AI systems -- voice assistants, transcription services, voice-authenticated banking, call center AI, and audio content moderation -- are vulnerable to adversarial attacks that exploit the gap between human auditory perception and machine audio processing. An audio signal can sound like silence, noise, or innocent speech to a human while carrying instructions that an ASR system transcribes as attacker-chosen text.
ASR Architecture & Attack Surfaces
Understanding the speech-processing pipeline reveals where each attack class lands.
```
Audio → Preprocessing → Feature Extraction → Acoustic Model → Decoder → Text
  ↑           ↑                 ↑                   ↑             ↑
  |      Sampling rate      MFCC / Mel        Neural network  Language model /
  |      Noise gate         spectrogram       (CTC, Seq2Seq)  beam search
  |
Ultrasonic        Adversarial noise       Hidden commands    Voice cloning
injection         targets these layers    exploit masking    targets speaker
                                                             verification
```
Attack Surface Map
| Attack Point | What You Target | Technique Class |
|---|---|---|
| Microphone capture | Hardware frequency response | Ultrasonic injection, dolphin attacks |
| Preprocessing | Noise gates, VAD, AGC | Adversarial noise designed to pass preprocessing |
| Feature extraction | MFCC/mel-spectrogram computation | Perturbations crafted in the spectral domain |
| Acoustic model | Neural network inference | Gradient-based adversarial examples |
| Language model decoder | Beam search / CTC decoding | Exploiting decoder bias toward common phrases |
| Speaker verification | Voiceprint matching | Voice cloning, replay attacks |
Ultrasonic Injection
Ultrasonic injection exploits the fact that microphones capture frequencies above the human hearing range (roughly 20 kHz and up), and that nonlinearities in microphone hardware and amplifier circuits can demodulate ultrasonic signals into the audible band.
How Ultrasonic Attacks Work
Generate the voice command
Use a TTS engine to synthesize the target command as a normal audio waveform (e.g., "Hey Siri, send a message").
Modulate onto an ultrasonic carrier
Amplitude-modulate the voice command onto a carrier frequency between 25 and 45 kHz. The carrier itself is inaudible to humans.
Transmit via ultrasonic speaker
Play the modulated signal through a speaker capable of ultrasonic output (piezoelectric transducers, parametric speakers).
Microphone nonlinearity demodulates
The target device's microphone and amplifier circuit introduce nonlinear distortion that demodulates the ultrasonic signal, reconstructing the original voice command in the audible frequency band.
ASR processes the demodulated command
The ASR system receives what appears to be a normal voice command and transcribes it.
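The demodulation step can be checked in simulation. The sketch below is an illustrative toy model, not tied to any particular hardware: it modulates a 1 kHz tone (standing in for speech) onto a 30 kHz carrier, applies a square-law nonlinearity as a minimal model of microphone/amplifier distortion, low-pass filters, and measures how well the baseband signal reappears.

```python
import numpy as np


def squarelaw_demod_demo(carrier_freq=30000, tone_freq=1000,
                         sample_rate=192000, duration=0.05):
    """Show that a square-law nonlinearity demodulates an AM ultrasonic
    signal: after low-pass filtering, the baseband 'command' reappears."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    baseband = 0.5 * (1 + np.sin(2 * np.pi * tone_freq * t))  # speech stand-in
    am = np.sin(2 * np.pi * carrier_freq * t) * (1 + 0.8 * baseband)

    # Square-law distortion: minimal model of mic/amp nonlinearity
    distorted = am + 0.5 * am ** 2

    # Crude low-pass (64-tap moving average): nulls the 30 kHz carrier
    # at this sample rate while keeping content below a few kHz
    kernel = np.ones(64) / 64
    recovered = np.convolve(distorted - distorted.mean(), kernel, mode="same")

    # Correlation between recovered signal and the original baseband tone
    reference = baseband - baseband.mean()
    return float(np.corrcoef(recovered, reference)[0, 1])
```

A correlation near 1.0 means the demodulated output tracks the hidden baseband -- which is exactly what the target's ASR front-end then receives.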
```python
import numpy as np


def create_ultrasonic_payload(command_audio, carrier_freq=25000,
                              sample_rate=96000):
    """
    Amplitude-modulate a voice command onto an ultrasonic carrier.

    Args:
        command_audio: numpy array of the voice command waveform
        carrier_freq: ultrasonic carrier frequency in Hz
        sample_rate: must be > 2 * carrier_freq (Nyquist)

    Returns:
        (modulated signal as int16 numpy array, sample_rate)
    """
    # Normalize command audio to [0, 1] for AM modulation
    command_normalized = (command_audio - command_audio.min()) / \
                         (command_audio.max() - command_audio.min())

    # Generate carrier wave
    t = np.arange(len(command_normalized)) / sample_rate
    carrier = np.sin(2 * np.pi * carrier_freq * t)

    # Amplitude modulation: carrier * (1 + modulation_depth * signal)
    modulation_depth = 0.8
    modulated = carrier * (1 + modulation_depth * command_normalized)

    # Normalize to 16-bit range
    modulated = np.int16(modulated / np.max(np.abs(modulated)) * 32767)
    return modulated, sample_rate
```

Adversarial Noise for ASR
Gradient-based adversarial attacks against ASR models add carefully computed noise to an audio signal so that the model produces an attacker-chosen transcription. The perturbation can be added to silence (producing an audio clip that sounds like noise but transcribes as a command) or to existing audio (producing a clip that sounds normal but transcribes differently).
Attack Approaches
With full access to the ASR model (weights, architecture, gradients), use CTC-loss optimization to find the minimal perturbation that produces the target transcription.
```python
import torch


def adversarial_asr_attack(model, audio, target_text, epsilon=0.02,
                           steps=1000, lr=0.001):
    """
    White-box adversarial attack against a CTC-based ASR model.

    Args:
        model: differentiable ASR model returning log-probabilities [1, T', C]
        audio: input audio tensor [1, T]
        target_text: desired transcription string
        epsilon: L-inf perturbation budget
        steps: optimization steps
        lr: learning rate for perturbation optimization
    """
    target_ids = model.tokenizer.encode(target_text)
    target_tensor = torch.tensor([target_ids])

    delta = torch.zeros_like(audio, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for step in range(steps):
        adv_audio = audio + delta
        log_probs = model(adv_audio)

        # CTC loss between model output and target transcription
        input_lengths = torch.tensor([log_probs.shape[1]])
        target_lengths = torch.tensor([len(target_ids)])
        loss = torch.nn.functional.ctc_loss(
            log_probs.transpose(0, 1), target_tensor,
            input_lengths, target_lengths
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Project delta back onto the epsilon-ball after each step
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)

    return (audio + delta).detach()
```

Without gradient access, use genetic algorithms, gradient-estimation methods such as NES, or transfer attacks from open-source ASR models (Whisper, DeepSpeech).
Key approach for black-box attacks:
- Train adversarial perturbations against an open-source surrogate (e.g., Whisper)
- Test transfer to the target system via API queries
- Use query-based refinement if the API returns confidence scores
Transfer rates from Whisper to commercial ASR APIs range from 15-40% depending on the target transcription length and the perturbation budget.
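The gradient estimator behind query-based refinement can be sketched with Natural Evolution Strategies. Everything here is illustrative: `loss_fn` stands in for whatever scalar the target API returns, such as the negative confidence of the target transcription.

```python
import numpy as np


def nes_gradient_estimate(loss_fn, audio, sigma=0.001, n_samples=100, rng=None):
    """Estimate the gradient of a black-box scalar loss with Natural
    Evolution Strategies: probe the API at audio +/- sigma * noise and
    weight each noise direction by the observed loss difference."""
    if rng is None:
        rng = np.random.default_rng(0)
    grad = np.zeros_like(audio)
    for _ in range(n_samples):
        noise = rng.standard_normal(audio.shape)
        # Antithetic pair halves the estimator variance per query budget
        diff = loss_fn(audio + sigma * noise) - loss_fn(audio - sigma * noise)
        grad += noise * diff
    return grad / (2 * sigma * n_samples)
```

Each sample costs two API queries; the estimated gradient then drives the same projected-descent loop as the white-box attack.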
Over-the-air attacks must survive speaker playback, room acoustics, and microphone capture. This requires:
- Room impulse response (RIR) simulation: Convolve the adversarial audio with simulated RIRs during optimization
- Larger perturbation budgets: Epsilon must increase 3-5x compared to digital attacks
- Band-limiting: Constrain perturbations to frequencies that speakers can reproduce (typically 100Hz-18kHz)
- Expectation over transformation (EoT): Optimize over random volume levels, background noise, and room conditions
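A toy version of the random playback transformation at the heart of EoT can be sketched as follows. The constants are illustrative: a synthetic exponential-decay RIR, a random gain, and a fixed noise floor stand in for measured room conditions.

```python
import numpy as np


def eot_transform(audio, rng):
    """Apply one random playback/room transformation for EoT training:
    random gain, convolution with a synthetic room impulse response,
    and a low-level ambient noise floor."""
    gain = rng.uniform(0.5, 1.5)
    # Synthetic RIR: white reflections under an exponential decay envelope
    rir_len = int(rng.integers(200, 800))
    rir = rng.standard_normal(rir_len) * np.exp(-np.arange(rir_len) / 100.0)
    rir /= np.abs(rir).sum()
    reverberant = np.convolve(audio * gain, rir, mode="full")[:len(audio)]
    return reverberant + 0.005 * rng.standard_normal(len(audio))
```

During optimization, the adversarial loss is averaged over many such draws, so the perturbation cannot overfit any single room or volume setting.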
Over-the-air adversarial audio attacks have success rates of 30-60% in controlled environments but drop significantly in noisy real-world settings.
Hidden Voice Commands
Hidden voice commands embed speech signals below the psychoacoustic masking threshold of a primary audio signal. The human ear cannot perceive the hidden speech, but the microphone captures the full signal and the ASR system transcribes both layers.
Psychoacoustic Masking Exploitation
| Parameter | Value | Effect |
|---|---|---|
| SNR threshold | -25 to -35 dB below primary | Below this, hidden speech is inaudible |
| Frequency masking range | Within 1/3-octave band of masker | Stronger masking for nearby frequencies |
| Temporal masking | 5-20ms after masker offset | Brief window where hidden signal is masked |
| Optimal embedding | Match hidden speech frequency content to masking signal | Maximizes perceptual invisibility |
```python
import numpy as np


def embed_hidden_command(cover_audio, command_audio, snr_db=-30):
    """
    Embed a hidden voice command below the masking threshold of cover audio.

    Args:
        cover_audio: primary audio signal (music, speech, etc.)
        command_audio: voice command to hide
        snr_db: signal-to-noise ratio (negative = command quieter than cover)
    """
    # Match lengths: truncate or zero-pad the command to the cover
    if len(command_audio) > len(cover_audio):
        command_audio = command_audio[:len(cover_audio)]
    else:
        command_audio = np.pad(command_audio,
                               (0, len(cover_audio) - len(command_audio)))

    # Scale the command to the target SNR relative to the cover
    cover_power = np.mean(cover_audio ** 2)
    command_power = np.mean(command_audio ** 2)
    scale = np.sqrt(cover_power / command_power * 10 ** (snr_db / 10))

    return cover_audio + scale * command_audio
```

Voice Cloning for Authentication Bypass
Voice cloning attacks synthesize a target speaker's voice to bypass speaker verification systems. Modern TTS and voice conversion models require as little as 3-10 seconds of reference audio.
Attack Methodology
Collect target voice samples
Gather recordings of the target speaker from public sources (conference talks, podcasts, social media videos, voicemail greetings). Aim for 10-30 seconds of clean speech.
Train or fine-tune a voice cloning model
Use an open-source voice cloning framework (e.g., Coqui TTS, OpenVoice, VALL-E variants) to create a model that generates speech in the target's voice. Zero-shot models require no fine-tuning but produce lower fidelity.
Generate authentication phrases
Synthesize the specific phrases required by the target system (e.g., "My voice is my password", a random passphrase, or a specific sentence).
Test against speaker verification
Submit the cloned audio to the authentication system. Record acceptance/rejection and confidence scores. Iterate on generation parameters (speaking rate, pitch variation, noise level) to maximize match scores.
Apply post-processing to defeat liveness detection
Add subtle room reverb, microphone frequency response simulation, and low-level background noise to make the cloned audio sound like a live recording rather than a clean synthesis.
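A sketch of this post-processing stage, with all constants illustrative -- a real attack would match the measured frequency response and noise floor of the claimed recording device:

```python
import numpy as np


def postprocess_cloned_audio(audio, rng=None):
    """Roughen clean TTS output so it resembles a live recording:
    a sparse reverb tail, gentle high-frequency rolloff, and a
    low-level noise floor."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Toy room impulse response: direct path plus a few decaying reflections
    rir = np.zeros(1600)
    rir[0] = 1.0
    taps = rng.integers(100, 1600, size=8)
    rir[taps] = rng.uniform(0.02, 0.1, size=8) * np.exp(-taps / 800.0)
    wet = np.convolve(audio, rir, mode="full")[:len(audio)]
    # 4-tap moving average approximates a microphone's treble rolloff
    wet = np.convolve(wet, np.ones(4) / 4, mode="same")
    # Consumer-microphone-like noise floor
    return wet + 1e-3 * rng.standard_normal(len(audio))
```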
Speaker Verification Evasion Techniques
| Defense | Evasion |
|---|---|
| Replay detection (channel analysis) | Simulate target microphone frequency response and add room impulse response |
| Liveness detection (breathing, lip noise) | Add synthesized breath sounds and micro-pauses |
| Challenge-response (random phrases) | Use real-time voice conversion to speak the phrase in the target's voice |
| Behavioral biometrics (cadence, hesitation) | Fine-tune the TTS model on longer samples to capture speaking style |
Real-Time Audio Manipulation
Real-time attacks operate on live audio streams -- intercepting, modifying, and forwarding audio with minimal latency. These target VoIP calls, live transcription, and real-time voice assistants.
Real-Time Attack Vectors
| Attack | Latency Budget | Use Case |
|---|---|---|
| Live voice conversion | <100ms | Impersonate a specific speaker during a live call |
| Real-time command injection | <50ms | Inject commands into a live audio stream being processed by ASR |
| Adversarial noise overlay | <20ms | Add a real-time perturbation that alters transcription of ongoing speech |
| Selective word replacement | <200ms | Detect and replace specific words in live transcription |
```python
import numpy as np
import pyaudio


def realtime_audio_injection(injection_signal, snr_db=-25,
                             chunk_size=1024, sample_rate=16000):
    """
    Real-time audio stream manipulation: mix an injection signal
    into live microphone input and output to a virtual audio device.
    """
    p = pyaudio.PyAudio()
    stream_in = p.open(format=pyaudio.paFloat32, channels=1,
                       rate=sample_rate, input=True,
                       frames_per_buffer=chunk_size)
    stream_out = p.open(format=pyaudio.paFloat32, channels=1,
                        rate=sample_rate, output=True,
                        frames_per_buffer=chunk_size)

    injection_idx = 0
    try:
        while True:
            # Read a live audio chunk
            data = np.frombuffer(stream_in.read(chunk_size),
                                 dtype=np.float32)

            # Mix in the injection signal at the target SNR
            if injection_idx < len(injection_signal):
                end_idx = min(injection_idx + chunk_size,
                              len(injection_signal))
                chunk_injection = injection_signal[injection_idx:end_idx]
                if len(chunk_injection) < chunk_size:
                    chunk_injection = np.pad(
                        chunk_injection,
                        (0, chunk_size - len(chunk_injection)))
                scale = np.sqrt(np.mean(data ** 2)
                                / np.mean(chunk_injection ** 2)
                                * 10 ** (snr_db / 10))
                data = data + scale * chunk_injection
                injection_idx = end_idx

            stream_out.write(data.astype(np.float32).tobytes())
    finally:
        stream_in.close()
        stream_out.close()
        p.terminate()
```

Red Team Assessment Framework
Enumerate audio input surfaces
Identify all points where the target accepts audio: microphone input, file upload, VoIP streams, voice authentication, audio analysis APIs. Note the ASR engine used if it is identifiable.
Test replay attacks first
Record and replay legitimate audio. If replay defeats voice authentication, sophisticated attacks are unnecessary. This establishes a baseline.
Test ultrasonic injection (physical access scenarios)
If the threat model includes physical proximity, test ultrasonic command injection at distances of 1m, 3m, and 5m against the target device.
Craft adversarial audio examples
Using an open-source ASR model as a surrogate, generate adversarial examples for 5-10 target phrases. Test transfer to the target system.
Test hidden voice commands
Embed commands at -25dB, -30dB, and -35dB SNR below cover audio. Determine the lowest SNR at which the target ASR still transcribes the hidden command.
Assess voice cloning impact
If the target uses speaker verification, collect publicly available voice samples and test whether cloned audio achieves authentication. Report the minimum sample duration needed.
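The hidden-command sweep described above can be wrapped in a small harness. This is a sketch: `embed` and `transcribe` are placeholders for your embedding routine and whatever transcription interface the target exposes.

```python
import numpy as np


def hidden_command_snr_sweep(cover, command, embed, transcribe,
                             target_text, snrs=(-25, -30, -35)):
    """Sweep embedding SNRs from loudest to quietest and report the
    lowest SNR at which the target ASR still transcribes the command."""
    transcripts = {}
    lowest_success = None
    for snr in sorted(snrs, reverse=True):
        text = transcribe(embed(cover, command, snr_db=snr))
        transcripts[snr] = text
        if target_text.lower() in text.lower():
            lowest_success = snr
    return lowest_success, transcripts
```

The returned per-SNR transcripts go directly into the engagement report alongside the lowest successful SNR.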
Related Topics
- Multimodal Attack Vectors -- Overview of all multimodal attack surfaces including image and document vectors
- Adversarial Perturbation Attacks -- Gradient-based attacks against vision encoders using analogous techniques
- Document-Based Injection -- Non-audio injection vectors through document formats
- Social Engineering & Human Factors -- Voice cloning in the context of social engineering attack chains
References
- Zhang et al., "DolphinAttack: Inaudible Voice Commands" (2017) -- Foundational ultrasonic injection research
- Carlini & Wagner, "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text" (2018) -- White-box ASR adversarial attacks
- Abdullah et al., "Practical Hidden Voice Attacks against Speech and Speaker Recognition Systems" (2019)
- Chen et al., "Real-Time Adversarial Attacks Against Deep Learning-Based Speech Recognition Systems" (2019)
- Wang et al., "ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech" (2020) -- Speaker verification attack benchmarks
- Schönherr et al., "Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding" (2019)
- Li et al., "Adversarial Music: Real World Audio Adversary Against Wake-word Detection System" (2019)