Adversarial Audio Examples
Techniques for crafting adversarial audio perturbations including psychoacoustic hiding, frequency domain attacks, and over-the-air adversarial audio.
Adversarial Audio Formulation
The adversarial audio problem: given a source audio signal x and a target transcription y_target, find a perturbation delta such that:
ASR(x + delta) = y_target
subject to: delta is imperceptible to humans
Unlike images where imperceptibility is measured by Lp norms, audio imperceptibility is governed by the human auditory system's masking properties.
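In gradient form, a targeted attack minimizes the model's loss on y_target over delta, subject to a norm bound as a crude imperceptibility proxy. A minimal time-domain PGD sketch of that formulation; `DummyASR` and `time_domain_pgd` are illustrative stand-ins, not a real recognizer or a published attack implementation:

```python
import torch
import torch.nn as nn

class DummyASR(nn.Module):
    """Stand-in for a differentiable ASR model: audio -> per-frame logits."""
    def __init__(self, frame: int = 160, vocab: int = 32):
        super().__init__()
        self.frame = frame
        self.proj = nn.Linear(frame, vocab)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        return self.proj(audio.view(-1, self.frame))  # (num_frames, vocab)

def time_domain_pgd(model, audio, target_ids, eps=0.01, step=1e-3, steps=100):
    """Targeted PGD: descend the loss on the target transcription."""
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(audio + delta), target_ids)
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()  # signed gradient descent
            delta.clamp_(-eps, eps)            # L-inf bound on the perturbation
        delta.grad.zero_()
    return (audio + delta).detach()

torch.manual_seed(0)
model = DummyASR()
audio = torch.randn(1600)             # 0.1 s of audio at 16 kHz
target = torch.randint(0, 32, (10,))  # desired token per 10 ms frame
adv = time_domain_pgd(model, audio, target)
```

Against real ASR models the plain L-inf bound leaves an audible hiss, which is exactly what the psychoacoustic machinery below is designed to replace.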
Psychoacoustic Hiding
Auditory Masking
Two types of masking make adversarial audio possible:
Simultaneous masking: A loud sound at one frequency makes nearby, quieter frequencies inaudible. A loud tone at 1 kHz masks sounds from roughly 800 Hz to 1.2 kHz that fall below a level-dependent threshold.
Temporal masking: A loud sound makes preceding (backward masking, ~5 ms) and following (forward masking, ~50-200 ms) sounds temporarily inaudible.
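A rough numerical illustration of simultaneous masking: using the Schroeder spreading function (the same formula that appears in the code below) together with a standard analytic Hz-to-Bark approximation (an assumption of this sketch, not used elsewhere on this page), a masker at 1 kHz is predicted to mask a 40 dB quieter probe at 1.05 kHz:

```python
import numpy as np

def hz_to_bark(f):
    # Common analytic Bark-scale approximation (assumption of this sketch)
    return 13 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def schroeder_spread_db(dz):
    # Schroeder spreading function: masking level (dB, relative to the
    # masker) at a distance of dz Bark from the masker
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1 + (dz + 0.474) ** 2)

masker_db = 0.0    # masker level (reference)
probe_db = -40.0   # probe is 40 dB quieter than the masker

dz = hz_to_bark(1050.0) - hz_to_bark(1000.0)            # ~0.3 Bark apart
masking_level_db = masker_db + schroeder_spread_db(dz)  # ~ -0.6 dB

# The probe lies ~40 dB below the spread masking level: predicted inaudible
masked = masking_level_db > probe_db
```

A perturbation that stays under this spread level at every frequency is the budget that the masking-threshold code below approximates bin by bin.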
import numpy as np
from scipy.signal import stft

def compute_masking_threshold(
    audio: np.ndarray,
    sample_rate: int = 16000,
    frame_size: int = 512
) -> np.ndarray:
    """
    Compute a simplified psychoacoustic masking threshold.

    Returns: threshold per frequency bin per frame.
    Perturbations below this threshold are inaudible.
    """
    # Compute STFT
    f, t, Zxx = stft(audio, fs=sample_rate, nperseg=frame_size)
    power_spectrum = np.abs(Zxx) ** 2

    # Simplified masking model (Schroeder spreading function)
    num_freq_bins = power_spectrum.shape[0]
    masking_threshold = np.zeros_like(power_spectrum)
    for frame_idx in range(power_spectrum.shape[1]):
        frame_power = power_spectrum[:, frame_idx]
        for i in range(num_freq_bins):
            if frame_power[i] < 1e-10:
                continue  # near-silent maskers contribute essentially nothing
            # Spread masking from masker bin i to neighboring bins j
            for j in range(num_freq_bins):
                # Crude Hz-to-Bark mapping: ~100 Hz per Bark
                bark_diff = abs(i - j) * (sample_rate / frame_size) / 100
                # Simplified spreading function (dB), floored at -100 dB
                spread = max(
                    -100,
                    15.81 + 7.5 * (bark_diff + 0.474)
                    - 17.5 * np.sqrt(1 + (bark_diff + 0.474) ** 2)
                )
                masking_threshold[j, frame_idx] += (
                    frame_power[i] * 10 ** (spread / 10)
                )
    return masking_threshold
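The triple loop above does O(F^2) Python work per frame and gets slow on long clips. Under the same simplifications, the spreading value depends only on |i - j|, so it can be precomputed as an F x F matrix and applied as one matrix product per signal. A sketch (the name `compute_masking_threshold_vectorized` is illustrative; the result should match the loop version up to its skipping of near-silent masker bins):

```python
import numpy as np
from scipy.signal import stft

def compute_masking_threshold_vectorized(
    audio: np.ndarray,
    sample_rate: int = 16000,
    frame_size: int = 512,
) -> np.ndarray:
    """Vectorized variant of the loop-based masking threshold."""
    _, _, Zxx = stft(audio, fs=sample_rate, nperseg=frame_size)
    power = np.abs(Zxx) ** 2  # (freq_bins, frames)

    # Pairwise distances with the same crude ~100 Hz-per-Bark mapping
    idx = np.arange(power.shape[0])
    bark = np.abs(idx[:, None] - idx[None, :]) * (sample_rate / frame_size) / 100

    # Schroeder spreading function in dB, floored at -100 dB, then linear
    spread_db = np.maximum(
        -100.0,
        15.81 + 7.5 * (bark + 0.474) - 17.5 * np.sqrt(1 + (bark + 0.474) ** 2),
    )
    spread = 10.0 ** (spread_db / 10.0)  # spread[j, i]: masker i -> bin j

    # threshold[j, t] = sum_i spread[j, i] * power[i, t]
    return spread @ power
```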
def psychoacoustic_perturbation(
    perturbation: np.ndarray,
    masking_threshold: np.ndarray,
    sample_rate: int = 16000,
    frame_size: int = 512
) -> np.ndarray:
    """
    Scale the perturbation to stay below the masking threshold.

    This keeps the perturbation inaudible while preserving as much of
    its magnitude as possible.
    """
    f, t, pert_stft = stft(perturbation, fs=sample_rate, nperseg=frame_size)
    pert_power = np.abs(pert_stft) ** 2

    # Scale down each time-frequency bin that exceeds the masking threshold
    scale = np.ones_like(pert_power)
    exceeds = pert_power > masking_threshold
    scale[exceeds] = np.sqrt(
        masking_threshold[exceeds] / (pert_power[exceeds] + 1e-10)
    )
    scaled_stft = pert_stft * scale

    # Inverse STFT to recover the time-domain perturbation
    from scipy.signal import istft
    _, scaled_perturbation = istft(scaled_stft, fs=sample_rate, nperseg=frame_size)
    return scaled_perturbation[:len(perturbation)]

Frequency Domain Attacks
Operating in the frequency domain (spectrogram space) rather than the time domain offers several advantages for audio adversarial attacks.
Spectrogram Perturbation
Since ASR models typically operate on mel spectrograms, perturbing in spectrogram space is more direct:
import torch
import torchaudio

def spectrogram_attack(
    model,
    audio: torch.Tensor,
    target_ids: torch.Tensor,
    num_steps: int = 500,
    step_size: float = 0.01
) -> torch.Tensor:
    """
    Attack in mel-spectrogram space for more efficient optimization.

    Assumes an encoder-decoder ASR model whose encoder consumes mel
    features directly. Note that a perturbed mel spectrogram cannot be
    exactly inverted back to a waveform, so this attack targets
    pipelines that accept spectrogram input.
    """
    # Convert to mel spectrogram
    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, n_mels=80
    )
    mel = mel_transform(audio)

    mel_delta = torch.zeros_like(mel, requires_grad=True)
    optimizer = torch.optim.Adam([mel_delta], lr=step_size)

    for step in range(num_steps):
        perturbed_mel = mel + mel_delta

        # Forward through the model (starting from mel features)
        logits = model.encoder(perturbed_mel)
        logits = model.decoder(logits, target_ids[:, :-1])

        # Teacher-forced cross-entropy against the target transcription
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            target_ids[:, 1:].reshape(-1)
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Constrain perturbation magnitude
        with torch.no_grad():
            mel_delta.data = torch.clamp(mel_delta.data, -2.0, 2.0)

    return (mel + mel_delta).detach()

Band-Limited Perturbation
Restricting perturbations to specific frequency bands can improve imperceptibility:
def band_limited_attack(
    perturbation: np.ndarray,
    sample_rate: int = 16000,
    low_freq: float = 4000,   # humans are less sensitive above 4 kHz
    high_freq: float = 8000
) -> np.ndarray:
    """Restrict adversarial perturbation to a specific frequency band."""
    from scipy.signal import butter, filtfilt

    nyquist = sample_rate / 2
    low = low_freq / nyquist
    high = high_freq / nyquist
    # 4th-order Butterworth band-pass, applied zero-phase
    b, a = butter(4, [low, high], btype='band')
    filtered = filtfilt(b, a, perturbation)
    return filtered

Over-the-Air Robustness
The biggest challenge for practical audio attacks is surviving physical playback: the perturbation must persist through a loudspeaker, room acoustics, and a microphone.
Room Impulse Response Simulation
Training adversarial audio to be robust to room acoustics:
def simulate_room_impulse(
    audio: np.ndarray,
    room_size: tuple = (5, 4, 3),     # meters (unused in this simplified version)
    source_pos: tuple = (2, 1, 1.5),  # (unused in this simplified version)
    mic_pos: tuple = (3, 3, 1.5),     # (unused in this simplified version)
    sample_rate: int = 16000
) -> np.ndarray:
    """
    Approximate room acoustics with a synthetic impulse response.
    Use during adversarial optimization to improve robustness.

    Simplified stand-in: in practice, use pyroomacoustics (image source
    method) with the room geometry above for an accurate simulation.
    """
    rt60 = 0.4  # reverberation time in seconds
    num_samples = int(rt60 * sample_rate)

    # Exponentially decaying noise approximates a room impulse response;
    # the constant 6.9 = ln(1000) makes the envelope decay 60 dB over rt60
    rir = np.random.randn(num_samples)
    rir *= np.exp(-np.arange(num_samples) / (rt60 * sample_rate / 6.9))
    rir[0] = 1.0  # direct path
    rir /= np.max(np.abs(rir))
    return np.convolve(audio, rir, mode='same')
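A quick sanity check on the synthetic impulse response (this snippet rebuilds the same envelope rather than calling the function, so it stands alone): with the decay constant 6.9 ≈ ln(1000), the noise envelope drops by 60 dB in amplitude over exactly rt60 seconds, which is the definition of RT60:

```python
import numpy as np

sample_rate = 16000
rt60 = 0.4
num_samples = int(rt60 * sample_rate)

# Same exponentially decaying envelope as in simulate_room_impulse:
# exp(-6.9) = 1/1000 in amplitude, i.e. a 60 dB drop over rt60 seconds
envelope = np.exp(-np.arange(num_samples) / (rt60 * sample_rate / 6.9))
rir = np.random.default_rng(0).standard_normal(num_samples) * envelope
rir[0] = 1.0  # direct path

decay_db = 20 * np.log10(envelope[-1])  # ~ -60 dB
```

If the decay constant were wrong, decay_db would drift away from -60 dB and the simulated reverberation time would no longer match rt60.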
def robust_adversarial_optimization(
    model,
    audio: torch.Tensor,
    target_ids: torch.Tensor,
    num_rooms: int = 5,
    num_steps: int = 1000
) -> torch.Tensor:
    """
    Optimize adversarial audio to be robust across multiple
    simulated room conditions (Expectation over Transformation).

    `apply_random_room_sim` and `compute_target_loss` are placeholders:
    the former applies a randomly sampled room simulation (e.g. a random
    RIR convolution, as above), the latter computes the targeted ASR loss.
    """
    delta = torch.zeros_like(audio, requires_grad=True)
    for step in range(num_steps):
        total_loss = 0.0
        for _ in range(num_rooms):
            # Average the targeted loss over randomly sampled rooms
            room_audio = apply_random_room_sim(audio + delta)
            logits = model(room_audio)
            loss = compute_target_loss(logits, target_ids)
            total_loss += loss / num_rooms
        total_loss.backward()
        with torch.no_grad():
            delta.data -= 0.001 * delta.grad.sign()  # signed gradient step
            delta.data = torch.clamp(delta.data, -0.02, 0.02)
            delta.grad.zero_()
    return (audio + delta).detach()

Comparison of Attack Approaches
| Approach | Imperceptibility | Digital Success | OTA Success | Computation |
|---|---|---|---|---|
| Time-domain PGD | Low | High | Low | Medium |
| Psychoacoustic PGD | High | High | Low-Medium | High |
| Spectrogram attack | Medium | High | Low | Medium |
| Band-limited | Medium-High | Medium | Medium | Medium |
| Room-robust (EoT) | Medium | Medium-High | Medium-High | Very High |
Related Topics
- Speech Recognition Attacks -- higher-level ASR attack strategies
- Adversarial Image Examples for VLMs -- parallel concepts in the visual domain
- Lab: Crafting Audio Adversarial Examples -- hands-on practice
References
- "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text" - Carlini & Wagner (2018) - Foundational targeted adversarial audio attack methodology
- "Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition" - Qin et al. (2019) - Psychoacoustic masking for imperceptible audio attacks
- "Robust Audio Adversarial Example for a Physical Attack" - Yakura & Sakuma (2019) - Room-robust adversarial audio using Expectation over Transformation
- "AdvPulse: Universal, Synchronization-free, and Targeted Audio Adversarial Attacks via Subsecond Perturbations" - Li et al. (2020) - Universal adversarial audio perturbation techniques