Adversarial Audio Examples
Techniques for crafting adversarial audio perturbations including psychoacoustic hiding, frequency domain attacks, and over-the-air adversarial audio.
Adversarial Audio Formulation
The adversarial audio problem: given a source audio signal x and a target transcription y_target, find a perturbation delta such that:
ASR(x + delta) = y_target
subject to: delta is imperceptible to humans
Unlike images where imperceptibility is measured by Lp norms, audio imperceptibility is governed by the human auditory system's masking properties.
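In gradient form, a targeted attack minimizes the model's loss on y_target over delta, subject to a norm bound as a crude imperceptibility proxy. A minimal time-domain PGD sketch of that formulation; `DummyASR` and `time_domain_pgd` are illustrative stand-ins, not a real recognizer or a published attack implementation:

```python
import torch
import torch.nn as nn

class DummyASR(nn.Module):
    """Stand-in for a differentiable ASR model: audio -> per-frame logits."""
    def __init__(self, frame: int = 160, vocab: int = 32):
        super().__init__()
        self.frame = frame
        self.proj = nn.Linear(frame, vocab)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        return self.proj(audio.view(-1, self.frame))  # (num_frames, vocab)

def time_domain_pgd(model, audio, target_ids, eps=0.01, step=1e-3, steps=100):
    """Targeted PGD: descend the loss on the target transcription."""
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(audio + delta), target_ids)
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()  # signed gradient descent
            delta.clamp_(-eps, eps)            # L-inf bound on the perturbation
        delta.grad.zero_()
    return (audio + delta).detach()

torch.manual_seed(0)
model = DummyASR()
audio = torch.randn(1600)             # 0.1 s of audio at 16 kHz
target = torch.randint(0, 32, (10,))  # desired token per 10 ms frame
adv = time_domain_pgd(model, audio, target)
```

Against real ASR models the plain L-inf bound leaves an audible hiss, which is exactly what the psychoacoustic machinery below is designed to replace.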
Psychoacoustic Hiding
Auditory Masking
Two types of masking make adversarial audio possible:
Simultaneous masking: A loud sound at one frequency makes nearby, quieter frequencies inaudible. A loud tone at 1 kHz masks sounds from roughly 800 Hz to 1.2 kHz that fall below a level-dependent threshold.
Temporal masking: A loud sound makes preceding (backward masking, ~5 ms) and following (forward masking, ~50-200 ms) sounds temporarily inaudible.
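A rough numerical illustration of simultaneous masking: using the Schroeder spreading function (the same formula that appears in the code below) together with a standard analytic Hz-to-Bark approximation (an assumption of this sketch, not used elsewhere on this page), a masker at 1 kHz is predicted to mask a 40 dB quieter probe at 1.05 kHz:

```python
import numpy as np

def hz_to_bark(f):
    # Common analytic Bark-scale approximation (assumption of this sketch)
    return 13 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def schroeder_spread_db(dz):
    # Schroeder spreading function: masking level (dB, relative to the
    # masker) at a distance of dz Bark from the masker
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1 + (dz + 0.474) ** 2)

masker_db = 0.0    # masker level (reference)
probe_db = -40.0   # probe is 40 dB quieter than the masker

dz = hz_to_bark(1050.0) - hz_to_bark(1000.0)            # ~0.3 Bark apart
masking_level_db = masker_db + schroeder_spread_db(dz)  # ~ -0.6 dB

# The probe lies ~40 dB below the spread masking level: predicted inaudible
masked = masking_level_db > probe_db
```

A perturbation that stays under this spread level at every frequency is the budget that the masking-threshold code below approximates bin by bin.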
import numpy as np
from scipy.signal import stft

def compute_masking_threshold(
    audio: np.ndarray,
    sample_rate: int = 16000,
    frame_size: int = 512
) -> np.ndarray:
    """
    Compute a simplified psychoacoustic masking threshold.

    Returns: threshold per frequency bin per frame.
    Perturbations below this threshold are inaudible.
    """
    # Compute STFT
    f, t, Zxx = stft(audio, fs=sample_rate, nperseg=frame_size)
    power_spectrum = np.abs(Zxx) ** 2

    # Simplified masking model (Schroeder spreading function)
    num_freq_bins = power_spectrum.shape[0]
    masking_threshold = np.zeros_like(power_spectrum)
    for frame_idx in range(power_spectrum.shape[1]):
        frame_power = power_spectrum[:, frame_idx]
        for i in range(num_freq_bins):
            if frame_power[i] < 1e-10:
                continue  # near-silent maskers contribute essentially nothing
            # Spread masking from masker bin i to neighboring bins j
            for j in range(num_freq_bins):
                # Crude Hz-to-Bark mapping: ~100 Hz per Bark
                bark_diff = abs(i - j) * (sample_rate / frame_size) / 100
                # Simplified spreading function (dB), floored at -100 dB
                spread = max(
                    -100,
                    15.81 + 7.5 * (bark_diff + 0.474)
                    - 17.5 * np.sqrt(1 + (bark_diff + 0.474) ** 2)
                )
                masking_threshold[j, frame_idx] += (
                    frame_power[i] * 10 ** (spread / 10)
                )
    return masking_threshold
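The triple loop above does O(F^2) Python work per frame and gets slow on long clips. Under the same simplifications, the spreading value depends only on |i - j|, so it can be precomputed as an F x F matrix and applied as one matrix product per signal. A sketch (the name `compute_masking_threshold_vectorized` is illustrative; the result should match the loop version up to its skipping of near-silent masker bins):

```python
import numpy as np
from scipy.signal import stft

def compute_masking_threshold_vectorized(
    audio: np.ndarray,
    sample_rate: int = 16000,
    frame_size: int = 512,
) -> np.ndarray:
    """Vectorized variant of the loop-based masking threshold."""
    _, _, Zxx = stft(audio, fs=sample_rate, nperseg=frame_size)
    power = np.abs(Zxx) ** 2  # (freq_bins, frames)

    # Pairwise distances with the same crude ~100 Hz-per-Bark mapping
    idx = np.arange(power.shape[0])
    bark = np.abs(idx[:, None] - idx[None, :]) * (sample_rate / frame_size) / 100

    # Schroeder spreading function in dB, floored at -100 dB, then linear
    spread_db = np.maximum(
        -100.0,
        15.81 + 7.5 * (bark + 0.474) - 17.5 * np.sqrt(1 + (bark + 0.474) ** 2),
    )
    spread = 10.0 ** (spread_db / 10.0)  # spread[j, i]: masker i -> bin j

    # threshold[j, t] = sum_i spread[j, i] * power[i, t]
    return spread @ power
```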
def psychoacoustic_perturbation(
    perturbation: np.ndarray,
    masking_threshold: np.ndarray,
    sample_rate: int = 16000,
    frame_size: int = 512
) -> np.ndarray:
    """
    Scale the perturbation to stay below the masking threshold.

    This keeps the perturbation inaudible while preserving as much of
    its magnitude as possible.
    """
    f, t, pert_stft = stft(perturbation, fs=sample_rate, nperseg=frame_size)
    pert_power = np.abs(pert_stft) ** 2

    # Scale down each time-frequency bin that exceeds the masking threshold
    scale = np.ones_like(pert_power)
    exceeds = pert_power > masking_threshold
    scale[exceeds] = np.sqrt(
        masking_threshold[exceeds] / (pert_power[exceeds] + 1e-10)
    )
    scaled_stft = pert_stft * scale

    # Inverse STFT to recover the time-domain perturbation
    from scipy.signal import istft
    _, scaled_perturbation = istft(scaled_stft, fs=sample_rate, nperseg=frame_size)
    return scaled_perturbation[:len(perturbation)]

Frequency Domain Attacks
Operating in the frequency domain (spectrogram space) rather than the time domain offers several advantages for audio adversarial attacks.
Spectrogram Perturbation
Since ASR models typically operate on mel spectrograms, perturbing in spectrogram space is more direct:
import torch
import torchaudio

def spectrogram_attack(
    model,
    audio: torch.Tensor,
    target_ids: torch.Tensor,
    num_steps: int = 500,
    step_size: float = 0.01
) -> torch.Tensor:
    """
    Attack in mel-spectrogram space for more efficient optimization.

    Assumes an encoder-decoder ASR model whose encoder consumes mel
    features directly. Note that a perturbed mel spectrogram cannot be
    exactly inverted back to a waveform, so this attack targets
    pipelines that accept spectrogram input.
    """
    # Convert to mel spectrogram
    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, n_mels=80
    )
    mel = mel_transform(audio)

    mel_delta = torch.zeros_like(mel, requires_grad=True)
    optimizer = torch.optim.Adam([mel_delta], lr=step_size)

    for step in range(num_steps):
        perturbed_mel = mel + mel_delta

        # Forward through the model (starting from mel features)
        logits = model.encoder(perturbed_mel)
        logits = model.decoder(logits, target_ids[:, :-1])

        # Teacher-forced cross-entropy against the target transcription
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            target_ids[:, 1:].reshape(-1)
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Constrain perturbation magnitude
        with torch.no_grad():
            mel_delta.data = torch.clamp(mel_delta.data, -2.0, 2.0)

    return (mel + mel_delta).detach()

Band-Limited Perturbation
Restricting perturbations to specific frequency bands can improve imperceptibility:
def band_limited_attack(
    perturbation: np.ndarray,
    sample_rate: int = 16000,
    low_freq: float = 4000,   # humans are less sensitive above 4 kHz
    high_freq: float = 8000
) -> np.ndarray:
    """Restrict adversarial perturbation to a specific frequency band."""
    from scipy.signal import butter, filtfilt

    nyquist = sample_rate / 2
    low = low_freq / nyquist
    high = high_freq / nyquist
    # 4th-order Butterworth band-pass, applied zero-phase
    b, a = butter(4, [low, high], btype='band')
    filtered = filtfilt(b, a, perturbation)
    return filtered

Over-the-Air Robustness
The biggest challenge for practical audio attacks is surviving physical playback: the perturbation must persist through a loudspeaker, room acoustics, and a microphone.
Room Impulse Response Simulation
Training adversarial audio to be robust to room acoustics:
def simulate_room_impulse(
    audio: np.ndarray,
    room_size: tuple = (5, 4, 3),     # meters (unused in this simplified version)
    source_pos: tuple = (2, 1, 1.5),  # (unused in this simplified version)
    mic_pos: tuple = (3, 3, 1.5),     # (unused in this simplified version)
    sample_rate: int = 16000
) -> np.ndarray:
    """
    Approximate room acoustics with a synthetic impulse response.
    Use during adversarial optimization to improve robustness.

    Simplified stand-in: in practice, use pyroomacoustics (image source
    method) with the room geometry above for an accurate simulation.
    """
    rt60 = 0.4  # reverberation time in seconds
    num_samples = int(rt60 * sample_rate)

    # Exponentially decaying noise approximates a room impulse response;
    # the constant 6.9 = ln(1000) makes the envelope decay 60 dB over rt60
    rir = np.random.randn(num_samples)
    rir *= np.exp(-np.arange(num_samples) / (rt60 * sample_rate / 6.9))
    rir[0] = 1.0  # direct path
    rir /= np.max(np.abs(rir))
    return np.convolve(audio, rir, mode='same')
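A quick sanity check on the synthetic impulse response (this snippet rebuilds the same envelope rather than calling the function, so it stands alone): with the decay constant 6.9 ≈ ln(1000), the noise envelope drops by 60 dB in amplitude over exactly rt60 seconds, which is the definition of RT60:

```python
import numpy as np

sample_rate = 16000
rt60 = 0.4
num_samples = int(rt60 * sample_rate)

# Same exponentially decaying envelope as in simulate_room_impulse:
# exp(-6.9) = 1/1000 in amplitude, i.e. a 60 dB drop over rt60 seconds
envelope = np.exp(-np.arange(num_samples) / (rt60 * sample_rate / 6.9))
rir = np.random.default_rng(0).standard_normal(num_samples) * envelope
rir[0] = 1.0  # direct path

decay_db = 20 * np.log10(envelope[-1])  # ~ -60 dB
```

If the decay constant were wrong, decay_db would drift away from -60 dB and the simulated reverberation time would no longer match rt60.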
def robust_adversarial_optimization(
    model,
    audio: torch.Tensor,
    target_ids: torch.Tensor,
    num_rooms: int = 5,
    num_steps: int = 1000
) -> torch.Tensor:
    """
    Optimize adversarial audio to be robust across multiple
    simulated room conditions (Expectation over Transformation).

    `apply_random_room_sim` and `compute_target_loss` are placeholders:
    the former applies a randomly sampled room simulation (e.g. a random
    RIR convolution, as above), the latter computes the targeted ASR loss.
    """
    delta = torch.zeros_like(audio, requires_grad=True)
    for step in range(num_steps):
        total_loss = 0.0
        for _ in range(num_rooms):
            # Average the targeted loss over randomly sampled rooms
            room_audio = apply_random_room_sim(audio + delta)
            logits = model(room_audio)
            loss = compute_target_loss(logits, target_ids)
            total_loss += loss / num_rooms
        total_loss.backward()
        with torch.no_grad():
            delta.data -= 0.001 * delta.grad.sign()  # signed gradient step
            delta.data = torch.clamp(delta.data, -0.02, 0.02)
            delta.grad.zero_()
    return (audio + delta).detach()

Comparison of Attack Approaches
| Approach | Imperceptibility | Digital Success | OTA Success | Computation |
|---|---|---|---|---|
| Time-domain PGD | Low | High | Low | Medium |
| Psychoacoustic PGD | High | High | Low-Medium | High |
| Spectrogram attack | Medium | High | Low | Medium |
| Band-limited | Medium-High | Medium | Medium | Medium |
| Room-robust (EoT) | Medium | Medium-High | Medium-High | Very High |
Related Topics
- Speech Recognition Attacks -- higher-level ASR attack strategies
- Adversarial Image Examples for VLMs -- parallel concepts in the visual domain
- Lab: Crafting Audio Adversarial Examples -- hands-on practice
References
- "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text" - Carlini & Wagner (2018) - Foundational targeted adversarial audio attack methodology
- "Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition" - Qin et al. (2019) - Psychoacoustic masking for imperceptible audio attacks
- "Robust Audio Adversarial Example for a Physical Attack" - Yakura & Sakuma (2019) - Room-robust adversarial audio using Expectation over Transformation
- "AdvPulse: Universal, Synchronization-free, and Targeted Audio Adversarial Attacks via Subsecond Perturbations" - Li et al. (2020) - Universal adversarial audio perturbation techniques