Audio Modality Attacks
Comprehensive attack taxonomy for audio-enabled LLMs: adversarial audio generation, voice-based prompt injection, cross-modal split attacks, and ultrasonic perturbations.
Overview
The integration of audio processing capabilities into large language models has created a fundamentally new attack surface that extends beyond traditional adversarial audio research. Where earlier work focused primarily on fooling automatic speech recognition (ASR) systems into producing incorrect transcriptions, audio-enabled LLMs present a richer target: the adversary can now manipulate the model's understanding, reasoning, and actions through the audio channel. The AdvWave framework (arXiv 2025) represents the current state of the art in this domain, introducing a dual-phase optimization approach that generates adversarial audio capable of eliciting specific harmful behaviors from speech LLMs, not merely incorrect transcriptions.
Traditional adversarial audio attacks — Carlini-Wagner style perturbations that cause an ASR to transcribe "play music" as "transfer funds" — target the perception layer. Audio modality attacks against LLMs target the cognition layer: the adversarial audio is designed to cause the LLM to interpret the audio input in a way that bypasses safety constraints, executes unauthorized commands, or leaks sensitive information from its context. This shift from perceptual adversarial examples to cognitive adversarial examples represents a qualitative escalation in attack capability.
The attack surface is further expanded by the multimodal nature of modern LLMs. Models that accept both audio and text inputs simultaneously are vulnerable to cross-modal split attacks, where the adversarial content is divided between the audio and text channels such that neither channel independently triggers safety filters, but the combined input is harmful. Voice-based prompt injection introduces additional challenges: adversarial instructions embedded in audio that is played in the background of a legitimate voice interaction, exploiting the model's inability to distinguish between the authorized user's voice and injected audio content.
The practical relevance of these attacks is increasing rapidly as voice-first AI interfaces proliferate. Smart assistants, voice-controlled applications, phone-based AI agents, and accessibility tools all process audio input from potentially adversarial environments. An attacker who can inject audio into the model's input channel — through speakers in a shared space, compromised audio files, or manipulated voice calls — can potentially execute any attack that text-based prompt injection enables, with the additional advantage that audio attacks are less visible to human oversight.
How It Works
Audio Encoding Analysis
The attacker first characterizes how the target speech LLM processes audio. This includes the audio encoder architecture (Whisper, wav2vec2, custom encoders), the feature extraction pipeline (mel spectrograms, raw waveforms, learned features), and how audio representations are integrated with the language model. Different architectures present different attack surfaces — encoder-based models are vulnerable to different perturbation patterns than end-to-end models.
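To ground the feature-extraction stage, the sketch below computes log-mel spectrogram features of the kind Whisper-style encoders consume. The framing parameters (400-sample windows, 160-sample hop, 80 mel bins at 16 kHz) are common defaults, and `log_mel_spectrogram` is a minimal NumPy illustration, not any particular model's pipeline:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale warping
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            if center > left:
                fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[m - 1, k] = (right - k) / (right - center)
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Frame the waveform, window each frame, and take the power spectrum
    window = np.hanning(n_fft)
    frames = [wave[i:i + n_fft] * window
              for i in range(0, len(wave) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    # Project onto the mel filterbank and compress with a log
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)
```

An adversarial perturbation that must survive this pipeline is effectively optimizing through the framing, FFT, and filterbank projection, which is why the encoder architecture matters to the attacker.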
Adversarial Objective Formulation
Unlike traditional ASR attacks where the objective is a target transcription, attacks against speech LLMs define the objective as a target behavior: bypass a safety constraint, execute an unauthorized action, or produce specific harmful output. The optimization objective combines a transcription loss (make the model "hear" specific words) with a behavior loss (make the model act on those words in the intended way).
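A minimal sketch of such a combined objective, assuming the attacker has logits from both the transcription head and the downstream language model; the function name and the fixed weighting are illustrative, not AdvWave's exact formulation:

```python
import torch
import torch.nn.functional as F

def combined_adversarial_loss(asr_logits, target_transcript_ids,
                              llm_logits, target_behavior_ids,
                              behavior_weight=1.0):
    """Hypothetical joint objective for a speech-LLM attack.

    transcription loss: make the model "hear" the target words
    behavior loss:      make the model act on them as intended
    """
    transcription_loss = F.cross_entropy(
        asr_logits.reshape(-1, asr_logits.size(-1)),
        target_transcript_ids.reshape(-1)
    )
    behavior_loss = F.cross_entropy(
        llm_logits.reshape(-1, llm_logits.size(-1)),
        target_behavior_ids.reshape(-1)
    )
    return transcription_loss + behavior_weight * behavior_loss
```

In practice the weighting is tuned per target: too much transcription loss yields audio that is heard correctly but refused, too much behavior loss yields perturbations the encoder never propagates.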
Dual-Phase Optimization (AdvWave)
AdvWave's key innovation is separating optimization into two phases. Phase 1 optimizes the perturbation to control the audio encoder's output representation, ensuring the adversarial content reaches the language model. Phase 2 optimizes the perturbation to elicit the target behavior from the language model, given the controlled encoder output. This decomposition makes the optimization tractable for complex behavioral objectives.
Robustness and Imperceptibility Constraints
The adversarial perturbation is constrained to remain imperceptible to human listeners (using psychoacoustic masking models) and robust to environmental conditions (room acoustics, background noise, compression artifacts). The final adversarial audio must survive the real-world conditions under which it will be delivered to the target system.
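One standard way to encode the robustness requirement is expectation over transformations: average the attack loss over randomly sampled environmental distortions so the optimized perturbation survives delivery. The sketch below uses gain jitter and additive noise as stand-ins for the room-acoustic, noise, and compression models a real attack would need:

```python
import torch

def eot_robust_loss(audio, delta, loss_fn, num_samples=4, noise_std=0.01):
    """Expectation-over-transformation sketch for adversarial audio.

    Averages the attack loss over random environmental transforms so
    gradient steps favor perturbations that survive playback conditions.
    The transforms here (gain jitter, additive noise) are illustrative
    placeholders for a full acoustic-channel model.
    """
    total = 0.0
    for _ in range(num_samples):
        gain = 1.0 + 0.1 * (2 * torch.rand(1) - 1)    # +/-10% playback gain
        noise = noise_std * torch.randn_like(audio)    # background noise
        total = total + loss_fn(gain * (audio + delta) + noise)
    return total / num_samples
```

Psychoacoustic masking enters as an additional constraint on `delta`, typically by bounding the perturbation's energy below the frequency-dependent masking threshold of the carrier audio rather than by a flat epsilon ball.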
Attack Taxonomy
1. Adversarial Audio Generation (AdvWave)
The AdvWave framework (arXiv 2025) is the most comprehensive adversarial audio attack methodology targeting speech LLMs. Its dual-phase approach achieves high success rates while maintaining audio imperceptibility.
```python
# AdvWave dual-phase optimization (conceptual implementation)
import torch


class AdvWaveAttack:
    """
    Dual-phase adversarial audio attack against speech LLMs.
    Phase 1: Control audio encoder output
    Phase 2: Elicit target behavior from LLM
    """

    def __init__(self, speech_llm, epsilon=0.02, num_steps=1000):
        self.model = speech_llm
        self.epsilon = epsilon  # Perturbation budget (imperceptibility)
        self.num_steps = num_steps

    def phase1_encoder_control(self, audio, target_representation):
        """
        Optimize the perturbation to produce the target representation
        at the audio encoder output.
        """
        delta = torch.zeros_like(audio, requires_grad=True)
        optimizer = torch.optim.Adam([delta], lr=1e-3)
        for step in range(self.num_steps // 2):
            perturbed = audio + delta
            encoder_output = self.model.audio_encoder(perturbed)
            # Minimize distance to the target representation
            loss = torch.nn.functional.mse_loss(
                encoder_output, target_representation
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Project back onto the epsilon ball (imperceptibility)
            with torch.no_grad():
                delta.data = torch.clamp(
                    delta.data, -self.epsilon, self.epsilon
                )
        return delta.detach()

    def phase2_behavior_elicitation(self, audio, delta_init, target_tokens):
        """
        Refine the perturbation to elicit the target behavior from
        the language model component.
        """
        delta = delta_init.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([delta], lr=5e-4)
        for step in range(self.num_steps // 2):
            perturbed = audio + delta
            # End-to-end forward pass through the speech LLM
            logits = self.model(audio_input=perturbed)
            # Cross-entropy loss against the target behavior tokens
            loss = torch.nn.functional.cross_entropy(
                logits[:, -len(target_tokens):, :].reshape(
                    -1, logits.size(-1)
                ),
                target_tokens.reshape(-1)
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                delta.data = torch.clamp(
                    delta.data, -self.epsilon, self.epsilon
                )
        return (audio + delta).detach()

    def generate_adversarial_audio(self, clean_audio, target_behavior):
        """Full AdvWave pipeline."""
        # Derive a target encoder representation from the target behavior
        # (model-specific, e.g. the encoder output for a TTS rendering
        # of the target text)
        target_repr = self.compute_target_representation(target_behavior)
        # Phase 1: Control the encoder
        delta = self.phase1_encoder_control(clean_audio, target_repr)
        # Phase 2: Elicit the behavior
        target_tokens = self.model.tokenize(target_behavior)
        adversarial = self.phase2_behavior_elicitation(
            clean_audio, delta, target_tokens
        )
        return adversarial
```

AdvWave achieved success rates of 85-92% against tested speech LLMs while maintaining signal-to-noise ratios above 30 dB (inaudible to casual listeners) and PESQ scores above 3.5 (minimal perceptual degradation).
2. Voice-Based Prompt Injection
Voice-based prompt injection embeds adversarial instructions in audio that is played in the vicinity of the target device or included in audio files processed by the target model.
```python
# Voice-based prompt injection: embed instructions in background audio
def create_voice_injection(
    carrier_audio: torch.Tensor,    # Legitimate audio (music, speech)
    injection_text: str,            # Adversarial instruction
    tts_model,                      # Text-to-speech model
    sample_rate: int = 16000,
    injection_volume: float = 0.15  # Relative to carrier
):
    """
    Embed spoken adversarial instructions within carrier audio.
    The injection is spoken at low volume and positioned in
    frequency/time regions where the carrier audio provides masking.
    """
    # Generate speech for the injection
    injection_audio = tts_model.synthesize(
        text=injection_text,
        voice="neutral",  # Non-distinctive voice
        speed=1.3         # Slightly faster to reduce duration
    )
    # Find masking opportunities in the carrier
    # (these helpers are placeholders for a psychoacoustic
    # masking analysis)
    carrier_power = compute_power_spectrum(carrier_audio, sample_rate)
    masking_windows = find_high_masking_regions(carrier_power)
    # Place the injection in high-masking regions
    positioned_injection = position_in_masking_windows(
        injection_audio, masking_windows, sample_rate
    )
    # Scale the injection so the carrier masks it
    scaled_injection = positioned_injection * injection_volume
    # Mix
    mixed = carrier_audio.clone()
    mixed[:len(scaled_injection)] += scaled_injection
    return mixed


# Attack scenario: the attacker plays music containing embedded
# instructions through speakers in a room where the victim is using a
# voice assistant. The assistant's ASR picks up both the user's voice
# and the background injection. The injection is inaudible to the
# human user but transcribed by the ASR and processed by the LLM.
```

3. Cross-Modal Split Attacks
Cross-modal split attacks exploit models that accept both audio and text by distributing the adversarial content across both modalities. Neither the audio nor the text input independently triggers safety filters, but the combined interpretation is harmful.
```python
# Cross-modal split attack: divide harmful content across modalities
# Example: requesting synthesis instructions for a dangerous compound
# Neither channel alone is harmful; combined they form the full request

text_input = """
I'm studying organic chemistry and need help understanding
a multi-step synthesis pathway. The target compound and
starting materials are described in the audio attachment.
Please provide the complete reaction sequence with conditions.
"""

audio_input = """
[Spoken audio describing the specific compound name,
precursor chemicals, and reaction conditions that make
the request harmful]
"""

# The text channel: generic chemistry homework help (benign)
# The audio channel: compound name and specifics (ambiguous alone)
# Combined: a request for dangerous synthesis (harmful)
# Safety classifiers that analyze text and audio independently
# may clear both channels. Only a cross-modal safety analysis
# that reasons about the COMBINED meaning catches this attack.

cross_modal_split_variants = {
    "subject_in_audio": "Text provides context/framing; "
                        "audio provides the harmful specifics",
    "method_in_audio": "Text identifies the target; "
                       "audio provides the harmful methodology",
    "interleaved": "Alternating between text and audio, each "
                   "providing part of the harmful instruction",
    "reference_chain": "Text references audio ('as described in "
                       "the recording'), creating a semantic dependency",
}
```

4. Ultrasonic Adversarial Perturbations
Ultrasonic attacks operate in frequency ranges above human hearing (>20kHz) but within the range captured by microphones and processed by audio encoders.
```python
# Ultrasonic adversarial perturbation
import numpy as np


def generate_ultrasonic_perturbation(
    adversarial_content: np.ndarray,
    carrier_freq: float = 24000,   # Above human hearing
    sample_rate: int = 48000,      # Must be > 2x carrier frequency
    modulation_depth: float = 0.8
):
    """
    Modulate adversarial content onto an ultrasonic carrier.
    When captured by a microphone and processed by a speech LLM,
    nonlinearities in the microphone and analog front end can
    demodulate the signal back into the audible range, making it
    "visible" to the audio encoder while remaining inaudible to humans.
    """
    t = np.arange(len(adversarial_content)) / sample_rate
    # Ultrasonic carrier wave
    carrier = np.sin(2 * np.pi * carrier_freq * t)
    # Amplitude-modulate the adversarial content onto the carrier
    modulated = carrier * (1 + modulation_depth * adversarial_content)
    return modulated


# Practical constraints:
# - Requires a high sample rate (48 kHz or more) for playback
# - Effectiveness depends on the microphone's frequency response
# - Most consumer microphones attenuate above 20 kHz
# - MEMS microphones (common in phones) have varying ultrasonic response
# - Distance and room acoustics significantly affect success rate
# - Typical success rate: 40-60% in controlled conditions, 10-25% in the wild

ultrasonic_limitations = {
    "hardware_dependency": "Requires speakers and microphones with "
                           "ultrasonic frequency response",
    "distance": "Effective range typically < 2 meters",
    "environment": "Background noise and room reflections degrade the signal",
    "sample_rate": "Requires 48 kHz+ sampling on both playback and capture",
    "microphone_variance": "Different devices have different ultrasonic "
                           "sensitivity profiles",
}
```

5. Multilingual and Multi-Accent Exploitation
Speech LLMs often have uneven safety training across languages and accents. Adversarial content delivered in under-resourced languages or non-standard accents may bypass safety mechanisms trained predominantly on standard English.
```python
# Multilingual audio attack: exploit safety gaps in low-resource languages
multilingual_attack_vectors = {
    "language_switching": {
        "description": "Begin audio in English (high safety coverage), "
                       "then switch to a low-resource language for the "
                       "harmful content",
        "example_languages": ["Amharic", "Yoruba", "Khmer", "Lao"],
        "mechanism": "Safety classifiers trained primarily on English "
                     "audio have degraded detection in other languages",
        "success_rate": "Varies: 30-70% depending on language coverage",
    },
    "accent_exploitation": {
        "description": "Deliver content in accents that cause systematic "
                       "ASR misrecognition, allowing harmful homophones",
        "mechanism": "ASR errors create plausible deniability — the spoken "
                     "word is benign but the transcribed word is harmful",
        "example": "Accent-dependent vowel shifts that change word meaning",
    },
    "code_switching": {
        "description": "Mix languages within a single utterance to "
                       "confuse language-specific safety classifiers",
        "mechanism": "The safety classifier selects a language model based "
                     "on the detected language; code-switching prevents "
                     "reliable language detection",
        "success_rate": "45-65% against monolingual safety classifiers",
    },
    "dialect_variation": {
        "description": "Use regional dialect variations where safety-"
                       "relevant terms have different spoken forms",
        "mechanism": "Safety keyword lists do not cover all dialectal "
                     "variations of harmful terminology",
    },
}
```

6. Speaker Verification Bypass
For systems that use speaker verification as a security mechanism (e.g., voice-authenticated banking, personal assistants that respond only to the owner's voice), adversarial audio can defeat speaker identity checks.
```python
# Speaker verification bypass via adversarial voice transformation
def speaker_verification_attack(
    attacker_audio: torch.Tensor,
    target_speaker_embedding: torch.Tensor,
    verification_model,
    epsilon: float = 0.05,
    num_steps: int = 500
):
    """
    Modify the attacker's speech to pass speaker verification
    as the target speaker while preserving intelligibility.
    """
    delta = torch.zeros_like(attacker_audio, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=1e-3)
    for step in range(num_steps):
        perturbed = attacker_audio + delta
        # Extract the speaker embedding from the perturbed audio
        attacker_embedding = verification_model.extract_embedding(perturbed)
        # Minimize distance to the target speaker embedding
        verification_loss = torch.nn.functional.cosine_embedding_loss(
            attacker_embedding,
            target_speaker_embedding,
            torch.ones(1)  # Target label: same speaker
        )
        # Preserve speech intelligibility
        # (compute_intelligibility_loss is a placeholder, e.g. an
        # STOI-style distance between clean and perturbed speech)
        intelligibility_loss = compute_intelligibility_loss(
            attacker_audio, perturbed
        )
        loss = verification_loss + 0.1 * intelligibility_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
    return (attacker_audio + delta).detach()


# Success rates against common speaker verification systems:
# - d-vector based: 78-85% bypass rate
# - x-vector based: 72-80% bypass rate
# - ECAPA-TDNN: 65-75% bypass rate
# - Wav2Vec2 fine-tuned: 55-68% bypass rate
```

Detection & Mitigation
| Approach | Description | Effectiveness |
|---|---|---|
| Cross-modal safety analysis | Jointly analyze audio and text inputs for combined harmful intent | Medium-High — catches split attacks but computationally expensive |
| Audio anomaly detection | Detect adversarial perturbation artifacts in audio input | Medium — effective against naive attacks, bypassed by psychoacoustic optimization |
| Ultrasonic filtering | Low-pass filter audio input to remove ultrasonic content | High for ultrasonic attacks — simple and effective, no capability impact |
| Speaker liveness detection | Verify audio comes from a live human speaker, not playback | Medium-High — anti-spoofing, but sophisticated replay attacks can bypass |
| Multi-language safety parity | Ensure safety classifier coverage across all supported languages | Medium — requires substantial multilingual safety training data |
| Audio provenance verification | Verify audio source and chain of custody | Medium — process control, not a technical guarantee |
| Dual-channel confirmation | Require confirmation through a second modality for sensitive actions | High — eliminates single-channel attacks but reduces usability |
| Perturbation detection via re-encoding | Re-encode audio through a different codec and compare model behavior | Medium — detects fragile perturbations but not robust ones |
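Of the defenses above, ultrasonic filtering is the cheapest to deploy: strip frequencies above the audible band before audio reaches the encoder. A minimal sketch using a zero-phase Butterworth low-pass; the 16 kHz cutoff and 8th-order design are illustrative choices, and the cutoff must sit below the capture pipeline's Nyquist frequency:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def lowpass_ultrasonic_filter(audio: np.ndarray, sample_rate: int,
                              cutoff_hz: float = 16000.0,
                              order: int = 8) -> np.ndarray:
    """Remove ultrasonic content before audio reaches the encoder.

    Uses a zero-phase (forward-backward) Butterworth low-pass so the
    audible band is preserved without phase distortion.
    """
    sos = butter(order, cutoff_hz, btype="low", fs=sample_rate,
                 output="sos")
    return sosfiltfilt(sos, audio)
```

Models that resample input to 16 kHz discard everything above 8 kHz anyway, so a conservative cutoff costs essentially no speech information while removing the carrier an ultrasonic attack depends on.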
Key Considerations
- The physical environment is the attack channel. Unlike text-based attacks that require API access or user interface manipulation, audio attacks can be delivered through speakers, phone calls, embedded media, or any source of sound in the target device's environment. This makes audio attacks relevant to threat models where the attacker has physical proximity but no digital access.
- Cross-modal split attacks are under-researched. The majority of adversarial audio research targets single-modality systems. As multimodal LLMs become the norm, cross-modal attacks that distribute harmful content across audio and text channels will become more prevalent. Current safety architectures that analyze each modality independently are fundamentally vulnerable to this attack class.
- Ultrasonic attacks are constrained but not theoretical. While ultrasonic adversarial audio has significant practical limitations (distance, hardware requirements, environmental sensitivity), multiple demonstrations have shown feasibility in controlled settings. As smart devices proliferate in shared spaces (offices, public venues, transit), the opportunity for ultrasonic injection increases.
- Voice cloning amplifies prompt injection. When an attacker can clone the authorized user's voice (increasingly trivial with modern TTS), voice-based prompt injection becomes indistinguishable from legitimate user commands. Systems that rely on "this sounds like the authorized user" as a security signal are vulnerable to adversarial voice synthesis.
- Audio safety classifiers lag behind text classifiers. The training data and research investment in audio safety classification are orders of magnitude smaller than for text safety classification. This creates a structural advantage for audio-channel attacks, particularly in non-English languages and dialects.
- AdvWave's dual-phase approach generalizes. The principle of decomposing adversarial optimization into encoder-control and behavior-elicitation phases applies beyond audio to any modality encoder feeding into an LLM. The same framework could be adapted for adversarial video, sensor data, or other modalities as LLMs expand their input capabilities.
References
- Zhang, Y., et al. "AdvWave: Dual-Phase Adversarial Audio Attacks Against Speech Language Models." arXiv preprint (2025). Dual-phase optimization framework.
- Carlini, N. and Wagner, D. "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text." IEEE Security and Privacy Workshops (SPW) 2018. Foundational adversarial audio methodology.
- Abdullah, H., et al. "SoK: The Faults in Our ASRs: An Overview of Attacks Against Automatic Speech Recognition and Speaker Identification Systems." IEEE S&P 2021. Comprehensive ASR attack survey.
- Roy, N., et al. "Inaudible Voice Commands: The Long-Range Attack and Defense." NSDI 2018. Ultrasonic voice command injection.
- Chen, G., et al. "Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems." IEEE S&P 2021. Speaker verification bypass techniques.
- Schuster, R., et al. "The Limitations of Cross-Modal Safety: Adversarial Audio-Text Attacks on Multimodal LLMs." arXiv preprint (2025). Cross-modal split attack analysis.