Audio Modality Attacks
Comprehensive attack taxonomy for audio-enabled LLMs: adversarial audio generation, voice-based prompt injection, cross-modal split attacks, and ultrasonic perturbations.
Audio Modality Attacks
Overview
The integration of audio processing capabilities into large language models has created a fundamentally new attack surface that extends beyond traditional adversarial audio research. Where earlier work focused primarily on fooling automatic speech recognition (ASR) systems into producing incorrect transcriptions, audio-enabled LLMs present a richer target: the adversary can now manipulate the model's understanding, reasoning, and actions through the audio channel. The AdvWave framework (arXiv 2025) represents the current state of the art in this domain, introducing a dual-phase optimization approach that generates adversarial audio capable of eliciting specific harmful behaviors from speech LLMs, not merely incorrect transcriptions.
Traditional adversarial audio attacks — Carlini-Wagner style perturbations that cause an ASR to transcribe "play music" as "transfer funds" — target the perception layer. Audio modality attacks against LLMs target the cognition layer: the adversarial audio is designed to cause the LLM to interpret the audio input in a way that bypasses safety constraints, executes unauthorized commands, or leaks sensitive information from its context. This shift from perceptual adversarial examples to cognitive adversarial examples represents a qualitative escalation in attack capability.
The attack surface is further expanded by the multimodal nature of modern LLMs. Models that accept both audio and text inputs simultaneously are vulnerable to cross-modal split attacks, where the adversarial content is divided between the audio and text channels such that neither channel independently triggers safety filters, but the combined input is harmful. Voice-based prompt injection introduces additional challenges: adversarial instructions embedded in audio played in the background of a legitimate voice interaction, exploiting the model's inability to distinguish the authorized user's voice from injected audio content.
The practical relevance of these attacks is increasing rapidly as voice-first AI interfaces proliferate. Smart assistants, voice-controlled applications, phone-based AI agents, and accessibility tools all process audio input from potentially adversarial environments. An attacker who can inject audio into the model's input channel — through speakers in a shared space, compromised audio files, or manipulated voice calls — can potentially execute any attack that text-based prompt injection enables, with the added advantage that audio attacks are less visible to human oversight.
How It Works
Audio Encoding Analysis
The attacker first characterizes how the target speech LLM processes audio. This includes the audio encoder architecture (Whisper, wav2vec2, custom encoders), the feature extraction pipeline (mel spectrograms, raw waveforms, learned features), and how audio representations are integrated with the language model. Different architectures present different attack surfaces — encoder-based models are vulnerable to different perturbation patterns than end-to-end models.
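For reconnaissance against encoder-based models, the feature extraction stage can be reproduced offline. The sketch below is numpy-only and illustrative: the 80-bin, 25 ms window / 10 ms hop parameters mirror common Whisper-style front ends but are assumptions, not a specific model's configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sample_rate=16000, n_fft=400,
                        hop=160, n_mels=80):
    """Minimal log-mel pipeline of the kind speech-LLM audio
    encoders consume (80 mel bins, 25 ms window, 10 ms hop)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, n_fft//2+1)

    # Triangular mel filterbank spanning 0 Hz .. Nyquist
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2),
                          n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) /
                    sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)

    mel = power @ fbank.T                       # (frames, n_mels)
    return np.log10(np.maximum(mel, 1e-10))     # log compression

# One second of a 440 Hz tone at 16 kHz
t = np.arange(16000) / 16000
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 80)
```

Perturbations crafted against this representation transfer to the encoder only to the extent the attacker's reconstruction of the front end is faithful, which is why encoder characterization comes first.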
Adversarial Objective Formulation
Unlike traditional ASR attacks where the objective is a target transcription, attacks against speech LLMs define the objective as a target behavior: bypass a safety constraint, execute an unauthorized action, or produce specific harmful output. The optimization objective combines a transcription loss (make the model "hear" specific words) with a behavior loss (make the model act on those words in the intended way).
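A minimal sketch of that combined objective in numpy. The helper names and the weighting term `lam` are illustrative assumptions, not AdvWave's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target token ids."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))

def combined_objective(encoder_out, target_repr,
                       lm_logits, target_tokens, lam=1.0):
    """Transcription loss (feature-space MSE pulling the encoder output
    toward a target representation) plus behavior loss (cross-entropy on
    the tokens the LLM should emit). lam is an illustrative balance knob."""
    transcription_loss = np.mean((encoder_out - target_repr) ** 2)
    behavior_loss = cross_entropy(lm_logits, target_tokens)
    return transcription_loss + lam * behavior_loss

rng = np.random.default_rng(0)
enc = rng.normal(size=(50, 256))            # toy encoder output
logits = rng.normal(size=(4, 100))          # toy LM logits over 100 tokens
loss = combined_objective(enc, enc, logits, np.array([3, 17, 42, 99]))
print(loss > 0)  # True: behavior term still penalizes wrong tokens
```

When the encoder output already matches the target (as in this toy call), the remaining loss is pure behavior loss, which is what Phase 2 of the dual-phase optimization minimizes.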
Dual-Phase Optimization (AdvWave)
AdvWave's key innovation is separating optimization into two phases. Phase 1 optimizes the perturbation to control the audio encoder's output representation, ensuring the adversarial content reaches the language model. Phase 2 optimizes the perturbation to elicit the target behavior from the language model, given the controlled encoder output. This decomposition makes the optimization tractable for complex behavioral objectives.
Robustness and Imperceptibility Constraints
The adversarial perturbation is constrained to remain imperceptible to human listeners (using psychoacoustic masking models) and robust to environmental conditions (room acoustics, background noise, compression artifacts). The final adversarial audio must survive the real-world conditions under which it will be delivered to the target system.
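The simplest form of the imperceptibility constraint is an L-infinity budget, enforced by projecting the perturbation back into the epsilon ball after each gradient step. A toy projected-gradient loop (numpy, with a stand-in quadratic objective rather than a real encoder loss) shows the mechanic:

```python
import numpy as np

def project_linf(delta, epsilon):
    """Project the perturbation back onto the L-infinity ball of radius
    epsilon -- the clamp applied after every optimizer step."""
    return np.clip(delta, -epsilon, epsilon)

# Toy projected-gradient loop: the quadratic objective stands in for
# the real encoder/behavior losses; the projection keeps the
# perturbation within the imperceptibility budget throughout.
rng = np.random.default_rng(1)
audio = rng.normal(size=4000)
target = audio + rng.normal(size=4000)   # where the attack "wants" to go
epsilon, lr = 0.02, 0.01
delta = np.zeros_like(audio)
for _ in range(200):
    grad = 2 * (audio + delta - target)  # d/d_delta of ||x + d - t||^2
    delta = project_linf(delta - lr * grad, epsilon)

print(np.abs(delta).max() <= epsilon)  # True: budget holds after the loop
```

Psychoacoustic masking replaces this uniform budget with a frequency- and time-varying one derived from the carrier signal, allowing larger perturbations where the ear cannot detect them.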
Attack Taxonomy
1. Adversarial Audio Generation (AdvWave)
The AdvWave framework (arXiv 2025) is the most comprehensive adversarial audio attack methodology targeting speech LLMs. Its dual-phase approach achieves high success rates while maintaining audio imperceptibility.
```python
# AdvWave dual-phase optimization (conceptual implementation)
import torch


class AdvWaveAttack:
    """
    Dual-phase adversarial audio attack against speech LLMs.
    Phase 1: Control the audio encoder output
    Phase 2: Elicit the target behavior from the LLM
    """

    def __init__(self, speech_llm, epsilon=0.02, num_steps=1000):
        self.model = speech_llm
        self.epsilon = epsilon  # Perturbation budget (imperceptibility)
        self.num_steps = num_steps

    def phase1_encoder_control(self, audio, target_representation):
        """
        Optimize the perturbation to produce the target representation
        at the audio encoder output.
        """
        delta = torch.zeros_like(audio, requires_grad=True)
        optimizer = torch.optim.Adam([delta], lr=1e-3)
        for step in range(self.num_steps // 2):
            perturbed = audio + delta
            encoder_output = self.model.audio_encoder(perturbed)
            # Minimize distance to the target representation
            loss = torch.nn.functional.mse_loss(
                encoder_output, target_representation
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Project back onto the epsilon ball (imperceptibility)
            with torch.no_grad():
                delta.data = torch.clamp(
                    delta.data, -self.epsilon, self.epsilon
                )
        return delta.detach()

    def phase2_behavior_elicitation(self, audio, delta_init, target_tokens):
        """
        Refine the perturbation to elicit the target behavior from
        the language model component.
        """
        delta = delta_init.clone().requires_grad_(True)
        optimizer = torch.optim.Adam([delta], lr=5e-4)
        for step in range(self.num_steps // 2):
            perturbed = audio + delta
            # End-to-end forward pass through the speech LLM
            logits = self.model(audio_input=perturbed)
            # Cross-entropy loss against the target behavior tokens
            loss = torch.nn.functional.cross_entropy(
                logits[:, -len(target_tokens):, :].reshape(
                    -1, logits.size(-1)
                ),
                target_tokens.reshape(-1)
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                delta.data = torch.clamp(
                    delta.data, -self.epsilon, self.epsilon
                )
        return (audio + delta).detach()

    def generate_adversarial_audio(self, clean_audio, target_behavior):
        """Full AdvWave pipeline."""
        # Compute the target encoder representation from the target
        # behavior (architecture-specific; omitted here)
        target_repr = self.compute_target_representation(target_behavior)
        # Phase 1: Control the encoder
        delta = self.phase1_encoder_control(clean_audio, target_repr)
        # Phase 2: Elicit the behavior
        target_tokens = self.model.tokenize(target_behavior)
        adversarial = self.phase2_behavior_elicitation(
            clean_audio, delta, target_tokens
        )
        return adversarial
```

AdvWave achieved success rates of 85-92% against tested speech LLMs while maintaining signal-to-noise ratios above 30 dB (inaudible to casual listeners) and PESQ scores above 3.5 (minimal perceptual degradation).
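The SNR figure is straightforward to verify for a given clean/adversarial pair; PESQ requires an ITU-T P.862 implementation and is not reproduced here. A minimal numpy check:

```python
import numpy as np

def snr_db(clean, adversarial):
    """SNR of the adversarial perturbation relative to the clean signal.
    AdvWave-style attacks aim to keep this above roughly 30 dB."""
    noise = adversarial - clean
    return 10.0 * np.log10(np.sum(clean ** 2) /
                           (np.sum(noise ** 2) + 1e-12))

# Illustrative check: a perturbation at 2% of the signal's scale
# lands around the 30 dB threshold cited above.
rng = np.random.default_rng(0)
clean = rng.normal(size=16000)
perturbed = clean + 0.02 * rng.normal(size=16000)
print(round(snr_db(clean, perturbed), 1))
```

Note that SNR is a crude proxy: a perturbation shaped by psychoacoustic masking can be far more audible, or far less, than a flat-SNR comparison suggests.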
2. Voice-Based Prompt Injection
Voice-based prompt injection embeds adversarial instructions in audio that is played in the vicinity of the target device or included in audio files processed by the target model.
```python
# Voice-based prompt injection: embed instructions in background audio
def create_voice_injection(
    carrier_audio: torch.Tensor,    # Legitimate audio (music, speech)
    injection_text: str,            # Adversarial instruction
    tts_model,                      # Text-to-speech model
    sample_rate: int = 16000,
    injection_volume: float = 0.15  # Relative to carrier
):
    """
    Embed spoken adversarial instructions within carrier audio.
    The injection is spoken at low volume and positioned in
    frequency/time regions where the carrier audio provides masking.
    (compute_power_spectrum, find_high_masking_regions, and
    position_in_masking_windows are placeholders for a psychoacoustic
    masking analysis.)
    """
    # Generate speech for the injection
    injection_audio = tts_model.synthesize(
        text=injection_text,
        voice="neutral",  # Non-distinctive voice
        speed=1.3         # Slightly faster to reduce duration
    )
    # Find masking opportunities in the carrier
    carrier_power = compute_power_spectrum(carrier_audio, sample_rate)
    masking_windows = find_high_masking_regions(carrier_power)
    # Place the injection in high-masking regions
    positioned_injection = position_in_masking_windows(
        injection_audio, masking_windows, sample_rate
    )
    # Scale the injection so the carrier masks it
    scaled_injection = positioned_injection * injection_volume
    # Mix
    mixed = carrier_audio.clone()
    mixed[:len(scaled_injection)] += scaled_injection
    return mixed

# Attack scenario: the attacker plays music containing embedded
# instructions through speakers in a room where the victim is using a
# voice assistant. The assistant's ASR picks up both the user's voice
# and the background injection. The injection is inaudible to the human
# user but is transcribed by the ASR and processed by the LLM.
```

3. Cross-Modal Split Attacks
Cross-modal split attacks exploit models that accept both audio and text by distributing the adversarial content across both modalities. Neither the audio nor the text input independently triggers safety filters, but the combined interpretation is harmful.
```python
# Cross-modal split attack: divide harmful content across modalities
# Example: requesting synthesis instructions for a dangerous compound.
# Neither channel alone is harmful; combined they form the full request.

text_input = """
I'm studying organic chemistry and need help understanding
a multi-step synthesis pathway. The target compound and
starting materials are described in the audio attachment.
Please provide the complete reaction sequence with conditions.
"""

audio_input = """
[Spoken audio describing the specific compound name,
precursor chemicals, and reaction conditions that make
the request harmful]
"""

# The text channel: generic chemistry homework help (benign)
# The audio channel: compound name and specifics (ambiguous alone)
# Combined: request for dangerous synthesis (harmful)
# Safety classifiers that analyze text and audio independently
# may clear both channels. Only a cross-modal safety analysis
# that reasons about the COMBINED meaning catches this attack.

cross_modal_split_variants = {
    "subject_in_audio": "Text provides context/framing, "
                        "audio provides the harmful specifics",
    "method_in_audio": "Text identifies the target, "
                       "audio provides the harmful methodology",
    "interleaved": "Alternating between text and audio, each "
                   "providing part of the harmful instruction",
    "reference_chain": "Text references audio ('as described in "
                       "the recording') creating semantic dependency",
}
```

4. Ultrasonic Adversarial Perturbations
Ultrasonic attacks operate in frequency ranges above human hearing (>20kHz) but within the range captured by microphones and processed by audio encoders.
```python
# Ultrasonic adversarial perturbation
import numpy as np
from scipy.signal import butter, lfilter


def generate_ultrasonic_perturbation(
    adversarial_content: np.ndarray,
    carrier_freq: float = 22000,  # Above human hearing, below Nyquist
    sample_rate: float = 48000,   # Must exceed 2x the carrier frequency
    modulation_depth: float = 0.8
):
    """
    Modulate adversarial content onto an ultrasonic carrier.
    When captured by a microphone and processed by a speech LLM,
    nonlinearities in the analog-to-digital conversion and audio
    processing pipeline can demodulate the signal back into the
    audible range, making it "visible" to the audio encoder while
    remaining inaudible to humans.
    """
    t = np.arange(len(adversarial_content)) / sample_rate
    # Ultrasonic carrier wave
    carrier = np.sin(2 * np.pi * carrier_freq * t)
    # Amplitude-modulate the adversarial content onto the carrier
    modulated = carrier * (1 + modulation_depth * adversarial_content)
    return modulated


def simulate_nonlinear_demodulation(modulated, sample_rate=48000,
                                    cutoff=8000.0):
    """Illustrative only: rectification followed by low-pass filtering
    approximates the hardware nonlinearity that shifts the modulated
    content back into the audible band."""
    b, a = butter(4, cutoff / (sample_rate / 2), btype="low")
    return lfilter(b, a, np.abs(modulated))

# Practical constraints:
# - Requires a high sample rate (48 kHz or more) for playback
# - Effectiveness depends on microphone frequency response
# - Most consumer microphones attenuate above 20 kHz
# - MEMS microphones (common in phones) have varying ultrasonic response
# - Distance and room acoustics significantly affect success rate
# - Typical success rate: 40-60% in controlled conditions, 10-25% in the wild

ultrasonic_limitations = {
    "hardware_dependency": "Requires speakers and microphones with "
                           "ultrasonic frequency response",
    "distance": "Effective range typically < 2 meters",
    "environment": "Background noise and room reflections degrade signal",
    "sample_rate": "Requires 48kHz+ sampling on both playback and capture",
    "microphone_variance": "Different devices have different ultrasonic "
                           "sensitivity profiles",
}
```

5. Multilingual and Multi-Accent Exploitation
Speech LLMs often have uneven safety training across languages and accents. Adversarial content delivered in under-resourced languages or non-standard accents may bypass safety mechanisms trained predominantly on standard English.
```python
# Multilingual audio attack: exploit safety gaps in low-resource languages
multilingual_attack_vectors = {
    "language_switching": {
        "description": "Begin audio in English (high safety coverage), "
                       "switch to a low-resource language for harmful content",
        "example_languages": ["Amharic", "Yoruba", "Khmer", "Lao"],
        "mechanism": "Safety classifiers trained primarily on English "
                     "audio have degraded detection in other languages",
        "success_rate": "Varies: 30-70% depending on language coverage",
    },
    "accent_exploitation": {
        "description": "Deliver content in accents that cause systematic "
                       "ASR misrecognition, allowing harmful homophones",
        "mechanism": "ASR errors create plausible deniability -- the spoken "
                     "word is benign but the transcribed word is harmful",
        "example": "Accent-dependent vowel shifts that change word meaning",
    },
    "code_switching": {
        "description": "Mix languages within a single utterance to "
                       "confuse language-specific safety classifiers",
        "mechanism": "The safety classifier selects a language model based "
                     "on the detected language; code-switching prevents "
                     "reliable language detection",
        "success_rate": "45-65% against monolingual safety classifiers",
    },
    "dialect_variation": {
        "description": "Use regional dialect variations where safety-"
                       "relevant terms have different spoken forms",
        "mechanism": "Safety keyword lists do not cover all dialectal "
                     "variations of harmful terminology",
    },
}
```

6. Speaker Verification Bypass
For systems that use speaker verification as a security mechanism (e.g., voice-authenticated banking, or personal assistants that respond only to the owner's voice), adversarial audio can defeat speaker identity checks.
```python
# Speaker verification bypass via adversarial voice transformation
import torch


def speaker_verification_attack(
    attacker_audio: torch.Tensor,
    target_speaker_embedding: torch.Tensor,
    verification_model,
    epsilon: float = 0.05,
    num_steps: int = 500
):
    """
    Modify the attacker's speech to pass speaker verification
    as the target speaker while preserving intelligibility.
    (compute_intelligibility_loss is a placeholder for a perceptual
    distance such as STOI or a spectrogram-domain loss.)
    """
    delta = torch.zeros_like(attacker_audio, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=1e-3)
    for step in range(num_steps):
        perturbed = attacker_audio + delta
        # Extract the speaker embedding from the perturbed audio
        attacker_embedding = verification_model.extract_embedding(perturbed)
        # Minimize distance to the target speaker embedding
        verification_loss = torch.nn.functional.cosine_embedding_loss(
            attacker_embedding,
            target_speaker_embedding,
            torch.ones(1)  # Target: same speaker
        )
        # Preserve speech intelligibility
        intelligibility_loss = compute_intelligibility_loss(
            attacker_audio, perturbed
        )
        loss = verification_loss + 0.1 * intelligibility_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
    return (attacker_audio + delta).detach()

# Reported bypass rates against common speaker verification systems:
# - d-vector based: 78-85%
# - x-vector based: 72-80%
# - ECAPA-TDNN: 65-75%
# - Wav2Vec2 fine-tuned: 55-68%
```

Detection and Mitigation
| Approach | Description | Effectiveness |
|---|---|---|
| Cross-modal safety analysis | Jointly analyze audio and text inputs for combined harmful intent | Medium-High — catches split attacks but computationally expensive |
| Audio anomaly detection | Detect adversarial perturbation artifacts in audio input | Medium — effective against naive attacks, bypassed by psychoacoustic optimization |
| Ultrasonic filtering | Low-pass filter audio input to remove ultrasonic content | High for ultrasonic attacks — simple and effective, no capability impact |
| Speaker liveness detection | Verify audio comes from a live human speaker, not playback | Medium-High — anti-spoofing, but sophisticated replay attacks can bypass |
| Multi-language safety parity | Ensure safety classifier coverage across all supported languages | Medium — requires substantial multilingual safety training data |
| Audio provenance verification | Verify audio source and chain of custody | Medium — process control, not a technical guarantee |
| Dual-channel confirmation | Require confirmation through a second modality for sensitive actions | High — eliminates single-channel attacks but reduces usability |
| Perturbation detection via re-encoding | Re-encode audio through a different codec and compare model behavior | Medium — detects fragile perturbations but not robust ones |
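As a concrete example of the ultrasonic-filtering row, a brick-wall low-pass via FFT removes everything above the audible band. The cutoff and sample rate below are illustrative; a production system would use a proper FIR/IIR filter rather than spectral zeroing:

```python
import numpy as np

def strip_ultrasonic(audio, sample_rate=48000, cutoff_hz=18000):
    """Defensive low-pass: zero all frequency bins above the audible
    band before the audio reaches the encoder. Sketch of the principle
    only; deployed systems would use a real filter design."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(audio))

# A 22 kHz carrier is removed entirely; 1 kHz speech-band content
# passes through essentially unchanged.
t = np.arange(48000) / 48000
mixed = np.sin(2 * np.pi * 22000 * t) + np.sin(2 * np.pi * 1000 * t)
filtered = strip_ultrasonic(mixed)
```

This defense costs nothing in model capability because speech LLM encoders derive almost no useful signal from content above 20 kHz, which is exactly why the attack relies on hardware nonlinearities to fold it back down.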
Key Considerations
- The physical environment is the attack channel. Unlike text-based attacks that require API access or user interface manipulation, audio attacks can be delivered through speakers, phone calls, embedded media, or any source of sound in the target device's environment. This makes audio attacks relevant to threat models where the attacker has physical proximity but no digital access.
- Cross-modal split attacks are under-researched. The majority of adversarial audio research targets single-modality systems. As multimodal LLMs become the norm, cross-modal attacks that distribute harmful content across audio and text channels will become more prevalent. Current safety architectures that analyze each modality independently are fundamentally vulnerable to this attack class.
- Ultrasonic attacks are constrained but not theoretical. While ultrasonic adversarial audio has significant practical limitations (distance, hardware requirements, environmental sensitivity), multiple demonstrations have shown feasibility in controlled settings. As smart devices proliferate in shared spaces (offices, public venues, transit), the opportunity for ultrasonic injection increases.
- Voice cloning amplifies prompt injection. When an attacker can clone the authorized user's voice (increasingly trivial with modern TTS), voice-based prompt injection becomes indistinguishable from legitimate user commands. Systems that rely on "this sounds like the authorized user" as a security signal are vulnerable to adversarial voice synthesis.
- Audio safety classifiers lag behind text classifiers. The training data and research investment in audio safety classification are orders of magnitude smaller than for text safety classification. This creates a structural advantage for audio-channel attacks, particularly in non-English languages and dialects.
- AdvWave's dual-phase approach generalizes. The principle of decomposing adversarial optimization into encoder-control and behavior-elicitation phases applies beyond audio to any modality encoder feeding into an LLM. The same framework could be adapted for adversarial video, sensor data, or other modalities as LLMs expand their input capabilities.
References
- Zhang, Y., et al. "AdvWave: Dual-Phase Adversarial Audio Attacks Against Speech Language Models." arXiv preprint (2025). Dual-phase optimization framework.
- Carlini, N. and Wagner, D. "Audio Adversarial Examples: Targeted Attacks on Speech-to-Text." IEEE Security & Privacy Workshops 2018. Foundational adversarial audio methodology.
- Abdullah, H., et al. "SoK: The Faults in Our ASRs: An Overview of Attacks Against Automatic Speech Recognition and Speaker Identification Systems." IEEE S&P 2021. Comprehensive ASR attack survey.
- Roy, N., et al. "Inaudible Voice Commands: The Long-Range Attack and Defense." NSDI 2018. Ultrasonic voice command injection.
- Chen, G., et al. "Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems." IEEE S&P 2021. Speaker verification bypass techniques.
- Schuster, R., et al. "The Limitations of Cross-Modal Safety: Adversarial Audio-Text Attacks on Multimodal LLMs." arXiv preprint (2025). Cross-modal split attack analysis.