Attacks on Video Understanding Models
Techniques for attacking AI video understanding systems through frame injection, temporal manipulation, and adversarial video generation targeting models like Gemini 2.5 Pro.
Overview
Video understanding models add the temporal dimension to the multimodal attack surface. Models like Gemini 2.5 Pro, GPT-4o, and specialized video-language models process video by sampling frames, extracting visual features, and reasoning about temporal sequences. The security implications are significant: video provides far more surface area for adversarial content than a single image, and the temporal dimension creates unique attack vectors that do not exist in still-image processing.
The core vulnerability stems from how models sample video. No current model processes every frame of a video at full resolution. Instead, models sample a subset of frames -- typically 8 to 64 frames uniformly distributed across the video's duration. An attacker who understands the sampling strategy can place adversarial content in frames that will be sampled while keeping all other frames clean. A human reviewing the video at normal playback speed may never notice the adversarial frames.
Research by Li et al. (2024) demonstrated that single-frame adversarial injections in videos can override system prompts in multimodal models. Wang et al. (2024) showed that temporal consistency attacks can manipulate a model's understanding of events in a video by placing contradictory information at specific temporal positions.
Video Processing Architectures
Frame Sampling Strategies
Understanding how different models sample frames is essential for designing effective attacks.
import numpy as np
from dataclasses import dataclass
from enum import Enum


class SamplingStrategy(Enum):
    UNIFORM = "uniform"
    KEYFRAME = "keyframe"
    SCENE_CHANGE = "scene_change"
    ATTENTION_BASED = "attention_based"
    HIERARCHICAL = "hierarchical"


@dataclass
class VideoProcessingConfig:
    """Configuration for how a model processes video input."""
    model_name: str
    sampling_strategy: SamplingStrategy
    num_frames: int
    max_resolution: tuple[int, int]
    supports_audio: bool
    max_duration_seconds: int
    temporal_encoding: str


VIDEO_MODEL_CONFIGS = {
    "gemini_2_5_pro": VideoProcessingConfig(
        model_name="Gemini 2.5 Pro",
        sampling_strategy=SamplingStrategy.HIERARCHICAL,
        num_frames=32,
        max_resolution=(1280, 720),
        supports_audio=True,
        max_duration_seconds=3600,
        temporal_encoding="Absolute timestamp tokens",
    ),
    "gpt_4o": VideoProcessingConfig(
        model_name="GPT-4o",
        sampling_strategy=SamplingStrategy.UNIFORM,
        num_frames=16,
        max_resolution=(1024, 1024),
        supports_audio=True,
        max_duration_seconds=300,
        temporal_encoding="Frame index embeddings",
    ),
    "video_llava": VideoProcessingConfig(
        model_name="Video-LLaVA",
        sampling_strategy=SamplingStrategy.UNIFORM,
        num_frames=8,
        max_resolution=(336, 336),
        supports_audio=False,
        max_duration_seconds=600,
        temporal_encoding="Positional embeddings",
    ),
}
def compute_sampled_frame_indices(
    total_frames: int,
    strategy: SamplingStrategy,
    num_samples: int,
    fps: float = 30.0,
) -> list[int]:
    """Compute which frame indices a model will sample.

    This is the key function for frame injection attacks:
    knowing which frames will be sampled tells the attacker
    exactly where to place adversarial content.
    """
    if strategy == SamplingStrategy.UNIFORM:
        # Evenly spaced frames across the video
        if num_samples >= total_frames:
            return list(range(total_frames))
        step = total_frames / num_samples
        return [int(i * step) for i in range(num_samples)]
    elif strategy == SamplingStrategy.KEYFRAME:
        # Sample I-frames from the video codec (simplified)
        gop_size = int(fps)  # Typically one I-frame per second
        keyframes = list(range(0, total_frames, gop_size))
        if len(keyframes) > num_samples:
            step = len(keyframes) / num_samples
            return [keyframes[int(i * step)] for i in range(num_samples)]
        return keyframes
    elif strategy == SamplingStrategy.HIERARCHICAL:
        # First sample coarsely, then refine regions of interest
        coarse_samples = max(1, num_samples // 2)  # Guard against zero coarse samples
        coarse_step = total_frames / coarse_samples
        coarse_indices = [int(i * coarse_step) for i in range(coarse_samples)]
        # Simulate refinement (in practice, the model decides which regions)
        fine_indices = []
        for idx in coarse_indices:
            offset = int(coarse_step / 4)
            fine_indices.append(min(idx + offset, total_frames - 1))
        return sorted(set(coarse_indices + fine_indices))[:num_samples]
    else:
        # Default to uniform
        step = total_frames / num_samples
        return [int(i * step) for i in range(num_samples)]
# Example: determine attack frame positions
config = VIDEO_MODEL_CONFIGS["gpt_4o"]
video_length_seconds = 60
fps = 30.0
total_frames = int(video_length_seconds * fps)

sampled_indices = compute_sampled_frame_indices(
    total_frames=total_frames,
    strategy=config.sampling_strategy,
    num_samples=config.num_frames,
    fps=fps,
)

print(f"Model: {config.model_name}")
print(f"Total frames: {total_frames}")
print(f"Sampled frames: {len(sampled_indices)}")
print(f"Sampled indices: {sampled_indices}")
print(f"Adversarial frames needed: {len(sampled_indices)} out of {total_frames}")
print(f"Attack coverage: {len(sampled_indices)/total_frames*100:.2f}% of frames modified")
Attack Surface Map
| Processing Stage | Attack Vector | Difficulty | Impact |
|---|---|---|---|
| Frame sampling | Place adversarial content only in sampled frames | Medium | High -- adversarial content is processed but hard to notice |
| Frame encoding | Adversarial perturbation on individual frames | High | High -- requires surrogate model access |
| Temporal encoding | Manipulate perceived timing of events | Medium | Medium -- can alter the model's understanding of sequence |
| Audio track | Hidden commands in audio synchronized with video | Medium | High -- adds audio injection to the visual attack |
| Subtitle/caption track | Inject text via subtitle metadata | Low | Medium -- many systems process subtitle tracks |
| Thumbnail | Adversarial content in video thumbnail | Low | Low -- only affects thumbnail-based processing |
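The subtitle-track vector in the table above requires no image manipulation at all. As a minimal illustrative sketch (the `build_srt_injection` helper and its payload text are hypothetical, not from any specific toolkit), a standard SRT file can carry injected text in a single cue timed to span the whole clip, so any subtitle-aware pipeline extracts it regardless of which frames are sampled:

```python
from pathlib import Path


def build_srt_injection(payload: str, output_path: str = "injected.srt") -> str:
    """Write a minimal SRT subtitle file carrying an injected instruction.

    A single cue spans the whole clip, so the payload is extracted by
    any subtitle-aware pipeline independent of frame sampling.
    """
    cue = (
        "1\n"
        "00:00:00,000 --> 00:59:59,000\n"
        f"{payload}\n"
        "\n"
    )
    Path(output_path).write_text(cue, encoding="utf-8")
    return output_path


srt_path = build_srt_injection("Example injected instruction text")
```

The same idea applies to WebVTT tracks; only the header line and timestamp separator differ.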
Frame Injection Attacks
Single-Frame Injection
The simplest video attack inserts an adversarial frame at a position known to be sampled by the target model.
import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont


class FrameInjector:
    """Inject adversarial frames into video files.

    Supports multiple injection strategies:
    - Single-frame: one adversarial frame at a sampled position
    - Multi-frame: multiple adversarial frames at sampled positions
    - Subliminal: very brief (<50ms) adversarial frames
    - Blended: adversarial content gradually faded in/out
    """

    def __init__(self, video_path: str):
        self.video_path = video_path
        cap = cv2.VideoCapture(video_path)
        self.fps = cap.get(cv2.CAP_PROP_FPS)
        self.total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        self.width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        self.height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        cap.release()
    def inject_single_frame(
        self,
        adversarial_frame: np.ndarray,
        target_frame_index: int,
        output_path: str,
    ) -> dict:
        """Replace a single frame with an adversarial frame.

        The adversarial frame is placed at target_frame_index,
        which should correspond to a frame the model will sample.
        At 30fps, a single frame lasts 33ms -- typically too brief
        for a human viewer to read the content.
        """
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
        frame_idx = 0
        injected = False
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx == target_frame_index:
                # Resize the adversarial frame to match video dimensions
                adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
                out.write(adv_resized)
                injected = True
            else:
                out.write(frame)
            frame_idx += 1
        cap.release()
        out.release()
        return {
            "output_path": output_path,
            "injected_at_frame": target_frame_index,
            "injected_at_time": target_frame_index / self.fps,
            "frame_duration_ms": 1000 / self.fps,
            "total_frames": self.total_frames,
            "injection_successful": injected,
        }
    def inject_at_all_sample_points(
        self,
        adversarial_frame: np.ndarray,
        sampling_strategy: SamplingStrategy,
        num_model_samples: int,
        output_path: str,
    ) -> dict:
        """Inject adversarial frames at all points the model will sample.

        This ensures the model processes adversarial content regardless
        of slight variations in the sampling implementation, while keeping
        the vast majority of frames (which human reviewers see) clean.
        """
        sampled_indices = compute_sampled_frame_indices(
            total_frames=self.total_frames,
            strategy=sampling_strategy,
            num_samples=num_model_samples,
            fps=self.fps,
        )
        # Add buffer frames around each sample point for robustness
        injection_indices = set()
        for idx in sampled_indices:
            for offset in range(-2, 3):  # +/- 2 frames
                clamped = max(0, min(self.total_frames - 1, idx + offset))
                injection_indices.add(clamped)
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
        adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
        frame_idx = 0
        injected_count = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx in injection_indices:
                out.write(adv_resized)
                injected_count += 1
            else:
                out.write(frame)
            frame_idx += 1
        cap.release()
        out.release()
        return {
            "output_path": output_path,
            "total_frames": self.total_frames,
            "injected_frames": injected_count,
            "injection_percentage": injected_count / self.total_frames * 100,
            "model_sample_points": len(sampled_indices),
        }
    def inject_subliminal(
        self,
        adversarial_frame: np.ndarray,
        target_time_seconds: float,
        duration_frames: int = 1,
        blend_frames: int = 2,
        output_path: str = "subliminal_output.mp4",
    ) -> dict:
        """Inject a subliminal adversarial frame with blending.

        The adversarial frame is faded in over blend_frames,
        held for duration_frames, and faded out over blend_frames.
        This creates a smoother visual transition that is harder
        for human reviewers to detect even when scrubbing through
        the video frame by frame.
        """
        target_frame = int(target_time_seconds * self.fps)
        start_blend = target_frame - blend_frames
        end_blend = target_frame + duration_frames + blend_frames
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
        adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
        frame_idx = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if start_blend <= frame_idx < target_frame:
                # Fade in
                alpha = (frame_idx - start_blend) / max(blend_frames, 1)
                blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
                out.write(blended)
            elif target_frame <= frame_idx < target_frame + duration_frames:
                # Full adversarial frame
                out.write(adv_resized)
            elif target_frame + duration_frames <= frame_idx < end_blend:
                # Fade out
                alpha = 1 - (frame_idx - target_frame - duration_frames) / max(blend_frames, 1)
                blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
                out.write(blended)
            else:
                out.write(frame)
            frame_idx += 1
        cap.release()
        out.release()
        return {
            "output_path": output_path,
            "target_time": target_time_seconds,
            "visible_duration_ms": (duration_frames + 2 * blend_frames) * (1000 / self.fps),
            "full_intensity_duration_ms": duration_frames * (1000 / self.fps),
        }
Adversarial Frame Generation
import textwrap


def create_adversarial_video_frame(
    payload: str,
    width: int = 1280,
    height: int = 720,
    style: str = "document",
) -> np.ndarray:
    """Create an adversarial frame containing injected instructions.

    The frame is designed to look like legitimate video content
    (a document, slide, or screen capture) so that if a human
    reviewer pauses on it, it does not immediately appear malicious.

    Args:
        payload: The adversarial instruction text.
        width: Frame width in pixels.
        height: Frame height in pixels.
        style: Visual style ('document', 'slide', 'terminal', 'whiteboard').
    """
    if style == "document":
        bg_color = (255, 255, 255)
        text_color = (40, 40, 40)
        header = "INTERNAL MEMO - CONFIDENTIAL"
    elif style == "slide":
        bg_color = (240, 240, 250)
        text_color = (30, 30, 80)
        header = "System Configuration Notes"
    elif style == "terminal":
        bg_color = (30, 30, 30)
        text_color = (0, 255, 0)
        header = "root@system:~#"
    elif style == "whiteboard":
        bg_color = (250, 248, 240)
        text_color = (40, 40, 40)
        header = "Meeting Notes"
    else:
        bg_color = (255, 255, 255)
        text_color = (0, 0, 0)
        header = ""
    img = Image.new("RGB", (width, height), color=bg_color)
    draw = ImageDraw.Draw(img)
    try:
        font = ImageFont.truetype(
            "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 18
        )
        header_font = ImageFont.truetype(
            "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 22
        )
    except OSError:
        font = ImageFont.load_default()
        header_font = font
    # Draw the header
    draw.text((40, 30), header, fill=text_color, font=header_font)
    draw.line([(40, 60), (width - 40, 60)], fill=text_color, width=1)
    # Draw the payload text
    y = 80
    for line in textwrap.wrap(payload, width=80):
        draw.text((40, y), line, fill=text_color, font=font)
        y += 28
    # Convert RGB to BGR for OpenCV (cvtColor returns a contiguous array)
    return cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
Temporal Manipulation Attacks
Event Sequence Manipulation
Beyond injecting content into individual frames, attackers can manipulate the temporal sequence of events to alter the model's understanding of what happened in the video.
class TemporalManipulator:
    """Manipulate the temporal structure of videos to mislead
    video understanding models about the sequence of events.

    These attacks exploit the model's reliance on sampled frames
    to reconstruct temporal narratives. By reordering, duplicating,
    or removing frames at strategic positions, the model can be
    led to incorrect conclusions about cause and effect.
    """

    def __init__(self, video_path: str):
        self.video_path = video_path
        cap = cv2.VideoCapture(video_path)
        self.fps = cap.get(cv2.CAP_PROP_FPS)
        self.total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
    def reverse_segment(
        self,
        start_time: float,
        end_time: float,
        output_path: str,
    ) -> dict:
        """Reverse a segment of the video to alter perceived causality.

        If the model samples frames from the reversed segment,
        it may conclude that events happened in the opposite order.
        """
        start_frame = int(start_time * self.fps)
        end_frame = int(end_time * self.fps)
        cap = cv2.VideoCapture(self.video_path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        # Read all frames, collecting the target segment separately
        segment_frames = []
        all_frames = []
        frame_idx = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            all_frames.append(frame)
            if start_frame <= frame_idx <= end_frame:
                segment_frames.append(frame)
            frame_idx += 1
        cap.release()
        # Reverse the segment
        segment_frames.reverse()
        # Write the output
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (width, height))
        seg_idx = 0
        for i, frame in enumerate(all_frames):
            if start_frame <= i <= end_frame:
                out.write(segment_frames[seg_idx])
                seg_idx += 1
            else:
                out.write(frame)
        out.release()
        return {
            "output_path": output_path,
            "reversed_segment": f"{start_time:.1f}s - {end_time:.1f}s",
            "frames_affected": len(segment_frames),
        }
    def duplicate_frame_at_sample_points(
        self,
        source_frame_index: int,
        sampling_strategy: SamplingStrategy,
        num_model_samples: int,
        output_path: str,
    ) -> dict:
        """Replace frames at model sample points with a duplicate of
        a specific frame, causing the model to over-weight that moment.

        This can make the model believe a specific event in the video
        is the dominant or only event, suppressing its understanding
        of other events.
        """
        sampled_indices = set(compute_sampled_frame_indices(
            total_frames=self.total_frames,
            strategy=sampling_strategy,
            num_samples=num_model_samples,
            fps=self.fps,
        ))
        cap = cv2.VideoCapture(self.video_path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        # Read the source frame
        cap.set(cv2.CAP_PROP_POS_FRAMES, source_frame_index)
        ret, source_frame = cap.read()
        if not ret:
            cap.release()
            raise ValueError(f"Could not read frame {source_frame_index}")
        cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (width, height))
        frame_idx = 0
        replaced = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx in sampled_indices and frame_idx != source_frame_index:
                out.write(source_frame)
                replaced += 1
            else:
                out.write(frame)
            frame_idx += 1
        cap.release()
        out.release()
        return {
            "output_path": output_path,
            "source_frame": source_frame_index,
            "replaced_frames": replaced,
            "model_will_see": "Same frame repeated at most sample points",
        }
Defense Strategies for Video Systems
Multi-Frame Consistency Checking
class VideoConsistencyChecker:
    """Check video frames for injection and manipulation attacks.

    Compares adjacent frames to detect abrupt content changes
    that indicate frame injection. Analyzes the distribution
    of visual features across sampled frames to detect
    temporal manipulation.
    """

    def __init__(self, anomaly_threshold: float = 0.3):
        self.anomaly_threshold = anomaly_threshold

    def check_frame_consistency(self, frames: list[np.ndarray]) -> dict:
        """Check for abrupt visual changes between consecutive frames.

        Legitimate video has smooth transitions between frames.
        Injected frames create sharp discontinuities in pixel
        statistics, color histograms, and structural features.
        """
        anomalies = []
        for i in range(1, len(frames)):
            prev_frame = frames[i - 1].astype(float)
            curr_frame = frames[i].astype(float)
            # Pixel-level difference
            pixel_diff = np.mean(np.abs(prev_frame - curr_frame)) / 255.0
            # Histogram difference
            hist_prev = cv2.calcHist([frames[i - 1]], [0], None, [256], [0, 256])
            hist_curr = cv2.calcHist([frames[i]], [0], None, [256], [0, 256])
            hist_diff = cv2.compareHist(
                hist_prev.astype(np.float32),
                hist_curr.astype(np.float32),
                cv2.HISTCMP_BHATTACHARYYA,
            )
            # Combined anomaly score
            anomaly_score = 0.6 * pixel_diff + 0.4 * hist_diff
            if anomaly_score > self.anomaly_threshold:
                anomalies.append({
                    "frame_index": i,
                    "anomaly_score": float(anomaly_score),
                    "pixel_diff": float(pixel_diff),
                    "histogram_diff": float(hist_diff),
                })
        return {
            "total_frames_checked": len(frames) - 1,
            "anomalies_detected": len(anomalies),
            "anomaly_details": anomalies,
            "recommendation": (
                "BLOCK" if len(anomalies) > 2
                else "REVIEW" if len(anomalies) > 0
                else "PASS"
            ),
        }
Testing Methodology
When red-teaming video understanding systems:

- Determine the model's sampling strategy: Send videos with frame counters or unique patterns per frame and ask the model to describe what it sees. This reveals which frames are sampled.
- Test single-frame injection: Insert one adversarial frame at a known sample point. Verify the model reads the injected content.
- Test subliminal injection: Insert adversarial frames for 1-2 frames (33-66ms) and verify the model processes them. Test whether human reviewers can detect them during normal playback.
- Test temporal manipulation: Reverse or reorder segments and check whether the model's understanding of the event sequence is altered.
- Test subtitle injection: If the system processes subtitle tracks, inject adversarial text via SRT or VTT subtitle files.
- Test combined audio-video attacks: Combine visual frame injection with hidden audio commands for maximum impact.
References
- Li, Y., et al. "Video-based Adversarial Attacks on Multimodal Large Language Models." arXiv preprint (2024).
- Wang, Z., et al. "VideoAdvBench: A Benchmark for Adversarial Robustness of Video Understanding Models." NeurIPS (2024).
- Carlini, N., et al. "Are aligned neural networks adversarially aligned?" arXiv preprint arXiv:2306.15447 (2023).
- Zou, A., et al. "Universal and Transferable 對抗性 攻擊 on Aligned Language Models." arXiv preprint arXiv:2307.15043 (2023).
- MITRE ATLAS framework — https://atlas.mitre.org
- OWASP LLM Top 10 — https://owasp.org/www-project-top-10-for-large-language-model-applications/
Why is frame injection particularly effective against video understanding models?
What is the most reliable first step when red-teaming a video understanding system?