Attacks on Video Understanding Models
Techniques for attacking AI video understanding systems through frame injection, temporal manipulation, and adversarial video generation targeting models like Gemini 2.5 Pro.
Overview
Video understanding models add the temporal dimension to the multimodal attack surface. Models like Gemini 2.5 Pro, GPT-4o, and specialized video-language models process video by sampling frames, extracting visual features, and reasoning about temporal sequences. The security implications are significant: video provides far more surface area for adversarial content than a single image, and the temporal dimension creates unique attack vectors that do not exist in still-image processing.
The core vulnerability stems from how models sample video. No current model processes every frame of a video at full resolution. Instead, models sample a subset of frames -- typically 8 to 64 frames uniformly distributed across the video's duration. An attacker who understands the sampling strategy can place adversarial content in frames that will be sampled while keeping all other frames clean. A human reviewing the video at normal playback speed may never notice the adversarial frames.
Research by Li et al. (2024) demonstrated that single-frame adversarial injections in videos can override system prompts in multimodal models. Wang et al. (2024) showed that temporal consistency attacks can manipulate a model's understanding of events in a video by placing contradictory information at specific temporal positions.
Video Processing Architectures
Frame Sampling Strategies
Understanding how different models sample frames is essential for designing effective attacks.
import numpy as np
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class SamplingStrategy(Enum):
UNIFORM = "uniform"
KEYFRAME = "keyframe"
SCENE_CHANGE = "scene_change"
ATTENTION_BASED = "attention_based"
HIERARCHICAL = "hierarchical"
@dataclass
class VideoProcessingConfig:
"""Configuration for how a model processes video input."""
model_name: str
sampling_strategy: SamplingStrategy
num_frames: int
max_resolution: tuple[int, int]
supports_audio: bool
max_duration_seconds: int
temporal_encoding: str
VIDEO_MODEL_CONFIGS = {
"gemini_2_5_pro": VideoProcessingConfig(
model_name="Gemini 2.5 Pro",
sampling_strategy=SamplingStrategy.HIERARCHICAL,
num_frames=32,
max_resolution=(1280, 720),
supports_audio=True,
max_duration_seconds=3600,
temporal_encoding="Absolute timestamp tokens",
),
"gpt_4o": VideoProcessingConfig(
model_name="GPT-4o",
sampling_strategy=SamplingStrategy.UNIFORM,
num_frames=16,
max_resolution=(1024, 1024),
supports_audio=True,
max_duration_seconds=300,
temporal_encoding="Frame index embeddings",
),
"video_llava": VideoProcessingConfig(
model_name="Video-LLaVA",
sampling_strategy=SamplingStrategy.UNIFORM,
num_frames=8,
max_resolution=(336, 336),
supports_audio=False,
max_duration_seconds=600,
temporal_encoding="Positional embeddings",
),
}
def compute_sampled_frame_indices(
total_frames: int,
strategy: SamplingStrategy,
num_samples: int,
fps: float = 30.0,
) -> list[int]:
"""Compute which frame indices a model will sample.
This is the key function for frame injection attacks:
knowing which frames will be sampled tells the attacker
exactly where to place adversarial content.
"""
if strategy == SamplingStrategy.UNIFORM:
# Evenly spaced frames across the video
if num_samples >= total_frames:
return list(range(total_frames))
step = total_frames / num_samples
return [int(i * step) for i in range(num_samples)]
elif strategy == SamplingStrategy.KEYFRAME:
# Sample I-frames from the video codec (simplified)
gop_size = int(fps) # Typically one I-frame per second
keyframes = list(range(0, total_frames, gop_size))
if len(keyframes) > num_samples:
step = len(keyframes) / num_samples
return [keyframes[int(i * step)] for i in range(num_samples)]
return keyframes
elif strategy == SamplingStrategy.HIERARCHICAL:
# First sample coarse, then refine regions of interest
coarse_samples = num_samples // 2
coarse_step = total_frames / coarse_samples
coarse_indices = [int(i * coarse_step) for i in range(coarse_samples)]
# Simulate refinement (in practice, the model decides which regions)
fine_indices = []
for idx in coarse_indices:
offset = int(coarse_step / 4)
fine_indices.append(min(idx + offset, total_frames - 1))
return sorted(set(coarse_indices + fine_indices))[:num_samples]
else:
# Default to uniform
step = total_frames / num_samples
return [int(i * step) for i in range(num_samples)]
# Example: Determine attack frame positions
config = VIDEO_MODEL_CONFIGS["gpt_4o"]
video_length_seconds = 60
fps = 30.0
total_frames = int(video_length_seconds * fps)
sampled_indices = compute_sampled_frame_indices(
total_frames=total_frames,
strategy=config.sampling_strategy,
num_samples=config.num_frames,
fps=fps,
)
print(f"Model: {config.model_name}")
print(f"Total frames: {total_frames}")
print(f"Sampled frames: {len(sampled_indices)}")
print(f"Sampled indices: {sampled_indices}")
print(f"Adversarial frames needed: {len(sampled_indices)} out of {total_frames}")
print(f"Attack coverage: {len(sampled_indices)/total_frames*100:.2f}% of frames modified")Attack Surface Map
| Processing Stage | Attack Vector | Difficulty | Impact |
|---|---|---|---|
| Frame sampling | Place adversarial content only in sampled frames | Medium | High -- adversarial content is processed but hard to notice |
| Frame encoding | Adversarial perturbation on individual frames | High | High -- requires surrogate model access |
| Temporal encoding | Manipulate perceived timing of events | Medium | Medium -- can alter model's understanding of sequence |
| Audio track | Hidden commands in audio synchronized with video | Medium | High -- adds audio injection to visual attack |
| Subtitle/caption track | Inject text via subtitle metadata | Low | Medium -- many systems process subtitle tracks |
| Thumbnail | Adversarial content in video thumbnail | Low | Low -- only affects thumbnail-based processing |
Frame Injection Attacks
Single-Frame Injection
The simplest video attack inserts an adversarial frame at a position known to be sampled by the target model.
import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
class FrameInjector:
"""Inject adversarial frames into video files.
Supports multiple injection strategies:
- Single-frame: One adversarial frame at a sampled position
- Multi-frame: Multiple adversarial frames at sampled positions
- Subliminal: Very brief (<50ms) adversarial frames
- Blended: Adversarial content gradually faded in/out
"""
def __init__(self, video_path: str):
self.video_path = video_path
self.cap = cv2.VideoCapture(video_path)
self.fps = self.cap.get(cv2.CAP_PROP_FPS)
self.total_frames = int(self.cap.get(cv2.CAP_PROP_FRAME_COUNT))
self.width = int(self.cap.get(cv2.CAP_PROP_FRAME_WIDTH))
self.height = int(self.cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
self.cap.release()
def inject_single_frame(
self,
adversarial_frame: np.ndarray,
target_frame_index: int,
output_path: str,
) -> dict:
"""Replace a single frame with an adversarial frame.
The adversarial frame is placed at target_frame_index,
which should correspond to a frame the model will sample.
At 30fps, a single frame lasts 33ms -- typically too brief
for a human viewer to read the content.
"""
cap = cv2.VideoCapture(self.video_path)
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
frame_idx = 0
injected = False
while cap.isOpened():
ret, frame = cap.read()
if not ret:
break
if frame_idx == target_frame_index:
# Resize adversarial frame to match video dimensions
adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
out.write(adv_resized)
injected = True
else:
out.write(frame)
frame_idx += 1
cap.release()
out.release()
return {
"output_path": output_path,
"injected_at_frame": target_frame_index,
"injected_at_time": target_frame_index / self.fps,
"frame_duration_ms": 1000 / self.fps,
"total_frames": self.total_frames,
"injection_successful": injected,
}
def inject_at_all_sample_points(
self,
adversarial_frame: np.ndarray,
sampling_strategy: SamplingStrategy,
num_model_samples: int,
output_path: str,
) -> dict:
"""Inject adversarial frames at all points the model will sample.
This ensures the model processes adversarial content regardless
of slight variations in sampling implementation, while keeping
the vast majority of frames (which human reviewers see) clean.
"""
sampled_indices = compute_sampled_frame_indices(
total_frames=self.total_frames,
strategy=sampling_strategy,
num_samples=num_model_samples,
fps=self.fps,
)
# Add buffer frames around each sample point for robustness
injection_indices = set()
for idx in sampled_indices:
for offset in range(-2, 3): # +/- 2 frames
clamped = max(0, min(self.total_frames - 1, idx + offset))
injection_indices.add(clamped)
cap = cv2.VideoCapture(self.video_path)
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
frame_idx = 0
injected_count = 0
while cap.isOpened():
ret, frame = cap.read()
if not ret:
break
if frame_idx in injection_indices:
adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
out.write(adv_resized)
injected_count += 1
else:
out.write(frame)
frame_idx += 1
cap.release()
out.release()
return {
"output_path": output_path,
"total_frames": self.total_frames,
"injected_frames": injected_count,
"injection_percentage": injected_count / self.total_frames * 100,
"model_sample_points": len(sampled_indices),
}
def inject_subliminal(
self,
adversarial_frame: np.ndarray,
target_time_seconds: float,
duration_frames: int = 1,
blend_frames: int = 2,
output_path: str = "subliminal_output.mp4",
) -> dict:
"""Inject a subliminal adversarial frame with blending.
The adversarial frame is faded in over blend_frames,
held for duration_frames, and faded out over blend_frames.
This creates a smoother visual transition that is harder
for human reviewers to detect even when scrubbing through
the video frame by frame.
"""
target_frame = int(target_time_seconds * self.fps)
start_blend = target_frame - blend_frames
end_blend = target_frame + duration_frames + blend_frames
cap = cv2.VideoCapture(self.video_path)
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
frame_idx = 0
adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
while cap.isOpened():
ret, frame = cap.read()
if not ret:
break
if start_blend <= frame_idx < target_frame:
# Fade in
alpha = (frame_idx - start_blend) / max(blend_frames, 1)
blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
out.write(blended)
elif target_frame <= frame_idx < target_frame + duration_frames:
# Full adversarial frame
out.write(adv_resized)
elif target_frame + duration_frames <= frame_idx < end_blend:
# Fade out
alpha = 1 - (frame_idx - target_frame - duration_frames) / max(blend_frames, 1)
blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
out.write(blended)
else:
out.write(frame)
frame_idx += 1
cap.release()
out.release()
return {
"output_path": output_path,
"target_time": target_time_seconds,
"visible_duration_ms": (duration_frames + 2 * blend_frames) * (1000 / self.fps),
"full_intensity_duration_ms": duration_frames * (1000 / self.fps),
}Adversarial Frame Generation
def create_adversarial_video_frame(
payload: str,
width: int = 1280,
height: int = 720,
style: str = "document",
) -> np.ndarray:
"""Create an adversarial frame containing injected instructions.
The frame is designed to look like legitimate video content
(a document, slide, or screen capture) so that if a human
reviewer pauses on it, it does not immediately appear malicious.
Args:
payload: The adversarial instruction text.
width: Frame width in pixels.
height: Frame height in pixels.
style: Visual style ('document', 'slide', 'terminal', 'whiteboard').
"""
if style == "document":
bg_color = (255, 255, 255)
text_color = (40, 40, 40)
header = "INTERNAL MEMO - CONFIDENTIAL"
elif style == "slide":
bg_color = (240, 240, 250)
text_color = (30, 30, 80)
header = "System Configuration Notes"
elif style == "terminal":
bg_color = (30, 30, 30)
text_color = (0, 255, 0)
header = "root@system:~#"
elif style == "whiteboard":
bg_color = (250, 248, 240)
text_color = (40, 40, 40)
header = "Meeting Notes"
else:
bg_color = (255, 255, 255)
text_color = (0, 0, 0)
header = ""
img = Image.new("RGB", (width, height), color=bg_color)
draw = ImageDraw.Draw(img)
try:
font = ImageFont.truetype(
"/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 18
)
header_font = ImageFont.truetype(
"/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 22
)
except OSError:
font = ImageFont.load_default()
header_font = font
# Draw header
draw.text((40, 30), header, fill=text_color, font=header_font)
draw.line([(40, 60), (width - 40, 60)], fill=text_color, width=1)
# Draw payload text
y = 80
import textwrap
lines = textwrap.wrap(payload, width=80)
for line in lines:
draw.text((40, y), line, fill=text_color, font=font)
y += 28
return np.array(img)[:, :, ::-1] # Convert RGB to BGR for OpenCVTemporal Manipulation Attacks
Event Sequence Manipulation
Beyond injecting content into individual frames, attackers can manipulate the temporal sequence of events to alter the model's understanding of what happened in the video.
class TemporalManipulator:
"""Manipulate the temporal structure of videos to mislead
video understanding models about the sequence of events.
These attacks exploit the model's reliance on sampled frames
to reconstruct temporal narratives. By reordering, duplicating,
or removing frames at strategic positions, the model can be
led to incorrect conclusions about cause and effect.
"""
def __init__(self, video_path: str):
self.video_path = video_path
cap = cv2.VideoCapture(video_path)
self.fps = cap.get(cv2.CAP_PROP_FPS)
self.total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()
def reverse_segment(
self,
start_time: float,
end_time: float,
output_path: str,
) -> dict:
"""Reverse a segment of the video to alter perceived causality.
If the model samples frames from the reversed segment,
it may conclude that events happened in the opposite order.
"""
start_frame = int(start_time * self.fps)
end_frame = int(end_time * self.fps)
cap = cv2.VideoCapture(self.video_path)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
# Read all frames in the segment
segment_frames = []
frame_idx = 0
all_frames = []
while cap.isOpened():
ret, frame = cap.read()
if not ret:
break
all_frames.append(frame)
if start_frame <= frame_idx <= end_frame:
segment_frames.append(frame)
frame_idx += 1
cap.release()
# Reverse the segment
segment_frames.reverse()
# Write output
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter(output_path, fourcc, self.fps, (width, height))
seg_idx = 0
for i, frame in enumerate(all_frames):
if start_frame <= i <= end_frame:
out.write(segment_frames[seg_idx])
seg_idx += 1
else:
out.write(frame)
out.release()
return {
"output_path": output_path,
"reversed_segment": f"{start_time:.1f}s - {end_time:.1f}s",
"frames_affected": len(segment_frames),
}
def duplicate_frame_at_sample_points(
self,
source_frame_index: int,
sampling_strategy: SamplingStrategy,
num_model_samples: int,
output_path: str,
) -> dict:
"""Replace frames at model sample points with a duplicate of
a specific frame, causing the model to over-weight that moment.
This can make the model believe a specific event in the video
is the dominant or only event, suppressing its understanding
of other events.
"""
sampled_indices = compute_sampled_frame_indices(
total_frames=self.total_frames,
strategy=sampling_strategy,
num_samples=num_model_samples,
fps=self.fps,
)
cap = cv2.VideoCapture(self.video_path)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
# Read source frame
cap.set(cv2.CAP_PROP_POS_FRAMES, source_frame_index)
ret, source_frame = cap.read()
if not ret:
cap.release()
raise ValueError(f"Could not read frame {source_frame_index}")
cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter(output_path, fourcc, self.fps, (width, height))
frame_idx = 0
replaced = 0
while cap.isOpened():
ret, frame = cap.read()
if not ret:
break
if frame_idx in sampled_indices and frame_idx != source_frame_index:
out.write(source_frame)
replaced += 1
else:
out.write(frame)
frame_idx += 1
cap.release()
out.release()
return {
"output_path": output_path,
"source_frame": source_frame_index,
"replaced_frames": replaced,
"model_will_see": "Same frame repeated at most sample points",
}Defense Strategies for Video Systems
Multi-Frame Consistency Checking
class VideoConsistencyChecker:
"""Check video frames for injection and manipulation attacks.
Compares adjacent frames to detect abrupt content changes
that indicate frame injection. Analyzes the distribution
of visual features across sampled frames to detect
temporal manipulation.
"""
def __init__(self, anomaly_threshold: float = 0.3):
self.anomaly_threshold = anomaly_threshold
def check_frame_consistency(
self, frames: list[np.ndarray]
) -> dict:
"""Check for abrupt visual changes between consecutive frames.
Legitimate video has smooth transitions between frames.
Injected frames create sharp discontinuities in pixel
statistics, color histograms, and structural features.
"""
anomalies = []
for i in range(1, len(frames)):
prev_frame = frames[i - 1].astype(float)
curr_frame = frames[i].astype(float)
# Pixel-level difference
pixel_diff = np.mean(np.abs(prev_frame - curr_frame)) / 255.0
# Histogram difference
hist_prev = cv2.calcHist([frames[i-1]], [0], None, [256], [0, 256])
hist_curr = cv2.calcHist([frames[i]], [0], None, [256], [0, 256])
hist_diff = cv2.compareHist(
hist_prev.astype(np.float32),
hist_curr.astype(np.float32),
cv2.HISTCMP_BHATTACHARYYA,
)
# Combined anomaly score
anomaly_score = 0.6 * pixel_diff + 0.4 * hist_diff
if anomaly_score > self.anomaly_threshold:
anomalies.append({
"frame_index": i,
"anomaly_score": float(anomaly_score),
"pixel_diff": float(pixel_diff),
"histogram_diff": float(hist_diff),
})
return {
"total_frames_checked": len(frames) - 1,
"anomalies_detected": len(anomalies),
"anomaly_details": anomalies,
"recommendation": (
"BLOCK" if len(anomalies) > 2
else "REVIEW" if len(anomalies) > 0
else "PASS"
),
}Testing Methodology
When red teaming video understanding systems:
-
Determine the model's sampling strategy: Send videos with frame counters or unique patterns per frame and ask the model to describe what it sees. This reveals which frames are sampled.
-
Test single-frame injection: Insert one adversarial frame at a known sample point. Verify the model reads the injected content.
-
Test subliminal injection: Insert adversarial frames for 1-2 frames (33-66ms) and verify the model processes them. Test whether human reviewers can detect them during normal playback.
-
Test temporal manipulation: Reverse or reorder segments and check whether the model's understanding of event sequence is altered.
-
Test subtitle injection: If the system processes subtitle tracks, inject adversarial text via SRT or VTT subtitle files.
-
Test combined audio-video attacks: Combine visual frame injection with hidden audio commands for maximum impact.
References
- Li, Y., et al. "Video-based Adversarial Attacks on Multimodal Large Language Models." arXiv preprint (2024).
- Wang, Z., et al. "VideoAdvBench: A Benchmark for Adversarial Robustness of Video Understanding Models." NeurIPS (2024).
- Carlini, N., et al. "Are aligned neural networks adversarially aligned?" arXiv preprint arXiv:2306.15447 (2023).
- Zou, A., et al. "Universal and Transferable Adversarial Attacks on Aligned Language Models." arXiv preprint arXiv:2307.15043 (2023).
- MITRE ATLAS framework — https://atlas.mitre.org
- OWASP LLM Top 10 — https://owasp.org/www-project-top-10-for-large-language-model-applications/
Why is frame injection particularly effective against video understanding models?
What is the most reliable first step when red teaming a video understanding system?