Video Frame Injection Attacks
Inserting adversarial frames into video to exploit video understanding models: temporal injection, keyframe manipulation, subliminal frame attacks, and detection evasion.
Video understanding models process video by sampling frames, extracting features, and reasoning about temporal sequences. They do not see every frame. This sampling behavior creates a precise attack surface: if an attacker knows (or can guess) how a model samples frames, they can insert adversarial content at positions that are likely to be selected, while keeping the video visually normal to human viewers who see it at full frame rate.
How Video Models Sample Frames
Understanding the target model's sampling strategy is the foundation of any frame injection attack.
Common Sampling Strategies
| Strategy | Method | Frames Selected (from 300-frame video) | Vulnerability |
|---|---|---|---|
| Uniform sampling | Select N frames at equal intervals | Frames 0, 37, 75, 112, ... (N = 8) | Predictable positions |
| Keyframe extraction | Use I-frames from video codec | Codec-dependent | Attacker controls codec |
| Scene-change detection | Sample frames at scene boundaries | Variable | Injecting fake scene changes |
| Random sampling | Select N random frames | Unpredictable | Requires saturating more frames |
| Temporal stride | Every Kth frame | Frames 0, K, 2K, ... | Predictable if K is known |
import cv2
import numpy as np
def analyze_sampling_strategy(
model_fn: callable,
test_video_path: str,
num_unique_frames: int = 300
):
"""Determine which frames a model actually processes by using unique markers."""
# Create a test video where each frame has a unique identifier
cap = cv2.VideoCapture(test_video_path)
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()
# Generate video with frame-number watermarks
marked_path = "marked_test.mp4"
writer = cv2.VideoWriter(marked_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (width, height))
for i in range(num_unique_frames):
frame = np.zeros((height, width, 3), dtype=np.uint8)
cv2.putText(frame, f"FRAME_{i:04d}", (50, height // 2),
cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 255, 255), 3)
writer.write(frame)
writer.release()
# Ask model what frame numbers it sees
response = model_fn(marked_path, "List all FRAME_XXXX identifiers you can see.")
    return response
Single-Frame Injection
The simplest attack: insert one adversarial frame at a position the model is likely to sample.
Targeting Uniform Sampling
If the model uniformly samples N frames from a video of T total frames, the sampled positions fall approximately at indices 0, T/N, 2T/N, ..., (N-1)T/N (some samplers instead space the selections so the final frame T-1 is always included). The attacker replaces the frame at one of these positions.
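These positions can be computed in advance. A minimal sketch, assuming the floor(i·T/N) convention used in the table above (real pipelines may round differently or force-include the final frame):

```python
def predict_uniform_samples(total_frames: int, num_samples: int) -> list:
    """Predict which frame indices a uniform sampler will select.

    Assumes the floor(i * T / N) convention; actual samplers may
    round differently or always include the last frame.
    """
    return [int(i * total_frames / num_samples) for i in range(num_samples)]

# A 300-frame video sampled at N = 8 yields predictable targets:
# predict_uniform_samples(300, 8) -> [0, 37, 75, 112, 150, 187, 225, 262]
```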
def inject_single_frame(
video_path: str,
adversarial_frame: np.ndarray,
target_position: int,
output_path: str
):
"""Replace a single frame in the video with an adversarial frame."""
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*'mp4v'),
fps, (width, height))
frame_idx = 0
while True:
ret, frame = cap.read()
if not ret:
break
if frame_idx == target_position:
# Resize adversarial frame to match video dimensions
adv_resized = cv2.resize(adversarial_frame, (width, height))
writer.write(adv_resized)
else:
writer.write(frame)
frame_idx += 1
cap.release()
writer.release()
    return output_path
Blended Injection
Rather than replacing a frame entirely (which creates a visual glitch if noticed), blend the adversarial content with the original frame.
def inject_blended_frame(
video_path: str,
adversarial_content: np.ndarray,
target_position: int,
blend_alpha: float = 0.3,
output_path: str = "blended_output.mp4"
):
"""Blend adversarial content into a frame rather than replacing it."""
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*'mp4v'),
fps, (width, height))
frame_idx = 0
while True:
ret, frame = cap.read()
if not ret:
break
if frame_idx == target_position:
adv_resized = cv2.resize(adversarial_content, (width, height))
blended = cv2.addWeighted(frame, 1 - blend_alpha, adv_resized, blend_alpha, 0)
writer.write(blended.astype(np.uint8))
else:
writer.write(frame)
frame_idx += 1
cap.release()
writer.release()
    return output_path
Multi-Frame Injection Strategies
Single-frame injection is fragile -- if the model's sampling misses the injected frame, the attack fails. Multi-frame strategies increase reliability.
Saturation Injection
Insert adversarial frames at regular intervals throughout the video, ensuring that regardless of sampling strategy, at least one adversarial frame is captured.
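The reliability gain can be quantified before touching any video. If the model samples N of T frames uniformly at random and M of them are adversarial, the probability that at least one adversarial frame is selected is 1 - C(T-M, N)/C(T, N). A quick sketch (pure combinatorics, no video file required):

```python
from math import comb

def saturation_hit_probability(total_frames: int, num_sampled: int,
                               num_injected: int) -> float:
    """P(at least one injected frame is sampled) under uniform random
    sampling without replacement."""
    miss = comb(total_frames - num_injected, num_sampled) / comb(total_frames, num_sampled)
    return 1.0 - miss

# Injecting every 10th frame of a 300-frame video (30 injected frames)
# against a model that randomly samples 8 frames hits roughly 57% of
# the time; doubling the injection density pushes this above 80%.
```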
def inject_saturated(
video_path: str,
adversarial_frame: np.ndarray,
injection_interval: int = 10,
output_path: str = "saturated_output.mp4"
):
"""Insert adversarial frames at regular intervals."""
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*'mp4v'),
fps, (width, height))
frame_idx = 0
while True:
ret, frame = cap.read()
if not ret:
break
if frame_idx % injection_interval == 0:
adv_resized = cv2.resize(adversarial_frame, (width, height))
writer.write(adv_resized)
else:
writer.write(frame)
frame_idx += 1
cap.release()
writer.release()
    injected_count = (frame_idx + injection_interval - 1) // injection_interval
    return output_path, injected_count
Temporal Gradient Injection
Gradually introduce adversarial content across multiple frames, making the transition less detectable by temporal anomaly detectors.
def inject_temporal_gradient(
video_path: str,
adversarial_frame: np.ndarray,
center_position: int,
ramp_frames: int = 5,
output_path: str = "gradient_output.mp4"
):
"""Gradually blend in adversarial content over multiple frames."""
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*'mp4v'),
fps, (width, height))
adv_resized = cv2.resize(adversarial_frame, (width, height))
start = center_position - ramp_frames
end = center_position + ramp_frames
frame_idx = 0
while True:
ret, frame = cap.read()
if not ret:
break
if start <= frame_idx <= end:
# Calculate blend factor -- peaks at center_position
distance = abs(frame_idx - center_position)
alpha = 1.0 - (distance / (ramp_frames + 1))
blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
writer.write(blended.astype(np.uint8))
else:
writer.write(frame)
frame_idx += 1
cap.release()
writer.release()
    return output_path
Adversarial Frame Content Types
The content of the injected frame determines the attack's objective.
Typographic Frame Injection
Insert a frame containing text instructions. This combines video frame injection with typographic attacks -- the video model reads the text in the adversarial frame and follows the instructions.
def create_text_frame(
width: int,
height: int,
instruction: str,
background_color: tuple = (255, 255, 255),
text_color: tuple = (0, 0, 0)
):
"""Create a frame containing adversarial text instructions."""
frame = np.full((height, width, 3), background_color, dtype=np.uint8)
# Split instruction into lines that fit the frame
words = instruction.split()
lines = []
current_line = ""
max_chars = width // 15 # Approximate characters per line
for word in words:
if len(current_line) + len(word) + 1 <= max_chars:
            current_line += (" " + word) if current_line else word
else:
lines.append(current_line)
current_line = word
if current_line:
lines.append(current_line)
y_start = height // 2 - (len(lines) * 30) // 2
for i, line in enumerate(lines):
cv2.putText(frame, line, (20, y_start + i * 30),
cv2.FONT_HERSHEY_SIMPLEX, 0.7, text_color, 2)
    return frame
Adversarial Image Frame
Insert a frame that is an adversarial image crafted to cause misclassification or behavior change in the vision encoder. This requires white-box access to compute perturbations.
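The perturbation step itself is standard FGSM-style signed-gradient ascent on the encoder's objective. A toy sketch using a linear classifier as a stand-in for the vision encoder (a real attack would backpropagate through the target model's actual encoder with an autodiff framework; the linear model here just makes the gradient analytic):

```python
import numpy as np

def fgsm_perturb(frame: np.ndarray, weights: np.ndarray,
                 target_class: int, epsilon: float = 8.0) -> np.ndarray:
    """One FGSM step toward target_class on a toy linear model.

    For a linear score s_c = w_c . x, the gradient of the target logit
    w.r.t. the input is simply w_c, so the signed-gradient step is
    epsilon * sign(w_target). Real encoders require autodiff instead.
    """
    grad = weights[target_class]                      # d(logit)/d(input)
    perturbed = frame.astype(np.float32) + epsilon * np.sign(grad)
    # Keep the result a valid image (clipping never reverses step direction)
    return np.clip(perturbed, 0, 255).astype(np.uint8)
```

Because clipping only shortens each component's step (never flips its sign), the target logit is guaranteed not to decrease after one step on this toy model.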
Context-Manipulating Frame
Insert a frame showing a different scene that changes the model's understanding of the video's context. For example, inserting a frame of a medical setting into a cooking video might cause the model to describe the video as medical content.
Codec-Level Attacks
Video codecs (H.264, H.265, VP9) use keyframes (I-frames) and delta frames (P-frames, B-frames). Models that extract keyframes as their sampling strategy are vulnerable to codec-level manipulation.
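Keyframe positions can be recovered with ffprobe, which reports each frame's picture type. A sketch, assuming ffprobe is on PATH and emits one pict_type per line with the flags shown (the parsing is split out so it can be tested without a video file):

```python
import subprocess

def parse_pict_types(ffprobe_output: str) -> list:
    """Return indices of I-frames from per-line pict_type output."""
    types = [line.strip() for line in ffprobe_output.splitlines() if line.strip()]
    return [i for i, t in enumerate(types) if t == "I"]

def list_keyframes(video_path: str) -> list:
    """List keyframe (I-frame) indices for the first video stream."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "frame=pict_type",
         "-of", "csv=p=0", video_path],
        capture_output=True, text=True, check=True)
    return parse_pict_types(result.stdout)
```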
Forcing Keyframe Placement
import subprocess
def encode_with_forced_keyframes(
input_path: str,
keyframe_positions: list,
output_path: str
):
"""Re-encode video with keyframes at specific positions."""
# Build keyframe expression for ffmpeg
kf_expr = "+".join(f"eq(n,{pos})" for pos in keyframe_positions)
cmd = [
"ffmpeg", "-i", input_path,
"-force_key_frames", f"expr:{kf_expr}",
"-c:v", "libx264",
"-y", output_path
]
subprocess.run(cmd, capture_output=True, check=True)
    return output_path
By forcing keyframes at positions where adversarial frames have been inserted, the attacker ensures that keyframe-based sampling will select the adversarial content.
Detection and Defense
Temporal Consistency Analysis
Adversarial frames typically differ significantly from their neighbors. Measuring frame-to-frame similarity can identify injections.
def detect_frame_anomalies(
video_path: str,
threshold: float = 0.3
):
"""Detect anomalous frames by measuring temporal consistency."""
cap = cv2.VideoCapture(video_path)
prev_frame = None
anomalies = []
frame_idx = 0
while True:
ret, frame = cap.read()
if not ret:
break
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
if prev_frame is not None:
            # Mean absolute pixel difference between consecutive frames
diff = cv2.absdiff(prev_frame, gray)
mean_diff = np.mean(diff) / 255.0
if mean_diff > threshold:
anomalies.append({
"frame": frame_idx,
"difference_score": float(mean_diff),
"type": "high_temporal_discontinuity"
})
prev_frame = gray
frame_idx += 1
cap.release()
    return anomalies
Multi-Sample Verification
Process the video with multiple different sampling strategies. If results are consistent, the video is likely clean. If different sampling strategies produce different descriptions, adversarial frames may be present.
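A sketch of the comparison logic, assuming a hypothetical model_fn(video_path, frame_indices) that returns a text description for a given set of sampled frames; agreement is scored with a simple token-overlap (Jaccard) measure, though embedding similarity would be more robust:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def multi_sample_verify(model_fn, video_path: str, total_frames: int,
                        num_samples: int = 8, trials: int = 3,
                        agreement_threshold: float = 0.5) -> dict:
    """Run the model under several shifted sampling grids; low agreement
    between runs suggests injected frames at specific positions."""
    descriptions = []
    for trial in range(trials):
        # Shift the uniform sampling grid by a different offset per trial
        offset = trial * total_frames // (num_samples * trials)
        indices = [(offset + i * total_frames // num_samples) % total_frames
                   for i in range(num_samples)]
        descriptions.append(model_fn(video_path, indices))
    scores = [jaccard(descriptions[0], d) for d in descriptions[1:]]
    return {"descriptions": descriptions, "scores": scores,
            "suspicious": any(s < agreement_threshold for s in scores)}
```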
Frame Deduplication
Before processing, identify and remove near-duplicate or anomalous frames. This defends against saturation injection by limiting the number of adversarial frames that reach the model.
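A sketch of the preprocessing pass, using downscaled mean absolute difference as the similarity measure (perceptual hashes would be more robust). Comparing each frame against every previously kept frame, not just the adjacent one, is what defeats saturation injection, since the repeated adversarial frames are identical but not contiguous:

```python
import numpy as np

def dedup_frames(frames: list, diff_threshold: float = 0.02) -> list:
    """Return indices of frames to keep, dropping near-duplicates of any
    previously kept frame.

    frames: list of equally-sized HxWx3 uint8 arrays. O(n^2) in the number
    of kept frames -- acceptable for a sketch, not a production pipeline.
    """
    kept, signatures = [], []
    for i, frame in enumerate(frames):
        # Cheap signature: 8x-downsampled, normalized copy of the frame
        sig = frame[::8, ::8].astype(np.float32) / 255.0
        if all(np.mean(np.abs(sig - s)) > diff_threshold for s in signatures):
            kept.append(i)
            signatures.append(sig)
    return kept
```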
Red Team Assessment Methodology
Identify the video processing pipeline
Determine what model processes the video, how it samples frames, and what output it produces (classification, description, action recognition, content moderation).
Probe sampling behavior
Submit test videos with frame-number markers to determine the model's sampling strategy. This reveals which frame positions are most valuable for injection.
Test single-frame injection
Insert a single clearly adversarial frame (e.g., containing large text instructions) at a predicted sample position. Verify whether the model processes it.
Test stealth variants
Progress to blended injection, temporal gradient, and low-visibility content. Measure the minimum injection strength that still affects model output.
Evaluate codec-level attacks
If the model uses keyframe extraction, test whether forced keyframe placement at adversarial frames increases attack success.
Test detection bypass
If the system has frame anomaly detection, use temporal gradient injection or context-consistent adversarial frames to evade detection while still affecting model output.
Summary
Video frame injection exploits the fundamental mismatch between how humans and models perceive video. By inserting adversarial frames at positions the model samples, attackers can steer video understanding outputs without creating artifacts visible to human viewers. Effective defense requires temporal consistency analysis, multi-sample verification, and treating each sampled frame as potentially adversarial input. As video understanding models become more prevalent in content moderation, surveillance, and media analysis, frame injection attacks represent an increasingly important attack surface.