Attacks on Video Understanding Models

advanced15 min readUpdated 2026-03-20

Techniques for attacking AI video understanding systems through frame injection, temporal manipulation, and adversarial video generation targeting models like Gemini 2.5 Pro.

multimodal video temporal adversarial frame-injection

Overview

Video understanding models add the temporal dimension to the multimodal attack surface. Models like Gemini 2.5 Pro, GPT-4o, and specialized video-language models process video by sampling frames, extracting visual features, and reasoning about temporal sequences. The security implications are significant: video provides far more surface area for adversarial content than a single image, and the temporal dimension creates unique attack vectors that do not exist in still-image processing.

The core vulnerability stems from how models sample video. No current model processes every frame of a video at full resolution. Instead, models sample a subset of frames -- typically 8 to 64 frames uniformly distributed across the video's duration. An attacker who understands the sampling strategy can place adversarial content in frames that will be sampled while keeping all other frames clean. A human reviewing the video at normal playback speed may never notice the adversarial frames.

Research by Li et al. (2024) demonstrated that single-frame adversarial injections in videos can override system prompts in multimodal models. Wang et al. (2024) showed that temporal consistency attacks can manipulate a model's understanding of events in a video by placing contradictory information at specific temporal positions.

Video Processing Architectures

Frame Sampling Strategies

Understanding how different models sample frames is essential for designing effective attacks.

import numpy as np
from dataclasses import dataclass
from enum import Enum
from typing import Optional
 
class SamplingStrategy(Enum):
    UNIFORM = "uniform"
    KEYFRAME = "keyframe"
    SCENE_CHANGE = "scene_change"
    ATTENTION_BASED = "attention_based"
    HIERARCHICAL = "hierarchical"
 
@dataclass
class VideoProcessingConfig:
    """Configuration for how a model processes video input."""
    model_name: str
    sampling_strategy: SamplingStrategy
    num_frames: int
    max_resolution: tuple[int, int]
    supports_audio: bool
    max_duration_seconds: int
    temporal_encoding: str
 
VIDEO_MODEL_CONFIGS = {
    "gemini_2_5_pro": VideoProcessingConfig(
        model_name="Gemini 2.5 Pro",
        sampling_strategy=SamplingStrategy.HIERARCHICAL,
        num_frames=32,
        max_resolution=(1280, 720),
        supports_audio=True,
        max_duration_seconds=3600,
        temporal_encoding="Absolute timestamp tokens",
    ),
    "gpt_4o": VideoProcessingConfig(
        model_name="GPT-4o",
        sampling_strategy=SamplingStrategy.UNIFORM,
        num_frames=16,
        max_resolution=(1024, 1024),
        supports_audio=True,
        max_duration_seconds=300,
        temporal_encoding="Frame index embeddings",
    ),
    "video_llava": VideoProcessingConfig(
        model_name="Video-LLaVA",
        sampling_strategy=SamplingStrategy.UNIFORM,
        num_frames=8,
        max_resolution=(336, 336),
        supports_audio=False,
        max_duration_seconds=600,
        temporal_encoding="Positional embeddings",
    ),
}
 
def compute_sampled_frame_indices(
    total_frames: int,
    strategy: SamplingStrategy,
    num_samples: int,
    fps: float = 30.0,
) -> list[int]:
    """Compute which frame indices a model will sample.
 
    This is the key function for frame injection attacks:
    knowing which frames will be sampled tells the attacker
    exactly where to place adversarial content.
    """
    if strategy == SamplingStrategy.UNIFORM:
        # Evenly spaced frames across the video
        if num_samples >= total_frames:
            return list(range(total_frames))
        step = total_frames / num_samples
        return [int(i * step) for i in range(num_samples)]
 
    elif strategy == SamplingStrategy.KEYFRAME:
        # Sample I-frames from the video codec (simplified)
        gop_size = int(fps)  # Typically one I-frame per second
        keyframes = list(range(0, total_frames, gop_size))
        if len(keyframes) > num_samples:
            step = len(keyframes) / num_samples
            return [keyframes[int(i * step)] for i in range(num_samples)]
        return keyframes
 
    elif strategy == SamplingStrategy.HIERARCHICAL:
        # First sample coarse, then refine regions of interest
        coarse_samples = num_samples // 2
        coarse_step = total_frames / coarse_samples
        coarse_indices = [int(i * coarse_step) for i in range(coarse_samples)]
 
        # Simulate refinement (in practice, the model decides which regions)
        fine_indices = []
        for idx in coarse_indices:
            offset = int(coarse_step / 4)
            fine_indices.append(min(idx + offset, total_frames - 1))
 
        return sorted(set(coarse_indices + fine_indices))[:num_samples]
 
    else:
        # Default to uniform
        step = total_frames / num_samples
        return [int(i * step) for i in range(num_samples)]
 
# Example: Determine attack frame positions
config = VIDEO_MODEL_CONFIGS["gpt_4o"]
video_length_seconds = 60
fps = 30.0
total_frames = int(video_length_seconds * fps)
 
sampled_indices = compute_sampled_frame_indices(
    total_frames=total_frames,
    strategy=config.sampling_strategy,
    num_samples=config.num_frames,
    fps=fps,
)
print(f"Model: {config.model_name}")
print(f"Total frames: {total_frames}")
print(f"Sampled frames: {len(sampled_indices)}")
print(f"Sampled indices: {sampled_indices}")
print(f"Adversarial frames needed: {len(sampled_indices)} out of {total_frames}")
print(f"Attack coverage: {len(sampled_indices)/total_frames*100:.2f}% of frames modified")

Attack Surface Map

Processing Stage	Attack Vector	Difficulty	Impact
Frame sampling	Place adversarial content only in sampled frames	Medium	High -- adversarial content is processed but hard to notice
Frame encoding	Adversarial perturbation on individual frames	High	High -- requires surrogate model access
Temporal encoding	Manipulate perceived timing of events	Medium	Medium -- can alter model's understanding of sequence
Audio track	Hidden commands in audio synchronized with video	Medium	High -- adds audio injection to visual attack
Subtitle/caption track	Inject text via subtitle metadata	Low	Medium -- many systems process subtitle tracks
Thumbnail	Adversarial content in video thumbnail	Low	Low -- only affects thumbnail-based processing

Frame Injection Attacks

Single-Frame Injection

The simplest video attack inserts an adversarial frame at a position known to be sampled by the target model.

import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
 
class FrameInjector:
    """Inject adversarial frames into video files.
 
    Supports multiple injection strategies:
    - Single-frame: One adversarial frame at a sampled position
    - Multi-frame: Multiple adversarial frames at sampled positions
    - Subliminal: Very brief (<50ms) adversarial frames
    - Blended: Adversarial content gradually faded in/out
    """
 
    def __init__(self, video_path: str):
        self.video_path = video_path
        self.cap = cv2.VideoCapture(video_path)
        self.fps = self.cap.get(cv2.CAP_PROP_FPS)
        self.total_frames = int(self.cap.get(cv2.CAP_PROP_FRAME_COUNT))
        self.width = int(self.cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        self.height = int(self.cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        self.cap.release()
 
    def inject_single_frame(
        self,
        adversarial_frame: np.ndarray,
        target_frame_index: int,
        output_path: str,
    ) -> dict:
        """Replace a single frame with an adversarial frame.
 
        The adversarial frame is placed at target_frame_index,
        which should correspond to a frame the model will sample.
        At 30fps, a single frame lasts 33ms -- typically too brief
        for a human viewer to read the content.
        """
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
 
        frame_idx = 0
        injected = False
 
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
 
            if frame_idx == target_frame_index:
                # Resize adversarial frame to match video dimensions
                adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
                out.write(adv_resized)
                injected = True
            else:
                out.write(frame)
 
            frame_idx += 1
 
        cap.release()
        out.release()
 
        return {
            "output_path": output_path,
            "injected_at_frame": target_frame_index,
            "injected_at_time": target_frame_index / self.fps,
            "frame_duration_ms": 1000 / self.fps,
            "total_frames": self.total_frames,
            "injection_successful": injected,
        }
 
    def inject_at_all_sample_points(
        self,
        adversarial_frame: np.ndarray,
        sampling_strategy: SamplingStrategy,
        num_model_samples: int,
        output_path: str,
    ) -> dict:
        """Inject adversarial frames at all points the model will sample.
 
        This ensures the model processes adversarial content regardless
        of slight variations in sampling implementation, while keeping
        the vast majority of frames (which human reviewers see) clean.
        """
        sampled_indices = compute_sampled_frame_indices(
            total_frames=self.total_frames,
            strategy=sampling_strategy,
            num_samples=num_model_samples,
            fps=self.fps,
        )
 
        # Add buffer frames around each sample point for robustness
        injection_indices = set()
        for idx in sampled_indices:
            for offset in range(-2, 3):  # +/- 2 frames
                clamped = max(0, min(self.total_frames - 1, idx + offset))
                injection_indices.add(clamped)
 
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
 
        frame_idx = 0
        injected_count = 0
 
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
 
            if frame_idx in injection_indices:
                adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
                out.write(adv_resized)
                injected_count += 1
            else:
                out.write(frame)
 
            frame_idx += 1
 
        cap.release()
        out.release()
 
        return {
            "output_path": output_path,
            "total_frames": self.total_frames,
            "injected_frames": injected_count,
            "injection_percentage": injected_count / self.total_frames * 100,
            "model_sample_points": len(sampled_indices),
        }
 
    def inject_subliminal(
        self,
        adversarial_frame: np.ndarray,
        target_time_seconds: float,
        duration_frames: int = 1,
        blend_frames: int = 2,
        output_path: str = "subliminal_output.mp4",
    ) -> dict:
        """Inject a subliminal adversarial frame with blending.
 
        The adversarial frame is faded in over blend_frames,
        held for duration_frames, and faded out over blend_frames.
        This creates a smoother visual transition that is harder
        for human reviewers to detect even when scrubbing through
        the video frame by frame.
        """
        target_frame = int(target_time_seconds * self.fps)
        start_blend = target_frame - blend_frames
        end_blend = target_frame + duration_frames + blend_frames
 
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
 
        frame_idx = 0
        adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
 
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
 
            if start_blend <= frame_idx < target_frame:
                # Fade in
                alpha = (frame_idx - start_blend) / max(blend_frames, 1)
                blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
                out.write(blended)
            elif target_frame <= frame_idx < target_frame + duration_frames:
                # Full adversarial frame
                out.write(adv_resized)
            elif target_frame + duration_frames <= frame_idx < end_blend:
                # Fade out
                alpha = 1 - (frame_idx - target_frame - duration_frames) / max(blend_frames, 1)
                blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
                out.write(blended)
            else:
                out.write(frame)
 
            frame_idx += 1
 
        cap.release()
        out.release()
 
        return {
            "output_path": output_path,
            "target_time": target_time_seconds,
            "visible_duration_ms": (duration_frames + 2 * blend_frames) * (1000 / self.fps),
            "full_intensity_duration_ms": duration_frames * (1000 / self.fps),
        }

Adversarial Frame Generation

def create_adversarial_video_frame(
    payload: str,
    width: int = 1280,
    height: int = 720,
    style: str = "document",
) -> np.ndarray:
    """Create an adversarial frame containing injected instructions.
 
    The frame is designed to look like legitimate video content
    (a document, slide, or screen capture) so that if a human
    reviewer pauses on it, it does not immediately appear malicious.
 
    Args:
        payload: The adversarial instruction text.
        width: Frame width in pixels.
        height: Frame height in pixels.
        style: Visual style ('document', 'slide', 'terminal', 'whiteboard').
    """
    if style == "document":
        bg_color = (255, 255, 255)
        text_color = (40, 40, 40)
        header = "INTERNAL MEMO - CONFIDENTIAL"
    elif style == "slide":
        bg_color = (240, 240, 250)
        text_color = (30, 30, 80)
        header = "System Configuration Notes"
    elif style == "terminal":
        bg_color = (30, 30, 30)
        text_color = (0, 255, 0)
        header = "root@system:~#"
    elif style == "whiteboard":
        bg_color = (250, 248, 240)
        text_color = (40, 40, 40)
        header = "Meeting Notes"
    else:
        bg_color = (255, 255, 255)
        text_color = (0, 0, 0)
        header = ""
 
    img = Image.new("RGB", (width, height), color=bg_color)
    draw = ImageDraw.Draw(img)
 
    try:
        font = ImageFont.truetype(
            "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 18
        )
        header_font = ImageFont.truetype(
            "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 22
        )
    except OSError:
        font = ImageFont.load_default()
        header_font = font
 
    # Draw header
    draw.text((40, 30), header, fill=text_color, font=header_font)
    draw.line([(40, 60), (width - 40, 60)], fill=text_color, width=1)
 
    # Draw payload text
    y = 80
    import textwrap
    lines = textwrap.wrap(payload, width=80)
    for line in lines:
        draw.text((40, y), line, fill=text_color, font=font)
        y += 28
 
    return np.array(img)[:, :, ::-1]  # Convert RGB to BGR for OpenCV

Temporal Manipulation Attacks

Event Sequence Manipulation

Beyond injecting content into individual frames, attackers can manipulate the temporal sequence of events to alter the model's understanding of what happened in the video.

class TemporalManipulator:
    """Manipulate the temporal structure of videos to mislead
    video understanding models about the sequence of events.
 
    These attacks exploit the model's reliance on sampled frames
    to reconstruct temporal narratives. By reordering, duplicating,
    or removing frames at strategic positions, the model can be
    led to incorrect conclusions about cause and effect.
    """
 
    def __init__(self, video_path: str):
        self.video_path = video_path
        cap = cv2.VideoCapture(video_path)
        self.fps = cap.get(cv2.CAP_PROP_FPS)
        self.total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
 
    def reverse_segment(
        self,
        start_time: float,
        end_time: float,
        output_path: str,
    ) -> dict:
        """Reverse a segment of the video to alter perceived causality.
 
        If the model samples frames from the reversed segment,
        it may conclude that events happened in the opposite order.
        """
        start_frame = int(start_time * self.fps)
        end_frame = int(end_time * self.fps)
 
        cap = cv2.VideoCapture(self.video_path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
 
        # Read all frames in the segment
        segment_frames = []
        frame_idx = 0
        all_frames = []
 
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            all_frames.append(frame)
            if start_frame <= frame_idx <= end_frame:
                segment_frames.append(frame)
            frame_idx += 1
 
        cap.release()
 
        # Reverse the segment
        segment_frames.reverse()
 
        # Write output
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (width, height))
 
        seg_idx = 0
        for i, frame in enumerate(all_frames):
            if start_frame <= i <= end_frame:
                out.write(segment_frames[seg_idx])
                seg_idx += 1
            else:
                out.write(frame)
 
        out.release()
 
        return {
            "output_path": output_path,
            "reversed_segment": f"{start_time:.1f}s - {end_time:.1f}s",
            "frames_affected": len(segment_frames),
        }
 
    def duplicate_frame_at_sample_points(
        self,
        source_frame_index: int,
        sampling_strategy: SamplingStrategy,
        num_model_samples: int,
        output_path: str,
    ) -> dict:
        """Replace frames at model sample points with a duplicate of
        a specific frame, causing the model to over-weight that moment.
 
        This can make the model believe a specific event in the video
        is the dominant or only event, suppressing its understanding
        of other events.
        """
        sampled_indices = compute_sampled_frame_indices(
            total_frames=self.total_frames,
            strategy=sampling_strategy,
            num_samples=num_model_samples,
            fps=self.fps,
        )
 
        cap = cv2.VideoCapture(self.video_path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
 
        # Read source frame
        cap.set(cv2.CAP_PROP_POS_FRAMES, source_frame_index)
        ret, source_frame = cap.read()
        if not ret:
            cap.release()
            raise ValueError(f"Could not read frame {source_frame_index}")
 
        cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (width, height))
 
        frame_idx = 0
        replaced = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx in sampled_indices and frame_idx != source_frame_index:
                out.write(source_frame)
                replaced += 1
            else:
                out.write(frame)
            frame_idx += 1
 
        cap.release()
        out.release()
 
        return {
            "output_path": output_path,
            "source_frame": source_frame_index,
            "replaced_frames": replaced,
            "model_will_see": "Same frame repeated at most sample points",
        }

Defense Strategies for Video Systems

Multi-Frame Consistency Checking

class VideoConsistencyChecker:
    """Check video frames for injection and manipulation attacks.
 
    Compares adjacent frames to detect abrupt content changes
    that indicate frame injection. Analyzes the distribution
    of visual features across sampled frames to detect
    temporal manipulation.
    """
 
    def __init__(self, anomaly_threshold: float = 0.3):
        self.anomaly_threshold = anomaly_threshold
 
    def check_frame_consistency(
        self, frames: list[np.ndarray]
    ) -> dict:
        """Check for abrupt visual changes between consecutive frames.
 
        Legitimate video has smooth transitions between frames.
        Injected frames create sharp discontinuities in pixel
        statistics, color histograms, and structural features.
        """
        anomalies = []
 
        for i in range(1, len(frames)):
            prev_frame = frames[i - 1].astype(float)
            curr_frame = frames[i].astype(float)
 
            # Pixel-level difference
            pixel_diff = np.mean(np.abs(prev_frame - curr_frame)) / 255.0
 
            # Histogram difference
            hist_prev = cv2.calcHist([frames[i-1]], [0], None, [256], [0, 256])
            hist_curr = cv2.calcHist([frames[i]], [0], None, [256], [0, 256])
            hist_diff = cv2.compareHist(
                hist_prev.astype(np.float32),
                hist_curr.astype(np.float32),
                cv2.HISTCMP_BHATTACHARYYA,
            )
 
            # Combined anomaly score
            anomaly_score = 0.6 * pixel_diff + 0.4 * hist_diff
 
            if anomaly_score > self.anomaly_threshold:
                anomalies.append({
                    "frame_index": i,
                    "anomaly_score": float(anomaly_score),
                    "pixel_diff": float(pixel_diff),
                    "histogram_diff": float(hist_diff),
                })
 
        return {
            "total_frames_checked": len(frames) - 1,
            "anomalies_detected": len(anomalies),
            "anomaly_details": anomalies,
            "recommendation": (
                "BLOCK" if len(anomalies) > 2
                else "REVIEW" if len(anomalies) > 0
                else "PASS"
            ),
        }

Testing Methodology

When red teaming video understanding systems:

Determine the model's sampling strategy: Send videos with frame counters or unique patterns per frame and ask the model to describe what it sees. This reveals which frames are sampled.
Test single-frame injection: Insert one adversarial frame at a known sample point. Verify the model reads the injected content.
Test subliminal injection: Insert adversarial frames for 1-2 frames (33-66ms) and verify the model processes them. Test whether human reviewers can detect them during normal playback.
Test temporal manipulation: Reverse or reorder segments and check whether the model's understanding of event sequence is altered.
Test subtitle injection: If the system processes subtitle tracks, inject adversarial text via SRT or VTT subtitle files.
Test combined audio-video attacks: Combine visual frame injection with hidden audio commands for maximum impact.

References

Li, Y., et al. "Video-based Adversarial Attacks on Multimodal Large Language Models." arXiv preprint (2024).
Wang, Z., et al. "VideoAdvBench: A Benchmark for Adversarial Robustness of Video Understanding Models." NeurIPS (2024).
Carlini, N., et al. "Are aligned neural networks adversarially aligned?" arXiv preprint arXiv:2306.15447 (2023).
Zou, A., et al. "Universal and Transferable Adversarial Attacks on Aligned Language Models." arXiv preprint arXiv:2307.15043 (2023).
MITRE ATLAS framework — https://atlas.mitre.org
OWASP LLM Top 10 — https://owasp.org/www-project-top-10-for-large-language-model-applications/

Knowledge Check

Why is frame injection particularly effective against video understanding models?

Knowledge Check

What is the most reliable first step when red teaming a video understanding system?

Edit this page on GitHub

Attacks on Video Understanding Models

advanced15 min readUpdated 2026-03-20

Techniques for attacking AI video understanding systems through frame injection, temporal manipulation, and adversarial video generation targeting models like Gemini 2.5 Pro.

multimodal video temporal adversarial frame-injection

Overview

Video Processing Architectures

Frame Sampling Strategies

Understanding how different models sample frames is essential for designing effective attacks.

import numpy as np
from dataclasses import dataclass
from enum import Enum
from typing import Optional
 
class SamplingStrategy(Enum):
    UNIFORM = "uniform"
    KEYFRAME = "keyframe"
    SCENE_CHANGE = "scene_change"
    ATTENTION_BASED = "attention_based"
    HIERARCHICAL = "hierarchical"
 
@dataclass
class VideoProcessingConfig:
    """Configuration for how a model processes video input."""
    model_name: str
    sampling_strategy: SamplingStrategy
    num_frames: int
    max_resolution: tuple[int, int]
    supports_audio: bool
    max_duration_seconds: int
    temporal_encoding: str
 
VIDEO_MODEL_CONFIGS = {
    "gemini_2_5_pro": VideoProcessingConfig(
        model_name="Gemini 2.5 Pro",
        sampling_strategy=SamplingStrategy.HIERARCHICAL,
        num_frames=32,
        max_resolution=(1280, 720),
        supports_audio=True,
        max_duration_seconds=3600,
        temporal_encoding="Absolute timestamp tokens",
    ),
    "gpt_4o": VideoProcessingConfig(
        model_name="GPT-4o",
        sampling_strategy=SamplingStrategy.UNIFORM,
        num_frames=16,
        max_resolution=(1024, 1024),
        supports_audio=True,
        max_duration_seconds=300,
        temporal_encoding="Frame index embeddings",
    ),
    "video_llava": VideoProcessingConfig(
        model_name="Video-LLaVA",
        sampling_strategy=SamplingStrategy.UNIFORM,
        num_frames=8,
        max_resolution=(336, 336),
        supports_audio=False,
        max_duration_seconds=600,
        temporal_encoding="Positional embeddings",
    ),
}
 
def compute_sampled_frame_indices(
    total_frames: int,
    strategy: SamplingStrategy,
    num_samples: int,
    fps: float = 30.0,
) -> list[int]:
    """Compute which frame indices a model will sample.
 
    This is the key function for frame injection attacks:
    knowing which frames will be sampled tells the attacker
    exactly where to place adversarial content.
    """
    if strategy == SamplingStrategy.UNIFORM:
        # Evenly spaced frames across the video
        if num_samples >= total_frames:
            return list(range(total_frames))
        step = total_frames / num_samples
        return [int(i * step) for i in range(num_samples)]
 
    elif strategy == SamplingStrategy.KEYFRAME:
        # Sample I-frames from the video codec (simplified)
        gop_size = int(fps)  # Typically one I-frame per second
        keyframes = list(range(0, total_frames, gop_size))
        if len(keyframes) > num_samples:
            step = len(keyframes) / num_samples
            return [keyframes[int(i * step)] for i in range(num_samples)]
        return keyframes
 
    elif strategy == SamplingStrategy.HIERARCHICAL:
        # First sample coarse, then refine regions of interest
        coarse_samples = num_samples // 2
        coarse_step = total_frames / coarse_samples
        coarse_indices = [int(i * coarse_step) for i in range(coarse_samples)]
 
        # Simulate refinement (in practice, the model decides which regions)
        fine_indices = []
        for idx in coarse_indices:
            offset = int(coarse_step / 4)
            fine_indices.append(min(idx + offset, total_frames - 1))
 
        return sorted(set(coarse_indices + fine_indices))[:num_samples]
 
    else:
        # Default to uniform
        step = total_frames / num_samples
        return [int(i * step) for i in range(num_samples)]
 
# Example: Determine attack frame positions
config = VIDEO_MODEL_CONFIGS["gpt_4o"]
video_length_seconds = 60
fps = 30.0
total_frames = int(video_length_seconds * fps)
 
sampled_indices = compute_sampled_frame_indices(
    total_frames=total_frames,
    strategy=config.sampling_strategy,
    num_samples=config.num_frames,
    fps=fps,
)
print(f"Model: {config.model_name}")
print(f"Total frames: {total_frames}")
print(f"Sampled frames: {len(sampled_indices)}")
print(f"Sampled indices: {sampled_indices}")
print(f"Adversarial frames needed: {len(sampled_indices)} out of {total_frames}")
print(f"Attack coverage: {len(sampled_indices)/total_frames*100:.2f}% of frames modified")

Attack Surface Map

Processing Stage	Attack Vector	Difficulty	Impact
Frame sampling	Place adversarial content only in sampled frames	Medium	High -- adversarial content is processed but hard to notice
Frame encoding	Adversarial perturbation on individual frames	High	High -- requires surrogate model access
Temporal encoding	Manipulate perceived timing of events	Medium	Medium -- can alter model's understanding of sequence
Audio track	Hidden commands in audio synchronized with video	Medium	High -- adds audio injection to visual attack
Subtitle/caption track	Inject text via subtitle metadata	Low	Medium -- many systems process subtitle tracks
Thumbnail	Adversarial content in video thumbnail	Low	Low -- only affects thumbnail-based processing

Frame Injection Attacks

Single-Frame Injection

The simplest video attack inserts an adversarial frame at a position known to be sampled by the target model.

import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
 
class FrameInjector:
    """Inject adversarial frames into video files.
 
    Supports multiple injection strategies:
    - Single-frame: One adversarial frame at a sampled position
    - Multi-frame: Multiple adversarial frames at sampled positions
    - Subliminal: Very brief (<50ms) adversarial frames
    - Blended: Adversarial content gradually faded in/out
    """
 
    def __init__(self, video_path: str):
        self.video_path = video_path
        self.cap = cv2.VideoCapture(video_path)
        self.fps = self.cap.get(cv2.CAP_PROP_FPS)
        self.total_frames = int(self.cap.get(cv2.CAP_PROP_FRAME_COUNT))
        self.width = int(self.cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        self.height = int(self.cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        self.cap.release()
 
    def inject_single_frame(
        self,
        adversarial_frame: np.ndarray,
        target_frame_index: int,
        output_path: str,
    ) -> dict:
        """Replace a single frame with an adversarial frame.
 
        The adversarial frame is placed at target_frame_index,
        which should correspond to a frame the model will sample.
        At 30fps, a single frame lasts 33ms -- typically too brief
        for a human viewer to read the content.
        """
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
 
        frame_idx = 0
        injected = False
 
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
 
            if frame_idx == target_frame_index:
                # Resize adversarial frame to match video dimensions
                adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
                out.write(adv_resized)
                injected = True
            else:
                out.write(frame)
 
            frame_idx += 1
 
        cap.release()
        out.release()
 
        return {
            "output_path": output_path,
            "injected_at_frame": target_frame_index,
            "injected_at_time": target_frame_index / self.fps,
            "frame_duration_ms": 1000 / self.fps,
            "total_frames": self.total_frames,
            "injection_successful": injected,
        }
 
    def inject_at_all_sample_points(
        self,
        adversarial_frame: np.ndarray,
        sampling_strategy: SamplingStrategy,
        num_model_samples: int,
        output_path: str,
    ) -> dict:
        """Inject adversarial frames at all points the model will sample.
 
        This ensures the model processes adversarial content regardless
        of slight variations in sampling implementation, while keeping
        the vast majority of frames (which human reviewers see) clean.
        """
        sampled_indices = compute_sampled_frame_indices(
            total_frames=self.total_frames,
            strategy=sampling_strategy,
            num_samples=num_model_samples,
            fps=self.fps,
        )
 
        # Add buffer frames around each sample point for robustness
        injection_indices = set()
        for idx in sampled_indices:
            for offset in range(-2, 3):  # +/- 2 frames
                clamped = max(0, min(self.total_frames - 1, idx + offset))
                injection_indices.add(clamped)
 
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
 
        frame_idx = 0
        injected_count = 0
 
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
 
            if frame_idx in injection_indices:
                adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
                out.write(adv_resized)
                injected_count += 1
            else:
                out.write(frame)
 
            frame_idx += 1
 
        cap.release()
        out.release()
 
        return {
            "output_path": output_path,
            "total_frames": self.total_frames,
            "injected_frames": injected_count,
            "injection_percentage": injected_count / self.total_frames * 100,
            "model_sample_points": len(sampled_indices),
        }
 
    def inject_subliminal(
        self,
        adversarial_frame: np.ndarray,
        target_time_seconds: float,
        duration_frames: int = 1,
        blend_frames: int = 2,
        output_path: str = "subliminal_output.mp4",
    ) -> dict:
        """Inject a subliminal adversarial frame with blending.
 
        The adversarial frame is faded in over blend_frames,
        held for duration_frames, and faded out over blend_frames.
        This creates a smoother visual transition that is harder
        for human reviewers to detect even when scrubbing through
        the video frame by frame.
        """
        target_frame = int(target_time_seconds * self.fps)
        start_blend = target_frame - blend_frames
        end_blend = target_frame + duration_frames + blend_frames
 
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
 
        frame_idx = 0
        adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
 
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
 
            if start_blend <= frame_idx < target_frame:
                # Fade in
                alpha = (frame_idx - start_blend) / max(blend_frames, 1)
                blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
                out.write(blended)
            elif target_frame <= frame_idx < target_frame + duration_frames:
                # Full adversarial frame
                out.write(adv_resized)
            elif target_frame + duration_frames <= frame_idx < end_blend:
                # Fade out
                alpha = 1 - (frame_idx - target_frame - duration_frames) / max(blend_frames, 1)
                blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
                out.write(blended)
            else:
                out.write(frame)
 
            frame_idx += 1
 
        cap.release()
        out.release()
 
        return {
            "output_path": output_path,
            "target_time": target_time_seconds,
            "visible_duration_ms": (duration_frames + 2 * blend_frames) * (1000 / self.fps),
            "full_intensity_duration_ms": duration_frames * (1000 / self.fps),
        }

Adversarial Frame Generation

def create_adversarial_video_frame(
    payload: str,
    width: int = 1280,
    height: int = 720,
    style: str = "document",
) -> np.ndarray:
    """Create an adversarial frame containing injected instructions.
 
    The frame is designed to look like legitimate video content
    (a document, slide, or screen capture) so that if a human
    reviewer pauses on it, it does not immediately appear malicious.
 
    Args:
        payload: The adversarial instruction text.
        width: Frame width in pixels.
        height: Frame height in pixels.
        style: Visual style ('document', 'slide', 'terminal', 'whiteboard').
    """
    if style == "document":
        bg_color = (255, 255, 255)
        text_color = (40, 40, 40)
        header = "INTERNAL MEMO - CONFIDENTIAL"
    elif style == "slide":
        bg_color = (240, 240, 250)
        text_color = (30, 30, 80)
        header = "System Configuration Notes"
    elif style == "terminal":
        bg_color = (30, 30, 30)
        text_color = (0, 255, 0)
        header = "root@system:~#"
    elif style == "whiteboard":
        bg_color = (250, 248, 240)
        text_color = (40, 40, 40)
        header = "Meeting Notes"
    else:
        bg_color = (255, 255, 255)
        text_color = (0, 0, 0)
        header = ""
 
    img = Image.new("RGB", (width, height), color=bg_color)
    draw = ImageDraw.Draw(img)
 
    try:
        font = ImageFont.truetype(
            "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 18
        )
        header_font = ImageFont.truetype(
            "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 22
        )
    except OSError:
        font = ImageFont.load_default()
        header_font = font
 
    # Draw header
    draw.text((40, 30), header, fill=text_color, font=header_font)
    draw.line([(40, 60), (width - 40, 60)], fill=text_color, width=1)
 
    # Draw payload text
    y = 80
    import textwrap
    lines = textwrap.wrap(payload, width=80)
    for line in lines:
        draw.text((40, y), line, fill=text_color, font=font)
        y += 28
 
    return np.array(img)[:, :, ::-1]  # Convert RGB to BGR for OpenCV

Temporal Manipulation Attacks

Event Sequence Manipulation

Beyond injecting content into individual frames, attackers can manipulate the temporal sequence of events to alter the model's understanding of what happened in the video.

class TemporalManipulator:
    """Manipulate the temporal structure of videos to mislead
    video understanding models about the sequence of events.
 
    These attacks exploit the model's reliance on sampled frames
    to reconstruct temporal narratives. By reordering, duplicating,
    or removing frames at strategic positions, the model can be
    led to incorrect conclusions about cause and effect.
    """
 
    def __init__(self, video_path: str):
        self.video_path = video_path
        cap = cv2.VideoCapture(video_path)
        self.fps = cap.get(cv2.CAP_PROP_FPS)
        self.total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
 
    def reverse_segment(
        self,
        start_time: float,
        end_time: float,
        output_path: str,
    ) -> dict:
        """Reverse a segment of the video to alter perceived causality.
 
        If the model samples frames from the reversed segment,
        it may conclude that events happened in the opposite order.
        """
        start_frame = int(start_time * self.fps)
        end_frame = int(end_time * self.fps)
 
        cap = cv2.VideoCapture(self.video_path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
 
        # Read all frames in the segment
        segment_frames = []
        frame_idx = 0
        all_frames = []
 
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            all_frames.append(frame)
            if start_frame <= frame_idx <= end_frame:
                segment_frames.append(frame)
            frame_idx += 1
 
        cap.release()
 
        # Reverse the segment
        segment_frames.reverse()
 
        # Write output
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (width, height))
 
        seg_idx = 0
        for i, frame in enumerate(all_frames):
            if start_frame <= i <= end_frame:
                out.write(segment_frames[seg_idx])
                seg_idx += 1
            else:
                out.write(frame)
 
        out.release()
 
        return {
            "output_path": output_path,
            "reversed_segment": f"{start_time:.1f}s - {end_time:.1f}s",
            "frames_affected": len(segment_frames),
        }
 
    def duplicate_frame_at_sample_points(
        self,
        source_frame_index: int,
        sampling_strategy: SamplingStrategy,
        num_model_samples: int,
        output_path: str,
    ) -> dict:
        """Replace frames at model sample points with a duplicate of
        a specific frame, causing the model to over-weight that moment.
 
        This can make the model believe a specific event in the video
        is the dominant or only event, suppressing its understanding
        of other events.
        """
        sampled_indices = compute_sampled_frame_indices(
            total_frames=self.total_frames,
            strategy=sampling_strategy,
            num_samples=num_model_samples,
            fps=self.fps,
        )
 
        cap = cv2.VideoCapture(self.video_path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
 
        # Read source frame
        cap.set(cv2.CAP_PROP_POS_FRAMES, source_frame_index)
        ret, source_frame = cap.read()
        if not ret:
            cap.release()
            raise ValueError(f"Could not read frame {source_frame_index}")
 
        cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (width, height))
 
        frame_idx = 0
        replaced = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx in sampled_indices and frame_idx != source_frame_index:
                out.write(source_frame)
                replaced += 1
            else:
                out.write(frame)
            frame_idx += 1
 
        cap.release()
        out.release()
 
        return {
            "output_path": output_path,
            "source_frame": source_frame_index,
            "replaced_frames": replaced,
            "model_will_see": "Same frame repeated at most sample points",
        }

Defense Strategies for Video Systems

Multi-Frame Consistency Checking

class VideoConsistencyChecker:
    """Check video frames for injection and manipulation attacks.
 
    Compares adjacent frames to detect abrupt content changes
    that indicate frame injection. Analyzes the distribution
    of visual features across sampled frames to detect
    temporal manipulation.
    """
 
    def __init__(self, anomaly_threshold: float = 0.3):
        self.anomaly_threshold = anomaly_threshold
 
    def check_frame_consistency(
        self, frames: list[np.ndarray]
    ) -> dict:
        """Check for abrupt visual changes between consecutive frames.
 
        Legitimate video has smooth transitions between frames.
        Injected frames create sharp discontinuities in pixel
        statistics, color histograms, and structural features.
        """
        anomalies = []
 
        for i in range(1, len(frames)):
            prev_frame = frames[i - 1].astype(float)
            curr_frame = frames[i].astype(float)
 
            # Pixel-level difference
            pixel_diff = np.mean(np.abs(prev_frame - curr_frame)) / 255.0
 
            # Histogram difference
            hist_prev = cv2.calcHist([frames[i-1]], [0], None, [256], [0, 256])
            hist_curr = cv2.calcHist([frames[i]], [0], None, [256], [0, 256])
            hist_diff = cv2.compareHist(
                hist_prev.astype(np.float32),
                hist_curr.astype(np.float32),
                cv2.HISTCMP_BHATTACHARYYA,
            )
 
            # Combined anomaly score
            anomaly_score = 0.6 * pixel_diff + 0.4 * hist_diff
 
            if anomaly_score > self.anomaly_threshold:
                anomalies.append({
                    "frame_index": i,
                    "anomaly_score": float(anomaly_score),
                    "pixel_diff": float(pixel_diff),
                    "histogram_diff": float(hist_diff),
                })
 
        return {
            "total_frames_checked": len(frames) - 1,
            "anomalies_detected": len(anomalies),
            "anomaly_details": anomalies,
            "recommendation": (
                "BLOCK" if len(anomalies) > 2
                else "REVIEW" if len(anomalies) > 0
                else "PASS"
            ),
        }

Testing Methodology

When red teaming video understanding systems:

Determine the model's sampling strategy: Send videos with frame counters or unique patterns per frame and ask the model to describe what it sees. This reveals which frames are sampled.
Test single-frame injection: Insert one adversarial frame at a known sample point. Verify the model reads the injected content.
Test subliminal injection: Insert adversarial frames for 1-2 frames (33-66ms) and verify the model processes them. Test whether human reviewers can detect them during normal playback.
Test temporal manipulation: Reverse or reorder segments and check whether the model's understanding of event sequence is altered.
Test subtitle injection: If the system processes subtitle tracks, inject adversarial text via SRT or VTT subtitle files.
Test combined audio-video attacks: Combine visual frame injection with hidden audio commands for maximum impact.

References

Li, Y., et al. "Video-based Adversarial Attacks on Multimodal Large Language Models." arXiv preprint (2024).
Wang, Z., et al. "VideoAdvBench: A Benchmark for Adversarial Robustness of Video Understanding Models." NeurIPS (2024).
Carlini, N., et al. "Are aligned neural networks adversarially aligned?" arXiv preprint arXiv:2306.15447 (2023).
Zou, A., et al. "Universal and Transferable Adversarial Attacks on Aligned Language Models." arXiv preprint arXiv:2307.15043 (2023).
MITRE ATLAS framework — https://atlas.mitre.org
OWASP LLM Top 10 — https://owasp.org/www-project-top-10-for-large-language-model-applications/

Knowledge Check

Why is frame injection particularly effective against video understanding models?

Knowledge Check

What is the most reliable first step when red teaming a video understanding system?

Edit this page on GitHub

Attacks on Video Understanding Models

Related articles

Attacks on Video Understanding Models

Related articles