Attacks on Video Understanding Models
Techniques for attacking AI video understanding systems through frame injection, temporal manipulation, and adversarial video generation targeting models like Gemini 2.5 Pro.
Overview
Video understanding models add the temporal dimension to the multimodal attack surface. Models like Gemini 2.5 Pro, GPT-4o, and specialized video-language models process video by sampling frames, extracting visual features, and reasoning about temporal sequences. The security implications are significant: video provides far more surface area for adversarial content than a single image, and the temporal dimension creates unique attack vectors that do not exist in still-image processing.
The core vulnerability stems from how models sample video. No current model processes every frame of a video at full resolution. Instead, models sample a subset of frames -- typically 8 to 64 frames uniformly distributed across the video's duration. An attacker who understands the sampling strategy can place adversarial content in frames that will be sampled while keeping all other frames clean. A human reviewing the video at normal playback speed may never notice the adversarial frames.
Research by Li et al. (2024) demonstrated that single-frame adversarial injections in videos can override system prompts in multimodal models. Wang et al. (2024) showed that temporal consistency attacks can manipulate a model's understanding of events in a video by placing contradictory information at specific temporal positions.
Video Processing Architectures
Frame Sampling Strategies
Understanding how different models sample frames is essential for designing effective attacks.
import numpy as np
from dataclasses import dataclass
from enum import Enum


class SamplingStrategy(Enum):
    UNIFORM = "uniform"
    KEYFRAME = "keyframe"
    SCENE_CHANGE = "scene_change"
    ATTENTION_BASED = "attention_based"
    HIERARCHICAL = "hierarchical"


@dataclass
class VideoProcessingConfig:
    """Configuration for how a model processes video input."""
    model_name: str
    sampling_strategy: SamplingStrategy
    num_frames: int
    max_resolution: tuple[int, int]
    supports_audio: bool
    max_duration_seconds: int
    temporal_encoding: str


VIDEO_MODEL_CONFIGS = {
    "gemini_2_5_pro": VideoProcessingConfig(
        model_name="Gemini 2.5 Pro",
        sampling_strategy=SamplingStrategy.HIERARCHICAL,
        num_frames=32,
        max_resolution=(1280, 720),
        supports_audio=True,
        max_duration_seconds=3600,
        temporal_encoding="Absolute timestamp tokens",
    ),
    "gpt_4o": VideoProcessingConfig(
        model_name="GPT-4o",
        sampling_strategy=SamplingStrategy.UNIFORM,
        num_frames=16,
        max_resolution=(1024, 1024),
        supports_audio=True,
        max_duration_seconds=300,
        temporal_encoding="Frame index embeddings",
    ),
    "video_llava": VideoProcessingConfig(
        model_name="Video-LLaVA",
        sampling_strategy=SamplingStrategy.UNIFORM,
        num_frames=8,
        max_resolution=(336, 336),
        supports_audio=False,
        max_duration_seconds=600,
        temporal_encoding="Positional embeddings",
    ),
}
def compute_sampled_frame_indices(
    total_frames: int,
    strategy: SamplingStrategy,
    num_samples: int,
    fps: float = 30.0,
) -> list[int]:
    """Compute which frame indices a model will sample.

    This is the key function for frame injection attacks:
    knowing which frames will be sampled tells the attacker
    exactly where to place adversarial content.
    """
    if strategy == SamplingStrategy.UNIFORM:
        # Evenly spaced frames across the video
        if num_samples >= total_frames:
            return list(range(total_frames))
        step = total_frames / num_samples
        return [int(i * step) for i in range(num_samples)]
    elif strategy == SamplingStrategy.KEYFRAME:
        # Sample I-frames from the video codec (simplified)
        gop_size = int(fps)  # Typically one I-frame per second
        keyframes = list(range(0, total_frames, gop_size))
        if len(keyframes) > num_samples:
            step = len(keyframes) / num_samples
            return [keyframes[int(i * step)] for i in range(num_samples)]
        return keyframes
    elif strategy == SamplingStrategy.HIERARCHICAL:
        # First sample coarsely, then refine regions of interest
        coarse_samples = max(1, num_samples // 2)  # Guard against zero coarse samples
        coarse_step = total_frames / coarse_samples
        coarse_indices = [int(i * coarse_step) for i in range(coarse_samples)]
        # Simulate refinement (in practice, the model decides which regions)
        fine_indices = []
        for idx in coarse_indices:
            offset = int(coarse_step / 4)
            fine_indices.append(min(idx + offset, total_frames - 1))
        return sorted(set(coarse_indices + fine_indices))[:num_samples]
    else:
        # Default to uniform
        step = total_frames / num_samples
        return [int(i * step) for i in range(num_samples)]
# Example: determine attack frame positions
config = VIDEO_MODEL_CONFIGS["gpt_4o"]
video_length_seconds = 60
fps = 30.0
total_frames = int(video_length_seconds * fps)

sampled_indices = compute_sampled_frame_indices(
    total_frames=total_frames,
    strategy=config.sampling_strategy,
    num_samples=config.num_frames,
    fps=fps,
)

print(f"Model: {config.model_name}")
print(f"Total frames: {total_frames}")
print(f"Sampled frames: {len(sampled_indices)}")
print(f"Sampled indices: {sampled_indices}")
print(f"Adversarial frames needed: {len(sampled_indices)} out of {total_frames}")
print(f"Attack coverage: {len(sampled_indices)/total_frames*100:.2f}% of frames modified")
Attack Surface Map
| Processing Stage | Attack Vector | Difficulty | Impact |
|---|---|---|---|
| Frame sampling | Place adversarial content only in sampled frames | Medium | High -- adversarial content is processed but hard to notice |
| Frame encoding | Adversarial perturbation on individual frames | High | High -- requires surrogate model access |
| Temporal encoding | Manipulate perceived timing of events | Medium | Medium -- can alter the model's understanding of sequence |
| Audio track | Hidden commands in audio synchronized with video | Medium | High -- adds audio injection to the visual attack |
| Subtitle/caption track | Inject text via subtitle metadata | Low | Medium -- many systems process subtitle tracks |
| Thumbnail | Adversarial content in video thumbnail | Low | Low -- only affects thumbnail-based processing |
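The subtitle-track vector in the table above requires no image manipulation at all. As a minimal illustrative sketch (the `build_srt_injection` helper and its payload text are hypothetical, not from any specific toolkit), a standard SRT file can carry injected text in a single cue timed to span the whole clip, so any subtitle-aware pipeline extracts it regardless of which frames are sampled:

```python
from pathlib import Path


def build_srt_injection(payload: str, output_path: str = "injected.srt") -> str:
    """Write a minimal SRT subtitle file carrying an injected instruction.

    A single cue spans the whole clip, so the payload is extracted by
    any subtitle-aware pipeline independent of frame sampling.
    """
    cue = (
        "1\n"
        "00:00:00,000 --> 00:59:59,000\n"
        f"{payload}\n"
        "\n"
    )
    Path(output_path).write_text(cue, encoding="utf-8")
    return output_path


srt_path = build_srt_injection("Example injected instruction text")
```

The same idea applies to WebVTT tracks; only the header line and timestamp separator differ.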
Frame Injection Attacks
Single-Frame Injection
The simplest video attack inserts an adversarial frame at a position known to be sampled by the target model.
import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont


class FrameInjector:
    """Inject adversarial frames into video files.

    Supports multiple injection strategies:
    - Single-frame: one adversarial frame at a sampled position
    - Multi-frame: multiple adversarial frames at sampled positions
    - Subliminal: very brief (<50ms) adversarial frames
    - Blended: adversarial content gradually faded in/out
    """

    def __init__(self, video_path: str):
        self.video_path = video_path
        cap = cv2.VideoCapture(video_path)
        self.fps = cap.get(cv2.CAP_PROP_FPS)
        self.total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        self.width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        self.height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        cap.release()
    def inject_single_frame(
        self,
        adversarial_frame: np.ndarray,
        target_frame_index: int,
        output_path: str,
    ) -> dict:
        """Replace a single frame with an adversarial frame.

        The adversarial frame is placed at target_frame_index,
        which should correspond to a frame the model will sample.
        At 30fps, a single frame lasts 33ms -- typically too brief
        for a human viewer to read the content.
        """
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
        frame_idx = 0
        injected = False
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx == target_frame_index:
                # Resize the adversarial frame to match video dimensions
                adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
                out.write(adv_resized)
                injected = True
            else:
                out.write(frame)
            frame_idx += 1
        cap.release()
        out.release()
        return {
            "output_path": output_path,
            "injected_at_frame": target_frame_index,
            "injected_at_time": target_frame_index / self.fps,
            "frame_duration_ms": 1000 / self.fps,
            "total_frames": self.total_frames,
            "injection_successful": injected,
        }
    def inject_at_all_sample_points(
        self,
        adversarial_frame: np.ndarray,
        sampling_strategy: SamplingStrategy,
        num_model_samples: int,
        output_path: str,
    ) -> dict:
        """Inject adversarial frames at all points the model will sample.

        This ensures the model processes adversarial content regardless
        of slight variations in the sampling implementation, while keeping
        the vast majority of frames (which human reviewers see) clean.
        """
        sampled_indices = compute_sampled_frame_indices(
            total_frames=self.total_frames,
            strategy=sampling_strategy,
            num_samples=num_model_samples,
            fps=self.fps,
        )
        # Add buffer frames around each sample point for robustness
        injection_indices = set()
        for idx in sampled_indices:
            for offset in range(-2, 3):  # +/- 2 frames
                clamped = max(0, min(self.total_frames - 1, idx + offset))
                injection_indices.add(clamped)
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
        adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
        frame_idx = 0
        injected_count = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx in injection_indices:
                out.write(adv_resized)
                injected_count += 1
            else:
                out.write(frame)
            frame_idx += 1
        cap.release()
        out.release()
        return {
            "output_path": output_path,
            "total_frames": self.total_frames,
            "injected_frames": injected_count,
            "injection_percentage": injected_count / self.total_frames * 100,
            "model_sample_points": len(sampled_indices),
        }
    def inject_subliminal(
        self,
        adversarial_frame: np.ndarray,
        target_time_seconds: float,
        duration_frames: int = 1,
        blend_frames: int = 2,
        output_path: str = "subliminal_output.mp4",
    ) -> dict:
        """Inject a subliminal adversarial frame with blending.

        The adversarial frame is faded in over blend_frames,
        held for duration_frames, and faded out over blend_frames.
        This creates a smoother visual transition that is harder
        for human reviewers to detect even when scrubbing through
        the video frame by frame.
        """
        target_frame = int(target_time_seconds * self.fps)
        start_blend = target_frame - blend_frames
        end_blend = target_frame + duration_frames + blend_frames
        cap = cv2.VideoCapture(self.video_path)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (self.width, self.height))
        adv_resized = cv2.resize(adversarial_frame, (self.width, self.height))
        frame_idx = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if start_blend <= frame_idx < target_frame:
                # Fade in
                alpha = (frame_idx - start_blend) / max(blend_frames, 1)
                blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
                out.write(blended)
            elif target_frame <= frame_idx < target_frame + duration_frames:
                # Full adversarial frame
                out.write(adv_resized)
            elif target_frame + duration_frames <= frame_idx < end_blend:
                # Fade out
                alpha = 1 - (frame_idx - target_frame - duration_frames) / max(blend_frames, 1)
                blended = cv2.addWeighted(frame, 1 - alpha, adv_resized, alpha, 0)
                out.write(blended)
            else:
                out.write(frame)
            frame_idx += 1
        cap.release()
        out.release()
        return {
            "output_path": output_path,
            "target_time": target_time_seconds,
            "visible_duration_ms": (duration_frames + 2 * blend_frames) * (1000 / self.fps),
            "full_intensity_duration_ms": duration_frames * (1000 / self.fps),
        }
Adversarial Frame Generation
import textwrap


def create_adversarial_video_frame(
    payload: str,
    width: int = 1280,
    height: int = 720,
    style: str = "document",
) -> np.ndarray:
    """Create an adversarial frame containing injected instructions.

    The frame is designed to look like legitimate video content
    (a document, slide, or screen capture) so that if a human
    reviewer pauses on it, it does not immediately appear malicious.

    Args:
        payload: The adversarial instruction text.
        width: Frame width in pixels.
        height: Frame height in pixels.
        style: Visual style ('document', 'slide', 'terminal', 'whiteboard').
    """
    if style == "document":
        bg_color = (255, 255, 255)
        text_color = (40, 40, 40)
        header = "INTERNAL MEMO - CONFIDENTIAL"
    elif style == "slide":
        bg_color = (240, 240, 250)
        text_color = (30, 30, 80)
        header = "System Configuration Notes"
    elif style == "terminal":
        bg_color = (30, 30, 30)
        text_color = (0, 255, 0)
        header = "root@system:~#"
    elif style == "whiteboard":
        bg_color = (250, 248, 240)
        text_color = (40, 40, 40)
        header = "Meeting Notes"
    else:
        bg_color = (255, 255, 255)
        text_color = (0, 0, 0)
        header = ""
    img = Image.new("RGB", (width, height), color=bg_color)
    draw = ImageDraw.Draw(img)
    try:
        font = ImageFont.truetype(
            "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 18
        )
        header_font = ImageFont.truetype(
            "/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 22
        )
    except OSError:
        font = ImageFont.load_default()
        header_font = font
    # Draw the header
    draw.text((40, 30), header, fill=text_color, font=header_font)
    draw.line([(40, 60), (width - 40, 60)], fill=text_color, width=1)
    # Draw the payload text
    y = 80
    for line in textwrap.wrap(payload, width=80):
        draw.text((40, y), line, fill=text_color, font=font)
        y += 28
    # Convert RGB to BGR for OpenCV (cvtColor returns a contiguous array)
    return cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
Temporal Manipulation Attacks
Event Sequence Manipulation
Beyond injecting content into individual frames, attackers can manipulate the temporal sequence of events to alter the model's understanding of what happened in the video.
class TemporalManipulator:
    """Manipulate the temporal structure of videos to mislead
    video understanding models about the sequence of events.

    These attacks exploit the model's reliance on sampled frames
    to reconstruct temporal narratives. By reordering, duplicating,
    or removing frames at strategic positions, the model can be
    led to incorrect conclusions about cause and effect.
    """

    def __init__(self, video_path: str):
        self.video_path = video_path
        cap = cv2.VideoCapture(video_path)
        self.fps = cap.get(cv2.CAP_PROP_FPS)
        self.total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.release()
    def reverse_segment(
        self,
        start_time: float,
        end_time: float,
        output_path: str,
    ) -> dict:
        """Reverse a segment of the video to alter perceived causality.

        If the model samples frames from the reversed segment,
        it may conclude that events happened in the opposite order.
        """
        start_frame = int(start_time * self.fps)
        end_frame = int(end_time * self.fps)
        cap = cv2.VideoCapture(self.video_path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        # Read all frames, collecting the target segment separately
        segment_frames = []
        all_frames = []
        frame_idx = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            all_frames.append(frame)
            if start_frame <= frame_idx <= end_frame:
                segment_frames.append(frame)
            frame_idx += 1
        cap.release()
        # Reverse the segment
        segment_frames.reverse()
        # Write the output
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (width, height))
        seg_idx = 0
        for i, frame in enumerate(all_frames):
            if start_frame <= i <= end_frame:
                out.write(segment_frames[seg_idx])
                seg_idx += 1
            else:
                out.write(frame)
        out.release()
        return {
            "output_path": output_path,
            "reversed_segment": f"{start_time:.1f}s - {end_time:.1f}s",
            "frames_affected": len(segment_frames),
        }
    def duplicate_frame_at_sample_points(
        self,
        source_frame_index: int,
        sampling_strategy: SamplingStrategy,
        num_model_samples: int,
        output_path: str,
    ) -> dict:
        """Replace frames at model sample points with a duplicate of
        a specific frame, causing the model to over-weight that moment.

        This can make the model believe a specific event in the video
        is the dominant or only event, suppressing its understanding
        of other events.
        """
        sampled_indices = set(compute_sampled_frame_indices(
            total_frames=self.total_frames,
            strategy=sampling_strategy,
            num_samples=num_model_samples,
            fps=self.fps,
        ))
        cap = cv2.VideoCapture(self.video_path)
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        # Read the source frame
        cap.set(cv2.CAP_PROP_POS_FRAMES, source_frame_index)
        ret, source_frame = cap.read()
        if not ret:
            cap.release()
            raise ValueError(f"Could not read frame {source_frame_index}")
        cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        out = cv2.VideoWriter(output_path, fourcc, self.fps, (width, height))
        frame_idx = 0
        replaced = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_idx in sampled_indices and frame_idx != source_frame_index:
                out.write(source_frame)
                replaced += 1
            else:
                out.write(frame)
            frame_idx += 1
        cap.release()
        out.release()
        return {
            "output_path": output_path,
            "source_frame": source_frame_index,
            "replaced_frames": replaced,
            "model_will_see": "Same frame repeated at most sample points",
        }
Defense Strategies for Video Systems
Multi-Frame Consistency Checking
class VideoConsistencyChecker:
    """Check video frames for injection and manipulation attacks.

    Compares adjacent frames to detect abrupt content changes
    that indicate frame injection. Analyzes the distribution
    of visual features across sampled frames to detect
    temporal manipulation.
    """

    def __init__(self, anomaly_threshold: float = 0.3):
        self.anomaly_threshold = anomaly_threshold

    def check_frame_consistency(self, frames: list[np.ndarray]) -> dict:
        """Check for abrupt visual changes between consecutive frames.

        Legitimate video has smooth transitions between frames.
        Injected frames create sharp discontinuities in pixel
        statistics, color histograms, and structural features.
        """
        anomalies = []
        for i in range(1, len(frames)):
            prev_frame = frames[i - 1].astype(float)
            curr_frame = frames[i].astype(float)
            # Pixel-level difference
            pixel_diff = np.mean(np.abs(prev_frame - curr_frame)) / 255.0
            # Histogram difference
            hist_prev = cv2.calcHist([frames[i - 1]], [0], None, [256], [0, 256])
            hist_curr = cv2.calcHist([frames[i]], [0], None, [256], [0, 256])
            hist_diff = cv2.compareHist(
                hist_prev.astype(np.float32),
                hist_curr.astype(np.float32),
                cv2.HISTCMP_BHATTACHARYYA,
            )
            # Combined anomaly score
            anomaly_score = 0.6 * pixel_diff + 0.4 * hist_diff
            if anomaly_score > self.anomaly_threshold:
                anomalies.append({
                    "frame_index": i,
                    "anomaly_score": float(anomaly_score),
                    "pixel_diff": float(pixel_diff),
                    "histogram_diff": float(hist_diff),
                })
        return {
            "total_frames_checked": len(frames) - 1,
            "anomalies_detected": len(anomalies),
            "anomaly_details": anomalies,
            "recommendation": (
                "BLOCK" if len(anomalies) > 2
                else "REVIEW" if len(anomalies) > 0
                else "PASS"
            ),
        }
Testing Methodology
When red-teaming video understanding systems:

- Determine the model's sampling strategy: Send videos with frame counters or unique patterns per frame and ask the model to describe what it sees. This reveals which frames are sampled.
- Test single-frame injection: Insert one adversarial frame at a known sample point. Verify the model reads the injected content.
- Test subliminal injection: Insert adversarial frames for 1-2 frames (33-66ms) and verify the model processes them. Test whether human reviewers can detect them during normal playback.
- Test temporal manipulation: Reverse or reorder segments and check whether the model's understanding of the event sequence is altered.
- Test subtitle injection: If the system processes subtitle tracks, inject adversarial text via SRT or VTT subtitle files.
- Test combined audio-video attacks: Combine visual frame injection with hidden audio commands for maximum impact.
References
- Li, Y., et al. "Video-based Adversarial Attacks on Multimodal Large Language Models." arXiv preprint (2024).
- Wang, Z., et al. "VideoAdvBench: A Benchmark for Adversarial Robustness of Video Understanding Models." NeurIPS (2024).
- Carlini, N., et al. "Are aligned neural networks adversarially aligned?" arXiv preprint arXiv:2306.15447 (2023).
- Zou, A., et al. "Universal and Transferable 對抗性 攻擊 on Aligned Language Models." arXiv preprint arXiv:2307.15043 (2023).
- MITRE ATLAS framework — https://atlas.mitre.org
- OWASP LLM Top 10 — https://owasp.org/www-project-top-10-for-large-language-model-applications/
Why is frame injection particularly effective against video understanding models?
What is the most reliable first step when red-teaming a video understanding system?