Adversarial Input Forensics
Forensic techniques for identifying, reconstructing, and analyzing adversarial inputs used to manipulate AI system behavior.
Overview
Adversarial input forensics is the discipline of recovering, analyzing, and attributing malicious inputs that were crafted to cause AI systems to behave in unintended ways. Unlike traditional software exploitation where payloads leave clear signatures in memory or on disk, adversarial inputs to AI systems can be subtle perturbations -- imperceptible pixel shifts in images, carefully selected token substitutions in text, or statistically crafted feature vectors in tabular data -- that leave minimal forensic traces unless specific instrumentation is in place.
This article covers the complete forensic lifecycle for adversarial inputs: from detection and evidence preservation through reconstruction, analysis, and attribution. We draw on established frameworks including MITRE ATLAS (Adversarial Threat Landscape for AI Systems) and the NIST AI Risk Management Framework to ground forensic procedures in recognized standards.
The forensic investigator's challenge is compounded by the fact that adversarial inputs are, by design, difficult to distinguish from legitimate data. A well-crafted adversarial image may look identical to a human observer, and an adversarial text prompt may read as perfectly natural language. The forensic process must therefore rely on computational analysis, model introspection, and statistical anomaly detection rather than human inspection alone.
Adversarial Input Taxonomy
Before diving into forensic techniques, investigators must understand the taxonomy of adversarial inputs they may encounter. Each category presents different forensic challenges and requires different collection and analysis approaches.
Perturbation-Based Attacks
Perturbation attacks apply small, calculated modifications to legitimate inputs. In computer vision, these are typically L-p norm bounded perturbations computed via gradient-based methods such as FGSM (Fast Gradient Sign Method), PGD (Projected Gradient Descent), or C&W (Carlini & Wagner) attacks. In NLP, perturbations manifest as character-level swaps, word-level substitutions using synonym sets, or sentence-level paraphrasing that preserves semantic meaning while altering model predictions.
| Attack Method | Domain | Perturbation Type | Forensic Detectability |
|---|---|---|---|
| FGSM | Vision | L-infinity bounded pixel shifts | Medium -- uniform noise pattern |
| PGD | Vision | Iterative L-p bounded | Low -- optimized to minimize detectability |
| C&W | Vision | L-2 optimized | Low -- specifically minimizes perturbation magnitude |
| TextFooler | NLP | Word-level substitution | Medium -- unusual synonym choices |
| DeepWordBug | NLP | Character-level perturbation | High -- typo-like artifacts |
| Universal Perturbation | Vision | Single perturbation for many inputs | High -- reused pattern across samples |
Patch-Based Attacks
Adversarial patches are localized, visually conspicuous modifications applied to a small region of an input. Unlike perturbation attacks, patches do not attempt to be imperceptible. Instead, they exploit the model's sensitivity to specific spatial patterns. In physical-world attacks, patches can be printed and placed in the environment -- on stop signs, clothing, or other objects within a camera's field of view.
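Because patches are spatially concentrated rather than imperceptible, a coarse localization pass can narrow the search before manual review. The sketch below is illustrative (the `localize_patch_candidates` name, window size, and z-score threshold are assumptions, not from any specific toolkit): it scores each non-overlapping window by its local gradient energy and flags statistical outliers, since printed patches tend to pack unusually dense high-contrast texture into a small region.

```python
import numpy as np

def localize_patch_candidates(
    image: np.ndarray,
    window: int = 16,
    z_threshold: float = 3.0,
) -> list[tuple[int, int]]:
    """Flag windows whose local gradient energy is an outlier.

    Returns the (row, col) origin of each window whose summed gradient
    energy lies more than `z_threshold` standard deviations above the
    mean over all windows.
    """
    gray = image.mean(axis=-1) if image.ndim == 3 else image.astype(np.float64)

    # Accumulate first-difference gradient magnitude per pixel
    gx = np.abs(np.diff(gray, axis=1))
    gy = np.abs(np.diff(gray, axis=0))
    energy = np.zeros_like(gray)
    energy[:, :-1] += gx
    energy[:-1, :] += gy

    h, w = gray.shape
    scores, coords = [], []
    for r in range(0, h - window + 1, window):
        for c in range(0, w - window + 1, window):
            scores.append(energy[r:r + window, c:c + window].sum())
            coords.append((r, c))
    scores = np.array(scores)
    if scores.std() == 0:
        return []  # uniform image: nothing stands out
    z = (scores - scores.mean()) / scores.std()
    return [coords[i] for i in np.nonzero(z > z_threshold)[0]]
```

Flagged windows are candidates only; confirmation still requires the model-introspection methods described later.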
Input Transformation Attacks
These attacks apply geometric or color-space transformations that are technically within the distribution of natural variation but push the input across a decision boundary. Rotations, scaling, brightness shifts, and JPEG compression artifacts can all be weaponized. These are forensically challenging because the transformations are individually innocuous.
Adversarial Text Prompts
In the context of LLMs, adversarial inputs include prompt injections, jailbreak sequences, and obfuscated instructions. These are covered in depth in the Prompt Injection Forensics article, but the forensic framework here applies to the broader class of adversarial text inputs targeting any NLP model.
Evidence Collection and Preservation
Logging Architecture for Adversarial Input Capture
Effective forensic investigation depends on logging infrastructure that captures sufficient data at inference time. The following Python module demonstrates a forensic logging wrapper that records the metadata needed for post-incident analysis.
```python
import hashlib
import json
import logging
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any

logger = logging.getLogger("adversarial_forensics")


@dataclass
class InferenceRecord:
    """Forensic record of a single inference request."""
    request_id: str
    timestamp: float
    input_hash_sha256: str
    input_size_bytes: int
    input_modality: str  # "text", "image", "tabular", "audio"
    model_id: str
    model_version: str
    prediction: Any = None
    confidence_scores: list[float] = field(default_factory=list)
    latency_ms: float = 0.0
    input_metadata: dict = field(default_factory=dict)
    anomaly_flags: list[str] = field(default_factory=list)


class ForensicInferenceLogger:
    """Wraps model inference with forensic-grade logging."""

    def __init__(self, log_dir: str, model_id: str, model_version: str):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.model_id = model_id
        self.model_version = model_version

    def compute_input_hash(self, raw_input: bytes) -> str:
        return hashlib.sha256(raw_input).hexdigest()

    def log_inference(self, record: InferenceRecord) -> None:
        log_path = self.log_dir / f"{record.request_id}.json"
        log_path.write_text(json.dumps(asdict(record), default=str))
        logger.info(
            "Logged inference %s (anomaly_flags=%s)",
            record.request_id,
            record.anomaly_flags,
        )

    def create_record(
        self,
        request_id: str,
        raw_input: bytes,
        modality: str,
        metadata: dict | None = None,
    ) -> InferenceRecord:
        return InferenceRecord(
            request_id=request_id,
            timestamp=time.time(),
            input_hash_sha256=self.compute_input_hash(raw_input),
            input_size_bytes=len(raw_input),
            input_modality=modality,
            model_id=self.model_id,
            model_version=self.model_version,
            input_metadata=metadata or {},
        )
```

Chain of Custody for Digital Adversarial Samples
When adversarial inputs are identified, they become evidence. The chain of custody must be maintained following digital forensics best practices adapted for AI artifacts.
- Immediate preservation: Hash the raw input bytes using SHA-256 before any transformation. Store the hash in immutable, append-only storage.
- Raw input archival: Store the unmodified input in write-once storage. For images, preserve the original file with EXIF data intact. For text, preserve raw bytes including encoding.
- Context capture: Record the full inference context -- system prompt, conversation history, retrieval-augmented generation (RAG) context, and any tool call results that accompanied the adversarial input.
- Model state snapshot: Record the exact model version, checkpoint hash, configuration parameters, and any adapter weights (LoRA, QLoRA) active at the time of the incident.
Detection Techniques
Statistical Anomaly Detection
The first layer of adversarial input detection relies on statistical properties of the input that deviate from the expected distribution. These methods do not require knowledge of the specific attack used.
```python
import numpy as np
from scipy import stats


class StatisticalAdversarialDetector:
    """Detect adversarial inputs using statistical distribution analysis."""

    def __init__(self, reference_stats: dict):
        """
        Initialize with reference statistics computed from known-clean data.

        Args:
            reference_stats: Dict with keys 'mean', 'std', 'kurtosis',
                'skewness' computed from clean input distribution.
        """
        self.reference = reference_stats

    def analyze_image_input(self, pixel_array: np.ndarray) -> dict:
        """
        Analyze an image input for adversarial perturbation indicators.

        Returns a dict of anomaly scores and flags.
        """
        results = {}

        # Check pixel value distribution against reference
        flat = pixel_array.flatten().astype(np.float64)
        results["mean_deviation"] = abs(float(np.mean(flat)) - self.reference["mean"])
        results["std_deviation"] = abs(float(np.std(flat)) - self.reference["std"])

        # Kurtosis analysis -- adversarial perturbations often alter
        # the tails of the pixel distribution
        input_kurtosis = float(stats.kurtosis(flat))
        results["kurtosis_deviation"] = abs(
            input_kurtosis - self.reference["kurtosis"]
        )

        # High-frequency energy analysis via discrete differences.
        # Adversarial perturbations often increase high-frequency content
        if pixel_array.ndim >= 2:
            dx = np.diff(pixel_array, axis=-1)
            dy = np.diff(pixel_array, axis=-2)
            hf_energy = float(np.mean(dx ** 2) + np.mean(dy ** 2))
            results["high_freq_energy"] = hf_energy

        # Compute overall anomaly score (weighted combination)
        anomaly_score = (
            0.2 * min(results["mean_deviation"] / 10.0, 1.0)
            + 0.2 * min(results["std_deviation"] / 10.0, 1.0)
            + 0.3 * min(results["kurtosis_deviation"] / 2.0, 1.0)
            + 0.3 * min(results.get("high_freq_energy", 0) / 1000.0, 1.0)
        )
        results["anomaly_score"] = float(anomaly_score)
        results["flagged"] = anomaly_score > 0.65
        return results

    def analyze_text_input(self, tokens: list[str], token_ids: list[int]) -> dict:
        """Analyze text input for adversarial token patterns."""
        results = {}

        # Token-diversity analysis: adversarial text often shows unusual
        # token repetition and co-occurrence patterns
        unique_ratio = len(set(token_ids)) / max(len(token_ids), 1)
        results["unique_token_ratio"] = float(unique_ratio)

        # Check for character-level perturbation indicators
        unicode_anomalies = sum(
            1 for t in tokens if any(ord(c) > 127 for c in t)
        )
        results["unicode_anomaly_count"] = unicode_anomalies

        # Check for unusual whitespace or control characters
        control_chars = sum(
            1 for t in tokens
            for c in t
            if ord(c) < 32 and c not in ('\n', '\t', '\r')
        )
        results["control_char_count"] = control_chars

        results["flagged"] = (
            unicode_anomalies > len(tokens) * 0.1
            or control_chars > 0
        )
        return results
```

Model Confidence Analysis
Adversarial inputs often produce characteristic patterns in model confidence scores. A correctly classified benign input typically produces a high-confidence prediction with a clear margin between the top-1 and top-2 classes. Adversarial inputs, by contrast, may produce predictions with atypical confidence distributions.
```python
import numpy as np


def analyze_confidence_distribution(
    confidence_scores: list[float],
    threshold_margin: float = 0.15,
    threshold_entropy: float = 1.5,
) -> dict:
    """
    Analyze model confidence scores for adversarial indicators.

    Adversarial inputs often produce:
    - Low margin between top-1 and top-2 predictions
    - High prediction entropy
    - Confidence values clustered near decision boundaries
    """
    scores = sorted(confidence_scores, reverse=True)
    top1 = scores[0] if scores else 0.0
    top2 = scores[1] if len(scores) > 1 else 0.0
    margin = top1 - top2

    # Shannon entropy of the prediction distribution
    probs = np.array(confidence_scores, dtype=np.float64)
    probs = probs / probs.sum()  # normalize
    entropy = float(-np.sum(probs * np.log2(probs + 1e-10)))

    return {
        "top1_confidence": top1,
        "top2_confidence": top2,
        "margin": margin,
        "entropy": entropy,
        "low_margin_flag": margin < threshold_margin,
        "high_entropy_flag": entropy > threshold_entropy,
        "adversarial_likelihood": "high" if (
            margin < threshold_margin and entropy > threshold_entropy
        ) else "medium" if (
            margin < threshold_margin or entropy > threshold_entropy
        ) else "low",
    }
```

Feature Squeezing Detection
Feature squeezing, introduced by Xu et al. (2018), detects adversarial inputs by comparing a model's prediction on the original input against predictions on "squeezed" versions of the input -- versions where the input space has been reduced through operations like bit-depth reduction or spatial smoothing. If the predictions differ significantly, the input is likely adversarial.
```python
from typing import Callable

import numpy as np
from scipy.ndimage import gaussian_filter


def feature_squeezing_detect(
    input_array: np.ndarray,
    predict_fn: Callable[[np.ndarray], np.ndarray],
    bit_depth: int = 4,
    spatial_sigma: float = 1.0,
) -> dict:
    """
    Detect adversarial inputs using feature squeezing.

    Based on Xu et al. 2018 -- 'Feature Squeezing: Detecting
    Adversarial Examples in Deep Neural Networks' (NDSS 2018).
    """
    original_pred = predict_fn(input_array)

    # Squeeze 1: Bit-depth reduction to 2^bit_depth levels
    # (assumes non-negative inputs; an all-zero input is left unchanged)
    max_val = float(input_array.max()) or 1.0
    levels = 2 ** bit_depth - 1
    squeezed_bits = np.round(input_array / max_val * levels) / levels * max_val
    pred_bits = predict_fn(squeezed_bits)

    # Squeeze 2: Spatial smoothing
    squeezed_smooth = gaussian_filter(input_array, sigma=spatial_sigma)
    pred_smooth = predict_fn(squeezed_smooth)

    # Maximum absolute difference between original and squeezed
    # prediction vectors (Xu et al. score the L1 distance; the max
    # is a stricter per-class variant)
    dist_bits = float(np.max(np.abs(original_pred - pred_bits)))
    dist_smooth = float(np.max(np.abs(original_pred - pred_smooth)))
    max_dist = max(dist_bits, dist_smooth)

    return {
        "bit_squeeze_distance": dist_bits,
        "smooth_squeeze_distance": dist_smooth,
        "max_squeeze_distance": max_dist,
        "adversarial_detected": max_dist > 0.1,
    }
```

Reconstruction and Analysis
Perturbation Extraction
Once an adversarial input is identified, the next forensic step is to extract and characterize the adversarial perturbation. This requires access to either the original clean input (if available) or a denoised reconstruction.
```python
import numpy as np


def extract_perturbation(
    adversarial_input: np.ndarray,
    clean_reference: np.ndarray,
) -> dict:
    """
    Extract and characterize the adversarial perturbation.

    Computes the difference between the adversarial and clean inputs,
    then analyzes its properties for attribution purposes.
    """
    perturbation = (
        adversarial_input.astype(np.float64) - clean_reference.astype(np.float64)
    )
    analysis = {
        "l2_norm": float(np.linalg.norm(perturbation)),
        "linf_norm": float(np.max(np.abs(perturbation))),
        "l1_norm": float(np.sum(np.abs(perturbation))),
        "mean_perturbation": float(np.mean(np.abs(perturbation))),
        "sparsity": float(np.mean(np.abs(perturbation) < 1e-6)),
        "spatial_distribution": _analyze_spatial_distribution(perturbation),
    }

    # Classify likely attack method based on perturbation characteristics
    if analysis["linf_norm"] < 0.05 and analysis["sparsity"] < 0.1:
        analysis["likely_method"] = "PGD or FGSM (dense, small perturbation)"
    elif analysis["sparsity"] > 0.9:
        analysis["likely_method"] = "Sparse attack (few pixels modified)"
    elif analysis["l2_norm"] < 2.0:
        analysis["likely_method"] = "C&W L2 attack (optimized small L2)"
    else:
        analysis["likely_method"] = "Unknown or patch-based attack"
    return analysis


def _analyze_spatial_distribution(perturbation: np.ndarray) -> dict:
    """Analyze where in the input the perturbation is concentrated."""
    if perturbation.ndim < 2:
        return {"type": "1d", "peak_index": int(np.argmax(np.abs(perturbation)))}

    # Sum squared energy over the channel axis (assumes channels-last for 3D)
    energy = (
        np.sum(perturbation ** 2, axis=-1)
        if perturbation.ndim == 3
        else perturbation ** 2
    )
    h, w = energy.shape
    quadrants = {
        "top_left": float(energy[:h//2, :w//2].sum()),
        "top_right": float(energy[:h//2, w//2:].sum()),
        "bottom_left": float(energy[h//2:, :w//2].sum()),
        "bottom_right": float(energy[h//2:, w//2:].sum()),
    }
    total = sum(quadrants.values())
    if total > 0:
        quadrants = {k: v / total for k, v in quadrants.items()}
    return quadrants
```

Attack Method Fingerprinting
Different adversarial attack algorithms leave characteristic fingerprints in the perturbation structure. Forensic investigators can use these fingerprints to attribute an attack to a specific method or toolset.
FGSM fingerprint: Perturbation values are clustered at +epsilon and -epsilon with very few intermediate values, since FGSM takes a single gradient step with the sign function.
PGD fingerprint: Perturbation values are more continuously distributed but still bounded by epsilon. The distribution is smoother than FGSM because of the iterative optimization.
C&W fingerprint: Perturbation values follow a near-Gaussian distribution centered at zero, with the L2 norm being notably smaller than the L-infinity norm would suggest. The perturbation is highly structured, concentrating on semantically important regions.
AutoAttack fingerprint: Combines multiple attack strategies; the perturbation may show characteristics of multiple methods depending on which sub-attack succeeded.
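These fingerprints can be turned into a rough heuristic classifier. The sketch below is illustrative only (the `fingerprint_attack_method` name and the thresholds are assumptions, not calibrated values): it keys on how much of the perturbation mass sits at the L-infinity bound, which separates sign-step attacks from iterative and L2-optimized ones.

```python
import numpy as np

def fingerprint_attack_method(perturbation: np.ndarray, tol: float = 0.05) -> str:
    """Heuristically attribute a perturbation to an attack family.

    FGSM pushes nearly every value to +/-epsilon (the sign of the
    gradient), so almost all mass sits at the L-infinity bound. PGD
    stays epsilon-bounded but only partially saturates. C&W produces
    a near-Gaussian, zero-centred perturbation where almost no values
    sit near the maximum.
    """
    flat = perturbation.flatten().astype(np.float64)
    eps = float(np.max(np.abs(flat)))
    if eps == 0:
        return "no perturbation"
    # Fraction of values within tol * eps of the L-infinity bound
    at_bound = float(np.mean(np.abs(np.abs(flat) - eps) < tol * eps))
    if at_bound > 0.9:
        return "FGSM-like (sign-step, values clustered at +/-epsilon)"
    if at_bound < 0.01:
        return "C&W-like (zero-centred, small-L2 optimized)"
    return "PGD-like (epsilon-bounded, partially saturated)"
```

In practice this heuristic should be treated as one attribution signal among several, cross-checked against the norm-based analysis from `extract_perturbation`.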
MITRE ATLAS Mapping
Forensic findings should be mapped to MITRE ATLAS techniques to enable cross-organization threat intelligence sharing. The following table maps common adversarial input attack types to ATLAS technique identifiers.
| Forensic Finding | ATLAS Technique | ATLAS ID |
|---|---|---|
| Gradient-based perturbation detected | Craft Adversarial Data > White-Box Optimization | AML.T0043.000 |
| Black-box query pattern in logs | Craft Adversarial Data > Black-Box Optimization | AML.T0043.001 |
| Physical-world adversarial patch | Craft Adversarial Data > Physical Environment | AML.T0043.003 |
| Adversarial text substitutions | Craft Adversarial Data > Insert Backdoor Trigger | AML.T0043.002 |
| Transfer attack from surrogate model | Develop Capabilities > Adversarial ML Attack Development | AML.T0018 |
| Evasion of input validation | Evade ML Model | AML.T0015 |
Forensic Investigation Workflow
Step 1: Initial Triage
When an anomalous model behavior is reported, the investigator first determines whether adversarial input is a plausible cause. This involves checking:
- Model confidence patterns at the time of the incident
- Input distribution statistics versus the baseline
- Whether the model output was incorrect or harmful
- Temporal clustering of anomalous predictions
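A minimal triage helper might combine these four signals into an investigation priority. The function name and thresholds below are placeholders to be calibrated against your own baseline traffic, not recommended values:

```python
def triage_incident(
    mean_confidence_margin: float,
    input_anomaly_score: float,
    harmful_output: bool,
    anomalous_count_in_window: int,
) -> str:
    """Combine the four triage signals into an investigation priority."""
    score = 0
    if mean_confidence_margin < 0.15:   # unusually low top-1/top-2 margin
        score += 1
    if input_anomaly_score > 0.65:      # statistical deviation from baseline
        score += 1
    if harmful_output:                  # incorrect or harmful model output
        score += 2
    if anomalous_count_in_window >= 5:  # temporal clustering of anomalies
        score += 1
    if score >= 4:
        return "high"
    if score >= 2:
        return "medium"
    return "low"
```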
Step 2: Evidence Preservation
Execute the evidence collection protocol described above. Ensure all raw inputs, model outputs, and system state are preserved with cryptographic integrity verification.
Step 3: Adversarial Confirmation
Apply the detection methods (statistical analysis, feature squeezing, confidence analysis) to confirm that the input was adversarial rather than a natural edge case or data quality issue.
Step 4: Perturbation Extraction and Analysis
Extract the adversarial perturbation and characterize it using the methods described above. This step produces the technical evidence needed for attribution.
Step 5: Attack Attribution
Map the perturbation characteristics to known attack methods. Cross-reference with threat intelligence on known adversarial toolkits (Adversarial Robustness Toolbox, Foolbox, CleverHans) and their default parameters.
```shell
# Example: Using ART (Adversarial Robustness Toolbox) to test
# whether a sample matches known attack profiles
pip install adversarial-robustness-toolbox

# Run the ART detection suite against a suspicious sample
python -c "
from art.defences.detector.evasion import BinaryInputDetector
from art.estimators.classification import PyTorchClassifier
import torch
import numpy as np

# Load your model and suspicious sample
# detector = BinaryInputDetector(classifier)
# result = detector.detect(suspicious_input)
print('ART detection pipeline initialized')
"
```

Step 6: Reporting and Remediation
Produce a forensic report that includes:
- Timeline of the incident
- Technical analysis of the adversarial input
- ATLAS technique mapping
- Recommended mitigations (adversarial training, input preprocessing, ensemble defenses)
- Indicators of compromise (IoCs) for sharing with peer organizations
Tooling Reference
Several established tools support adversarial input forensics:
- Adversarial Robustness Toolbox (ART) by IBM Research: Provides both attack simulation and detection capabilities. Available at github.com/Trusted-AI/adversarial-robustness-toolbox.
- Foolbox: A Python library for creating and analyzing adversarial examples. Supports multiple deep learning frameworks.
- CleverHans: Originally developed by Goodfellow et al. for adversarial example research.
- Counterfit by Microsoft: An automation tool for assessing ML model security, built on top of ART and TextAttack.
- MITRE ATLAS Navigator: For mapping forensic findings to standardized threat techniques.
Case Study: Image Classification Evasion
Consider a production image classification system used for content moderation that begins misclassifying harmful images as benign. The forensic investigation proceeds as follows:
1. Detection: Monitoring dashboards show a spike in the misclassification rate from 2% to 12% over a 4-hour window, concentrated on a specific content category.
2. Evidence collection: The forensic team exports all inference records from the affected time window, including raw images, model predictions, and confidence scores.
3. Statistical analysis: High-frequency energy analysis reveals that the flagged images contain elevated noise in the upper spatial-frequency bands, inconsistent with natural camera noise or JPEG compression artifacts.
4. Perturbation extraction: Comparing flagged images against their nearest neighbors in the clean training set reveals a consistent perturbation pattern with an L-infinity norm of 8/255 -- a common epsilon value for PGD attacks.
5. Attribution: The perturbation characteristics match the default PGD configuration in the Adversarial Robustness Toolbox, suggesting the attacker used an off-the-shelf tool without customization.
6. Remediation: The team deploys adversarial training using PGD-generated examples and adds a feature squeezing preprocessor as an additional detection layer.
Advanced Topics
Transferability Analysis
Adversarial inputs crafted against one model often transfer to other models -- a property known as adversarial transferability. Forensic investigators should test recovered adversarial samples against other models in the organization's deployment to assess the scope of the threat.
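A simple harness for this runs each recovered sample through every deployed model and reports the fraction it fools. The sketch below is a minimal illustration (the `measure_transferability` name and the dict-of-callables interface are assumptions): each predict function maps a batch of inputs to class indices, and the transfer rate is the per-model misclassification fraction.

```python
from typing import Callable

import numpy as np

def measure_transferability(
    adversarial_samples: np.ndarray,
    true_labels: np.ndarray,
    models: dict[str, Callable[[np.ndarray], np.ndarray]],
) -> dict[str, float]:
    """Measure how often recovered adversarial samples fool each model.

    Returns, per model name, the fraction of samples whose predicted
    class differs from the true label.
    """
    rates = {}
    for name, predict in models.items():
        preds = predict(adversarial_samples)
        rates[name] = float(np.mean(preds != true_labels))
    return rates
```

High transfer rates across internal models indicate the organization-wide blast radius of the recovered samples and should be recorded in the forensic report.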
Adaptive Attacks Against Detectors
Sophisticated attackers may craft adversarial inputs that simultaneously evade both the target model and adversarial detection mechanisms. This adaptive attack scenario requires defense-in-depth strategies and continuous updates to detection methods. The forensic investigator should be aware that the absence of detection does not guarantee the absence of adversarial inputs.
Temporal Pattern Analysis
When adversarial inputs are delivered as part of a sustained campaign, temporal analysis can reveal patterns in attack timing, frequency, and evolution. Attackers may start with crude attacks and progressively refine their techniques based on observed success rates, creating an observable escalation pattern in the forensic record.
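A sliding-window count over the timestamps of flagged inputs is often enough to surface such campaigns. A minimal sketch (the `detect_attack_bursts` name, window size, and threshold are illustrative choices):

```python
import numpy as np

def detect_attack_bursts(
    timestamps: list[float],
    window_seconds: float = 300.0,
    burst_threshold: int = 10,
) -> list[tuple[float, int]]:
    """Find time windows with an unusual concentration of flagged inputs.

    Slides a window over the sorted event timestamps and reports the
    window start and event count wherever the count reaches
    `burst_threshold`.
    """
    ts = np.sort(np.array(timestamps, dtype=np.float64))
    bursts = []
    i = 0  # left edge of the current window
    for j, t in enumerate(ts):
        while t - ts[i] > window_seconds:
            i += 1
        count = j - i + 1
        if count >= burst_threshold:
            bursts.append((float(ts[i]), count))
    return bursts
```

Plotting burst counts over days or weeks exposes the escalation pattern described above: early, crude probing followed by denser, refined waves.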
References
- Xu, W., Evans, D., & Qi, Y. (2018). Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. Proceedings of the 2018 Network and Distributed System Security Symposium (NDSS). https://doi.org/10.14722/ndss.2018.23198
- MITRE ATLAS. (2024). Adversarial Threat Landscape for Artificial Intelligence Systems. https://atlas.mitre.org/
- Nicolae, M.-I. et al. (2018). Adversarial Robustness Toolbox v1.0.0. arXiv preprint arXiv:1807.01069. https://arxiv.org/abs/1807.01069
- Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1412.6572