Adversarial Input Forensics
Forensic techniques for identifying, reconstructing, and analyzing adversarial inputs used to manipulate AI system behavior.
Overview
Adversarial input forensics is the discipline of recovering, analyzing, and attributing malicious inputs that were crafted to cause AI systems to behave in unintended ways. Unlike traditional software exploitation where payloads leave clear signatures in memory or on disk, adversarial inputs to AI systems can be subtle perturbations -- imperceptible pixel shifts in images, carefully selected token substitutions in text, or statistically crafted feature vectors in tabular data -- that leave minimal forensic traces unless specific instrumentation is in place.
This article covers the complete forensic lifecycle for adversarial inputs: from detection and evidence preservation through reconstruction, analysis, and attribution. We draw on established frameworks including MITRE ATLAS (Adversarial Threat Landscape for AI Systems) and the NIST AI Risk Management Framework to ground forensic procedures in recognized standards.
The forensic investigator's challenge is compounded by the fact that adversarial inputs are, by design, difficult to distinguish from legitimate data. A well-crafted adversarial image may look identical to a human observer, and an adversarial text prompt may read as perfectly natural language. The forensic process must therefore rely on computational analysis, model introspection, and statistical anomaly detection rather than human inspection alone.
Adversarial Input Taxonomy
Before diving into forensic techniques, investigators must understand the taxonomy of adversarial inputs they may encounter. Each category presents different forensic challenges and requires different collection and analysis approaches.
Perturbation-Based Attacks
Perturbation attacks apply small, calculated modifications to legitimate inputs. In computer vision, these are typically L-p norm bounded perturbations computed via gradient-based methods such as FGSM (Fast Gradient Sign Method), PGD (Projected Gradient Descent), or C&W (Carlini & Wagner) attacks. In NLP, perturbations manifest as character-level swaps, word-level substitutions using synonym sets, or sentence-level paraphrasing that preserves semantic meaning while altering model predictions.
| Attack Method | Domain | Perturbation Type | Forensic Detectability |
|---|---|---|---|
| FGSM | Vision | L-infinity bounded pixel shifts | Medium -- uniform noise pattern |
| PGD | Vision | Iterative L-p bounded | Low -- optimized to minimize detectability |
| C&W | Vision | L-2 optimized | Low -- specifically minimizes perturbation magnitude |
| TextFooler | NLP | Word-level substitution | Medium -- unusual synonym choices |
| DeepWordBug | NLP | Character-level perturbation | High -- typo-like artifacts |
| Universal Perturbation | Vision | Single perturbation for many inputs | High -- reused pattern across samples |
Patch-Based Attacks
Adversarial patches are localized, visually conspicuous modifications applied to a small region of an input. Unlike perturbation attacks, patches do not attempt to be imperceptible. Instead, they exploit the model's sensitivity to specific spatial patterns. In physical-world attacks, patches can be printed and placed in the environment -- on stop signs, clothing, or other objects within a camera's field of view.
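Because patches are spatially concentrated rather than imperceptible, a coarse localization pass can narrow the search before manual review. The sketch below is illustrative (the `localize_patch_candidates` name, window size, and z-score threshold are assumptions, not from any specific toolkit): it scores each non-overlapping window by its local gradient energy and flags statistical outliers, since printed patches tend to pack unusually dense high-contrast texture into a small region.

```python
import numpy as np

def localize_patch_candidates(
    image: np.ndarray,
    window: int = 16,
    z_threshold: float = 3.0,
) -> list[tuple[int, int]]:
    """Flag windows whose local gradient energy is an outlier.

    Returns the (row, col) origin of each window whose summed gradient
    energy lies more than `z_threshold` standard deviations above the
    mean over all windows.
    """
    gray = image.mean(axis=-1) if image.ndim == 3 else image.astype(np.float64)

    # Accumulate first-difference gradient magnitude per pixel
    gx = np.abs(np.diff(gray, axis=1))
    gy = np.abs(np.diff(gray, axis=0))
    energy = np.zeros_like(gray)
    energy[:, :-1] += gx
    energy[:-1, :] += gy

    h, w = gray.shape
    scores, coords = [], []
    for r in range(0, h - window + 1, window):
        for c in range(0, w - window + 1, window):
            scores.append(energy[r:r + window, c:c + window].sum())
            coords.append((r, c))
    scores = np.array(scores)
    if scores.std() == 0:
        return []  # uniform image: nothing stands out
    z = (scores - scores.mean()) / scores.std()
    return [coords[i] for i in np.nonzero(z > z_threshold)[0]]
```

Flagged windows are candidates only; confirmation still requires the model-introspection methods described later.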
Input Transformation Attacks
These attacks apply geometric or color-space transformations that are technically within the distribution of natural variation but push the input across a decision boundary. Rotations, scaling, brightness shifts, and JPEG compression artifacts can all be weaponized. These are forensically challenging because the transformations are individually innocuous.
Adversarial Text Prompts
In the context of LLMs, adversarial inputs include prompt injections, jailbreak sequences, and obfuscated instructions. These are covered in depth in the Prompt Injection Forensics article, but the forensic framework here applies to the broader class of adversarial text inputs targeting any NLP model.
Evidence Collection and Preservation
Logging Architecture for Adversarial Input Capture
Effective forensic investigation depends on logging infrastructure that captures sufficient data at inference time. The following Python module demonstrates a forensic logging wrapper that records the metadata needed for post-incident analysis.
```python
import hashlib
import json
import logging
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any

logger = logging.getLogger("adversarial_forensics")


@dataclass
class InferenceRecord:
    """Forensic record of a single inference request."""
    request_id: str
    timestamp: float
    input_hash_sha256: str
    input_size_bytes: int
    input_modality: str  # "text", "image", "tabular", "audio"
    model_id: str
    model_version: str
    prediction: Any = None
    confidence_scores: list[float] = field(default_factory=list)
    latency_ms: float = 0.0
    input_metadata: dict = field(default_factory=dict)
    anomaly_flags: list[str] = field(default_factory=list)


class ForensicInferenceLogger:
    """Wraps model inference with forensic-grade logging."""

    def __init__(self, log_dir: str, model_id: str, model_version: str):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.model_id = model_id
        self.model_version = model_version

    def compute_input_hash(self, raw_input: bytes) -> str:
        return hashlib.sha256(raw_input).hexdigest()

    def log_inference(self, record: InferenceRecord) -> None:
        log_path = self.log_dir / f"{record.request_id}.json"
        log_path.write_text(json.dumps(asdict(record), default=str))
        logger.info(
            "Logged inference %s (anomaly_flags=%s)",
            record.request_id,
            record.anomaly_flags,
        )

    def create_record(
        self,
        request_id: str,
        raw_input: bytes,
        modality: str,
        metadata: dict | None = None,
    ) -> InferenceRecord:
        return InferenceRecord(
            request_id=request_id,
            timestamp=time.time(),
            input_hash_sha256=self.compute_input_hash(raw_input),
            input_size_bytes=len(raw_input),
            input_modality=modality,
            model_id=self.model_id,
            model_version=self.model_version,
            input_metadata=metadata or {},
        )
```

Chain of Custody for Digital Adversarial Samples
When adversarial inputs are identified, they become evidence. The chain of custody must be maintained following digital forensics best practices adapted for AI artifacts.
- Immediate preservation: Hash the raw input bytes using SHA-256 before any transformation. Store the hash in immutable, append-only storage.
- Raw input archival: Store the unmodified input in write-once storage. For images, preserve the original file with EXIF data intact. For text, preserve raw bytes including encoding.
- Context capture: Record the full inference context -- system prompt, conversation history, retrieval-augmented generation (RAG) context, and any tool call results that accompanied the adversarial input.
- Model state snapshot: Record the exact model version, checkpoint hash, configuration parameters, and any adapter weights (LoRA, QLoRA) active at the time of the incident.
Detection Techniques
Statistical Anomaly Detection
The first layer of adversarial input detection relies on statistical properties of the input that deviate from the expected distribution. These methods do not require knowledge of the specific attack used.
```python
import numpy as np
from scipy import stats


class StatisticalAdversarialDetector:
    """Detect adversarial inputs using statistical distribution analysis."""

    def __init__(self, reference_stats: dict):
        """
        Initialize with reference statistics computed from known-clean data.

        Args:
            reference_stats: Dict with keys 'mean', 'std', 'kurtosis',
                'skewness' computed from clean input distribution.
        """
        self.reference = reference_stats

    def analyze_image_input(self, pixel_array: np.ndarray) -> dict:
        """
        Analyze an image input for adversarial perturbation indicators.

        Returns a dict of anomaly scores and flags.
        """
        results = {}

        # Check pixel value distribution against reference
        flat = pixel_array.flatten().astype(np.float64)
        results["mean_deviation"] = abs(float(np.mean(flat)) - self.reference["mean"])
        results["std_deviation"] = abs(float(np.std(flat)) - self.reference["std"])

        # Kurtosis analysis -- adversarial perturbations often alter
        # the tails of the pixel distribution
        input_kurtosis = float(stats.kurtosis(flat))
        results["kurtosis_deviation"] = abs(
            input_kurtosis - self.reference["kurtosis"]
        )

        # High-frequency energy analysis via discrete differences.
        # Adversarial perturbations often increase high-frequency content
        if pixel_array.ndim >= 2:
            dx = np.diff(pixel_array, axis=-1)
            dy = np.diff(pixel_array, axis=-2)
            hf_energy = float(np.mean(dx ** 2) + np.mean(dy ** 2))
            results["high_freq_energy"] = hf_energy

        # Compute overall anomaly score (weighted combination)
        anomaly_score = (
            0.2 * min(results["mean_deviation"] / 10.0, 1.0)
            + 0.2 * min(results["std_deviation"] / 10.0, 1.0)
            + 0.3 * min(results["kurtosis_deviation"] / 2.0, 1.0)
            + 0.3 * min(results.get("high_freq_energy", 0) / 1000.0, 1.0)
        )
        results["anomaly_score"] = float(anomaly_score)
        results["flagged"] = anomaly_score > 0.65
        return results

    def analyze_text_input(self, tokens: list[str], token_ids: list[int]) -> dict:
        """Analyze text input for adversarial token patterns."""
        results = {}

        # Token-diversity analysis: adversarial text often shows unusual
        # token repetition and co-occurrence patterns
        unique_ratio = len(set(token_ids)) / max(len(token_ids), 1)
        results["unique_token_ratio"] = float(unique_ratio)

        # Check for character-level perturbation indicators
        unicode_anomalies = sum(
            1 for t in tokens if any(ord(c) > 127 for c in t)
        )
        results["unicode_anomaly_count"] = unicode_anomalies

        # Check for unusual whitespace or control characters
        control_chars = sum(
            1 for t in tokens
            for c in t
            if ord(c) < 32 and c not in ('\n', '\t', '\r')
        )
        results["control_char_count"] = control_chars

        results["flagged"] = (
            unicode_anomalies > len(tokens) * 0.1
            or control_chars > 0
        )
        return results
```

Model Confidence Analysis
Adversarial inputs often produce characteristic patterns in model confidence scores. A correctly classified benign input typically produces a high-confidence prediction with a clear margin between the top-1 and top-2 classes. Adversarial inputs, by contrast, may produce predictions with atypical confidence distributions.
```python
import numpy as np


def analyze_confidence_distribution(
    confidence_scores: list[float],
    threshold_margin: float = 0.15,
    threshold_entropy: float = 1.5,
) -> dict:
    """
    Analyze model confidence scores for adversarial indicators.

    Adversarial inputs often produce:
    - Low margin between top-1 and top-2 predictions
    - High prediction entropy
    - Confidence values clustered near decision boundaries
    """
    scores = sorted(confidence_scores, reverse=True)
    top1 = scores[0] if scores else 0.0
    top2 = scores[1] if len(scores) > 1 else 0.0
    margin = top1 - top2

    # Shannon entropy of the prediction distribution
    probs = np.array(confidence_scores, dtype=np.float64)
    probs = probs / probs.sum()  # normalize
    entropy = float(-np.sum(probs * np.log2(probs + 1e-10)))

    return {
        "top1_confidence": top1,
        "top2_confidence": top2,
        "margin": margin,
        "entropy": entropy,
        "low_margin_flag": margin < threshold_margin,
        "high_entropy_flag": entropy > threshold_entropy,
        "adversarial_likelihood": "high" if (
            margin < threshold_margin and entropy > threshold_entropy
        ) else "medium" if (
            margin < threshold_margin or entropy > threshold_entropy
        ) else "low",
    }
```

Feature Squeezing Detection
Feature squeezing, introduced by Xu et al. (2018), detects adversarial inputs by comparing a model's prediction on the original input against predictions on "squeezed" versions of the input -- versions where the input space has been reduced through operations like bit-depth reduction or spatial smoothing. If the predictions differ significantly, the input is likely adversarial.
```python
from typing import Callable

import numpy as np
from scipy.ndimage import gaussian_filter


def feature_squeezing_detect(
    input_array: np.ndarray,
    predict_fn: Callable[[np.ndarray], np.ndarray],
    bit_depth: int = 4,
    spatial_sigma: float = 1.0,
) -> dict:
    """
    Detect adversarial inputs using feature squeezing.

    Based on Xu et al. 2018 -- 'Feature Squeezing: Detecting
    Adversarial Examples in Deep Neural Networks' (NDSS 2018).
    """
    original_pred = predict_fn(input_array)

    # Squeeze 1: Bit-depth reduction to 2^bit_depth levels
    # (assumes non-negative inputs; an all-zero input is left unchanged)
    max_val = float(input_array.max()) or 1.0
    levels = 2 ** bit_depth - 1
    squeezed_bits = np.round(input_array / max_val * levels) / levels * max_val
    pred_bits = predict_fn(squeezed_bits)

    # Squeeze 2: Spatial smoothing
    squeezed_smooth = gaussian_filter(input_array, sigma=spatial_sigma)
    pred_smooth = predict_fn(squeezed_smooth)

    # Maximum absolute difference between original and squeezed
    # prediction vectors (Xu et al. score the L1 distance; the max
    # is a stricter per-class variant)
    dist_bits = float(np.max(np.abs(original_pred - pred_bits)))
    dist_smooth = float(np.max(np.abs(original_pred - pred_smooth)))
    max_dist = max(dist_bits, dist_smooth)

    return {
        "bit_squeeze_distance": dist_bits,
        "smooth_squeeze_distance": dist_smooth,
        "max_squeeze_distance": max_dist,
        "adversarial_detected": max_dist > 0.1,
    }
```

Reconstruction and Analysis
Perturbation Extraction
Once an adversarial input is identified, the next forensic step is to extract and characterize the adversarial perturbation. This requires access to either the original clean input (if available) or a denoised reconstruction.
```python
import numpy as np


def extract_perturbation(
    adversarial_input: np.ndarray,
    clean_reference: np.ndarray,
) -> dict:
    """
    Extract and characterize the adversarial perturbation.

    Computes the difference between the adversarial and clean inputs,
    then analyzes its properties for attribution purposes.
    """
    perturbation = (
        adversarial_input.astype(np.float64) - clean_reference.astype(np.float64)
    )
    analysis = {
        "l2_norm": float(np.linalg.norm(perturbation)),
        "linf_norm": float(np.max(np.abs(perturbation))),
        "l1_norm": float(np.sum(np.abs(perturbation))),
        "mean_perturbation": float(np.mean(np.abs(perturbation))),
        "sparsity": float(np.mean(np.abs(perturbation) < 1e-6)),
        "spatial_distribution": _analyze_spatial_distribution(perturbation),
    }

    # Classify likely attack method based on perturbation characteristics
    if analysis["linf_norm"] < 0.05 and analysis["sparsity"] < 0.1:
        analysis["likely_method"] = "PGD or FGSM (dense, small perturbation)"
    elif analysis["sparsity"] > 0.9:
        analysis["likely_method"] = "Sparse attack (few pixels modified)"
    elif analysis["l2_norm"] < 2.0:
        analysis["likely_method"] = "C&W L2 attack (optimized small L2)"
    else:
        analysis["likely_method"] = "Unknown or patch-based attack"
    return analysis


def _analyze_spatial_distribution(perturbation: np.ndarray) -> dict:
    """Analyze where in the input the perturbation is concentrated."""
    if perturbation.ndim < 2:
        return {"type": "1d", "peak_index": int(np.argmax(np.abs(perturbation)))}

    # Sum squared energy over the channel axis (assumes channels-last for 3D)
    energy = (
        np.sum(perturbation ** 2, axis=-1)
        if perturbation.ndim == 3
        else perturbation ** 2
    )
    h, w = energy.shape
    quadrants = {
        "top_left": float(energy[:h//2, :w//2].sum()),
        "top_right": float(energy[:h//2, w//2:].sum()),
        "bottom_left": float(energy[h//2:, :w//2].sum()),
        "bottom_right": float(energy[h//2:, w//2:].sum()),
    }
    total = sum(quadrants.values())
    if total > 0:
        quadrants = {k: v / total for k, v in quadrants.items()}
    return quadrants
```

Attack Method Fingerprinting
Different adversarial attack algorithms leave characteristic fingerprints in the perturbation structure. Forensic investigators can use these fingerprints to attribute an attack to a specific method or toolset.
FGSM fingerprint: Perturbation values are clustered at +epsilon and -epsilon with very few intermediate values, since FGSM takes a single gradient step with the sign function.
PGD fingerprint: Perturbation values are more continuously distributed but still bounded by epsilon. The distribution is smoother than FGSM because of the iterative optimization.
C&W fingerprint: Perturbation values follow a near-Gaussian distribution centered at zero, with the L2 norm being notably smaller than the L-infinity norm would suggest. The perturbation is highly structured, concentrating on semantically important regions.
AutoAttack fingerprint: Combines multiple attack strategies; the perturbation may show characteristics of multiple methods depending on which sub-attack succeeded.
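These fingerprints can be turned into a rough heuristic classifier. The sketch below is illustrative only (the `fingerprint_attack_method` name and the thresholds are assumptions, not calibrated values): it keys on how much of the perturbation mass sits at the L-infinity bound, which separates sign-step attacks from iterative and L2-optimized ones.

```python
import numpy as np

def fingerprint_attack_method(perturbation: np.ndarray, tol: float = 0.05) -> str:
    """Heuristically attribute a perturbation to an attack family.

    FGSM pushes nearly every value to +/-epsilon (the sign of the
    gradient), so almost all mass sits at the L-infinity bound. PGD
    stays epsilon-bounded but only partially saturates. C&W produces
    a near-Gaussian, zero-centred perturbation where almost no values
    sit near the maximum.
    """
    flat = perturbation.flatten().astype(np.float64)
    eps = float(np.max(np.abs(flat)))
    if eps == 0:
        return "no perturbation"
    # Fraction of values within tol * eps of the L-infinity bound
    at_bound = float(np.mean(np.abs(np.abs(flat) - eps) < tol * eps))
    if at_bound > 0.9:
        return "FGSM-like (sign-step, values clustered at +/-epsilon)"
    if at_bound < 0.01:
        return "C&W-like (zero-centred, small-L2 optimized)"
    return "PGD-like (epsilon-bounded, partially saturated)"
```

In practice this heuristic should be treated as one attribution signal among several, cross-checked against the norm-based analysis from `extract_perturbation`.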
MITRE ATLAS Mapping
Forensic findings should be mapped to MITRE ATLAS techniques to enable cross-organization threat intelligence sharing. The following table maps common adversarial input attack types to ATLAS technique identifiers.
| Forensic Finding | ATLAS Technique | ATLAS ID |
|---|---|---|
| Gradient-based perturbation detected | Craft Adversarial Data > White-Box Optimization | AML.T0043.000 |
| Black-box query pattern in logs | Craft Adversarial Data > Black-Box Optimization | AML.T0043.001 |
| Physical-world adversarial patch | Craft Adversarial Data > Physical Environment | AML.T0043.003 |
| Adversarial text substitutions | Craft Adversarial Data > Insert Backdoor Trigger | AML.T0043.002 |
| Transfer attack from surrogate model | Develop Capabilities > Adversarial ML Attack Development | AML.T0018 |
| Evasion of input validation | Evade ML Model | AML.T0015 |
Forensic Investigation Workflow
Step 1: Initial Triage
When an anomalous model behavior is reported, the investigator first determines whether adversarial input is a plausible cause. This involves checking:
- Model confidence patterns at the time of the incident
- Input distribution statistics versus the baseline
- Whether the model output was incorrect or harmful
- Temporal clustering of anomalous predictions
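A minimal triage helper might combine these four signals into an investigation priority. The function name and thresholds below are placeholders to be calibrated against your own baseline traffic, not recommended values:

```python
def triage_incident(
    mean_confidence_margin: float,
    input_anomaly_score: float,
    harmful_output: bool,
    anomalous_count_in_window: int,
) -> str:
    """Combine the four triage signals into an investigation priority."""
    score = 0
    if mean_confidence_margin < 0.15:   # unusually low top-1/top-2 margin
        score += 1
    if input_anomaly_score > 0.65:      # statistical deviation from baseline
        score += 1
    if harmful_output:                  # incorrect or harmful model output
        score += 2
    if anomalous_count_in_window >= 5:  # temporal clustering of anomalies
        score += 1
    if score >= 4:
        return "high"
    if score >= 2:
        return "medium"
    return "low"
```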
Step 2: Evidence Preservation
Execute the evidence collection protocol described above. Ensure all raw inputs, model outputs, and system state are preserved with cryptographic integrity verification.
Step 3: Adversarial Confirmation
Apply the detection methods (statistical analysis, feature squeezing, confidence analysis) to confirm that the input was adversarial rather than a natural edge case or data quality issue.
Step 4: Perturbation Extraction and Analysis
Extract the adversarial perturbation and characterize it using the methods described above. This step produces the technical evidence needed for attribution.
Step 5: Attack Attribution
Map the perturbation characteristics to known attack methods. Cross-reference with threat intelligence on known adversarial toolkits (Adversarial Robustness Toolbox, Foolbox, CleverHans) and their default parameters.
```shell
# Example: Using ART (Adversarial Robustness Toolbox) to test
# whether a sample matches known attack profiles
pip install adversarial-robustness-toolbox

# Run the ART detection suite against a suspicious sample
python -c "
from art.defences.detector.evasion import BinaryInputDetector
from art.estimators.classification import PyTorchClassifier
import torch
import numpy as np

# Load your model and suspicious sample
# detector = BinaryInputDetector(classifier)
# result = detector.detect(suspicious_input)
print('ART detection pipeline initialized')
"
```

Step 6: Reporting and Remediation
Produce a forensic report that includes:
- Timeline of the incident
- Technical analysis of the adversarial input
- ATLAS technique mapping
- Recommended mitigations (adversarial training, input preprocessing, ensemble defenses)
- Indicators of compromise (IoCs) for sharing with peer organizations
Tooling Reference
Several established tools support adversarial input forensics:
- Adversarial Robustness Toolbox (ART) by IBM Research: Provides both attack simulation and detection capabilities. Available at github.com/Trusted-AI/adversarial-robustness-toolbox.
- Foolbox: A Python library for creating and analyzing adversarial examples. Supports multiple deep learning frameworks.
- CleverHans: Originally developed by Goodfellow et al. for adversarial example research.
- Counterfit by Microsoft: An automation tool for assessing ML model security, built on top of ART and TextAttack.
- MITRE ATLAS Navigator: For mapping forensic findings to standardized threat techniques.
Case Study: Image Classification Evasion
Consider a production image classification system used for content moderation that begins misclassifying harmful images as benign. The forensic investigation proceeds as follows:
1. Detection: Monitoring dashboards show a spike in the misclassification rate from 2% to 12% over a 4-hour window, concentrated on a specific content category.
2. Evidence collection: The forensic team exports all inference records from the affected time window, including raw images, model predictions, and confidence scores.
3. Statistical analysis: High-frequency energy analysis reveals that the flagged images contain elevated noise in the upper spatial-frequency bands, inconsistent with natural camera noise or JPEG compression artifacts.
4. Perturbation extraction: Comparing flagged images against their nearest neighbors in the clean training set reveals a consistent perturbation pattern with an L-infinity norm of 8/255 -- a common epsilon value for PGD attacks.
5. Attribution: The perturbation characteristics match the default PGD configuration in the Adversarial Robustness Toolbox, suggesting the attacker used an off-the-shelf tool without customization.
6. Remediation: The team deploys adversarial training using PGD-generated examples and adds a feature squeezing preprocessor as an additional detection layer.
Advanced Topics
Transferability Analysis
Adversarial inputs crafted against one model often transfer to other models -- a property known as adversarial transferability. Forensic investigators should test recovered adversarial samples against other models in the organization's deployment to assess the scope of the threat.
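A simple harness for this runs each recovered sample through every deployed model and reports the fraction it fools. The sketch below is a minimal illustration (the `measure_transferability` name and the dict-of-callables interface are assumptions): each predict function maps a batch of inputs to class indices, and the transfer rate is the per-model misclassification fraction.

```python
from typing import Callable

import numpy as np

def measure_transferability(
    adversarial_samples: np.ndarray,
    true_labels: np.ndarray,
    models: dict[str, Callable[[np.ndarray], np.ndarray]],
) -> dict[str, float]:
    """Measure how often recovered adversarial samples fool each model.

    Returns, per model name, the fraction of samples whose predicted
    class differs from the true label.
    """
    rates = {}
    for name, predict in models.items():
        preds = predict(adversarial_samples)
        rates[name] = float(np.mean(preds != true_labels))
    return rates
```

High transfer rates across internal models indicate the organization-wide blast radius of the recovered samples and should be recorded in the forensic report.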
Adaptive Attacks Against Detectors
Sophisticated attackers may craft adversarial inputs that simultaneously evade both the target model and adversarial detection mechanisms. This adaptive attack scenario requires defense-in-depth strategies and continuous updates to detection methods. The forensic investigator should be aware that the absence of detection does not guarantee the absence of adversarial inputs.
Temporal Pattern Analysis
When adversarial inputs are delivered as part of a sustained campaign, temporal analysis can reveal patterns in attack timing, frequency, and evolution. Attackers may start with crude attacks and progressively refine their techniques based on observed success rates, creating an observable escalation pattern in the forensic record.
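A sliding-window count over the timestamps of flagged inputs is often enough to surface such campaigns. A minimal sketch (the `detect_attack_bursts` name, window size, and threshold are illustrative choices):

```python
import numpy as np

def detect_attack_bursts(
    timestamps: list[float],
    window_seconds: float = 300.0,
    burst_threshold: int = 10,
) -> list[tuple[float, int]]:
    """Find time windows with an unusual concentration of flagged inputs.

    Slides a window over the sorted event timestamps and reports the
    window start and event count wherever the count reaches
    `burst_threshold`.
    """
    ts = np.sort(np.array(timestamps, dtype=np.float64))
    bursts = []
    i = 0  # left edge of the current window
    for j, t in enumerate(ts):
        while t - ts[i] > window_seconds:
            i += 1
        count = j - i + 1
        if count >= burst_threshold:
            bursts.append((float(ts[i]), count))
    return bursts
```

Plotting burst counts over days or weeks exposes the escalation pattern described above: early, crude probing followed by denser, refined waves.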
References
- Xu, W., Evans, D., & Qi, Y. (2018). Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. Proceedings of the 2018 Network and Distributed System Security Symposium (NDSS). https://doi.org/10.14722/ndss.2018.23198
- MITRE ATLAS. (2024). Adversarial Threat Landscape for Artificial Intelligence Systems. https://atlas.mitre.org/
- Nicolae, M.-I. et al. (2018). Adversarial Robustness Toolbox v1.0.0. arXiv preprint arXiv:1807.01069. https://arxiv.org/abs/1807.01069
- Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1412.6572