Adversarial Input Forensics
Forensic techniques for identifying, reconstructing, and analyzing adversarial inputs used to manipulate AI system behavior.
Overview
Adversarial input forensics is the discipline of recovering, analyzing, and attributing malicious inputs crafted to make AI systems behave in unintended ways. Unlike traditional software exploitation, where payloads leave clear signatures in memory or on disk, adversarial inputs to AI systems can be subtle perturbations -- imperceptible pixel shifts in images, carefully selected token substitutions in text, or statistically crafted feature vectors in tabular data -- that leave minimal forensic traces unless specific instrumentation is in place.
This article covers the complete forensic lifecycle for adversarial inputs: from detection and evidence preservation through reconstruction, analysis, and attribution. We draw on established frameworks, including MITRE ATLAS (Adversarial Threat Landscape for Artificial Intelligence Systems) and the NIST AI Risk Management Framework, to ground forensic procedures in recognized standards.
The forensic investigator's challenge is compounded by the fact that adversarial inputs are, by design, difficult to distinguish from legitimate data. A well-crafted adversarial image may look identical to its clean counterpart to a human observer, and an adversarial text prompt may read as perfectly natural language. The forensic process must therefore rely on computational analysis, model introspection, and statistical anomaly detection rather than human inspection alone.
Adversarial Input Taxonomy
Before diving into forensic techniques, investigators must understand the taxonomy of adversarial inputs they may encounter. Each category presents different forensic challenges and requires different collection and analysis approaches.
Perturbation-Based Attacks
Perturbation attacks apply small, calculated modifications to legitimate inputs. In computer vision, these are typically L-p norm bounded perturbations computed via gradient-based methods such as FGSM (Fast Gradient Sign Method), PGD (Projected Gradient Descent), or C&W (Carlini & Wagner) attacks. In NLP, perturbations manifest as character-level swaps, word-level substitutions using synonym sets, or sentence-level paraphrasing that preserves semantic meaning while altering model predictions.
| Attack Method | Domain | Perturbation Type | Forensic Detectability |
|---|---|---|---|
| FGSM | Vision | L-infinity bounded pixel shifts | Medium -- uniform noise pattern |
| PGD | Vision | Iterative L-p bounded | Low -- optimized to minimize detectability |
| C&W | Vision | L-2 optimized | Low -- specifically minimizes perturbation magnitude |
| TextFooler | NLP | Word-level substitution | Medium -- unusual synonym choices |
| DeepWordBug | NLP | Character-level perturbation | High -- typo-like artifacts |
| Universal Perturbation | Vision | Single perturbation for many inputs | High -- reused pattern across samples |
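As forensic context, the following sketch shows how a single FGSM step produces the uniform perturbation pattern noted in the table: every component of the difference sits at exactly plus or minus epsilon. The toy logistic model, its weights, and the input values are purely illustrative.

```python
import numpy as np

def fgsm_perturb(x, grad, epsilon=8 / 255):
    """Single-step FGSM: move each feature by epsilon in the direction
    of the loss gradient's sign, then clip back to the valid range."""
    x_adv = x + epsilon * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)

# Toy logistic model so the gradient is concrete: loss = -log p(y=1|x)
w = np.array([0.4, -0.7, 0.2])
x = np.array([0.5, 0.5, 0.5])
p = 1.0 / (1.0 + np.exp(-w @ x))   # predicted probability of class 1
grad = (p - 1.0) * w               # d(loss)/dx for true label y = 1
x_adv = fgsm_perturb(x, grad)
# abs(x_adv - x) is exactly epsilon in every component -- the
# clustering that attack-method fingerprinting relies on.
```

Because the sign function discards gradient magnitude, the perturbation histogram collapses to two spikes, which is what makes FGSM comparatively easy to fingerprint.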
Patch-Based Attacks
Adversarial patches are localized, visually conspicuous modifications applied to a small region of an input. Unlike perturbation attacks, patches do not attempt to be imperceptible. Instead, they exploit the model's sensitivity to specific spatial patterns. In physical-world attacks, patches can be printed and placed in the environment -- on stop signs, clothing, or other objects within a camera's field of view.
Input Transformation Attacks
These attacks apply geometric or color-space transformations that are technically within the distribution of natural variation but push the input across a decision boundary. Rotations, scaling, brightness shifts, and JPEG compression artifacts can all be weaponized. These are forensically challenging because the transformations are individually innocuous.
Adversarial Text Prompts
In the context of LLMs, adversarial inputs include prompt injections, jailbreak sequences, and obfuscated instructions. These are covered in depth in the Prompt Injection Forensics article, but the forensic framework here applies to the broader class of adversarial text inputs targeting any NLP model.
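A quick forensic probe for this class of attack is to check whether small, natural transformations flip the model's prediction, since a legitimate input usually tolerates them while an input sitting adversarially close to a decision boundary does not. The sketch below assumes a `predict_fn` returning a class label; the specific transform set is illustrative.

```python
import numpy as np

def transform_sensitivity_probe(image: np.ndarray, predict_fn) -> dict:
    """Apply benign transformations and report which ones change the
    prediction -- an indicator of decision-boundary proximity."""
    base = predict_fn(image)
    variants = {
        "flip_lr": image[:, ::-1],
        "brightness_up": np.clip(image * 1.1, 0.0, 1.0),
        "brightness_down": np.clip(image * 0.9, 0.0, 1.0),
    }
    flipped = [name for name, v in variants.items()
               if predict_fn(v) != base]
    return {"base_prediction": base, "flipped_under": flipped}
```

A clean input typically yields an empty `flipped_under` list; a long list under mild transforms warrants the deeper statistical analysis described later.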
Evidence Collection and Preservation
Logging Architecture for Adversarial Input Capture
Effective forensic investigation depends on logging infrastructure that captures sufficient data at inference time. The following Python module demonstrates a forensic logging wrapper that records the metadata needed for post-incident analysis.
```python
import hashlib
import json
import logging
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any

logger = logging.getLogger("adversarial_forensics")


@dataclass
class InferenceRecord:
    """Forensic record of a single inference request."""
    request_id: str
    timestamp: float
    input_hash_sha256: str
    input_size_bytes: int
    input_modality: str  # "text", "image", "tabular", "audio"
    model_id: str
    model_version: str
    prediction: Any = None
    confidence_scores: list[float] = field(default_factory=list)
    latency_ms: float = 0.0
    input_metadata: dict = field(default_factory=dict)
    anomaly_flags: list[str] = field(default_factory=list)


class ForensicInferenceLogger:
    """Wraps model inference with forensic-grade logging."""

    def __init__(self, log_dir: str, model_id: str, model_version: str):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.model_id = model_id
        self.model_version = model_version

    def compute_input_hash(self, raw_input: bytes) -> str:
        return hashlib.sha256(raw_input).hexdigest()

    def log_inference(self, record: InferenceRecord) -> None:
        log_path = self.log_dir / f"{record.request_id}.json"
        log_path.write_text(json.dumps(asdict(record), default=str))
        logger.info(
            "Logged inference %s (anomaly_flags=%s)",
            record.request_id,
            record.anomaly_flags,
        )

    def create_record(
        self,
        request_id: str,
        raw_input: bytes,
        modality: str,
        metadata: dict | None = None,
    ) -> InferenceRecord:
        return InferenceRecord(
            request_id=request_id,
            timestamp=time.time(),
            input_hash_sha256=self.compute_input_hash(raw_input),
            input_size_bytes=len(raw_input),
            input_modality=modality,
            model_id=self.model_id,
            model_version=self.model_version,
            input_metadata=metadata or {},
        )
```

Chain of Custody for Digital Adversarial Samples
When adversarial inputs are identified, they become evidence. The chain of custody must be maintained following digital forensics best practices adapted for AI artifacts.
- Immediate preservation: Hash the raw input bytes using SHA-256 before any transformation. Store the hash in immutable, append-only storage.
- Raw input archival: Store the unmodified input in write-once storage. For images, preserve the original file with EXIF data intact. For text, preserve the raw bytes, including encoding.
- Context capture: Record the full inference context -- the system prompt, conversation history, retrieval-augmented generation (RAG) context, and any tool-call results that accompanied the adversarial input.
- Model state snapshot: Record the exact model version, checkpoint hash, configuration parameters, and any adapter weights (LoRA, QLoRA) active at the time of the incident.
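The preservation steps above can be sketched as a small helper that stores the raw bytes under their own hash and appends a custody manifest to a log file. The directory layout and field names are illustrative; production systems would back this with genuinely immutable (WORM) storage.

```python
import hashlib
import json
import time
from pathlib import Path

def preserve_sample(raw_input: bytes, evidence_dir: str,
                    sample_id: str, context: dict) -> str:
    """Preserve a suspected adversarial input and return the SHA-256
    digest that anchors its chain of custody."""
    evidence = Path(evidence_dir)
    evidence.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(raw_input).hexdigest()
    # Store the unmodified bytes under their hash so later tampering
    # is detectable by re-hashing the file.
    (evidence / f"{digest}.bin").write_bytes(raw_input)
    manifest = {
        "sample_id": sample_id,
        "sha256": digest,
        "size_bytes": len(raw_input),
        "preserved_at": time.time(),
        "context": context,  # system prompt, RAG context, model version...
    }
    # Append-only custody log: one JSON line per preservation event.
    with open(evidence / "custody.log", "a") as log:
        log.write(json.dumps(manifest) + "\n")
    return digest
```

The returned digest is what investigators cite in reports; any party can verify the archived sample against it independently.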
Detection Techniques
Statistical Anomaly Detection
The first layer of adversarial input detection relies on statistical properties of the input that deviate from the expected distribution. These methods do not require knowledge of the specific attack used.
```python
import numpy as np
from scipy import stats


class StatisticalAdversarialDetector:
    """Detect adversarial inputs using statistical distribution analysis."""

    def __init__(self, reference_stats: dict):
        """
        Initialize with reference statistics computed from known-clean data.

        Args:
            reference_stats: Dict with keys 'mean', 'std', 'kurtosis',
                'skewness' computed from the clean input distribution.
        """
        self.reference = reference_stats

    def analyze_image_input(self, pixel_array: np.ndarray) -> dict:
        """
        Analyze an image input for adversarial perturbation indicators.

        Returns a dict of anomaly scores and flags.
        """
        results = {}
        # Check pixel value distribution against the reference
        flat = pixel_array.flatten().astype(np.float64)
        results["mean_deviation"] = abs(float(np.mean(flat)) - self.reference["mean"])
        results["std_deviation"] = abs(float(np.std(flat)) - self.reference["std"])
        # Kurtosis analysis -- adversarial perturbations often alter
        # the tails of the pixel distribution
        input_kurtosis = float(stats.kurtosis(flat))
        results["kurtosis_deviation"] = abs(
            input_kurtosis - self.reference["kurtosis"]
        )
        # High-frequency energy analysis via discrete differences;
        # adversarial perturbations often increase high-frequency content
        if pixel_array.ndim >= 2:
            dx = np.diff(pixel_array, axis=-1)
            dy = np.diff(pixel_array, axis=-2)
            hf_energy = float(np.mean(dx ** 2) + np.mean(dy ** 2))
            results["high_freq_energy"] = hf_energy
        # Compute overall anomaly score (weighted combination)
        anomaly_score = (
            0.2 * min(results["mean_deviation"] / 10.0, 1.0)
            + 0.2 * min(results["std_deviation"] / 10.0, 1.0)
            + 0.3 * min(results["kurtosis_deviation"] / 2.0, 1.0)
            + 0.3 * min(results.get("high_freq_energy", 0) / 1000.0, 1.0)
        )
        results["anomaly_score"] = float(anomaly_score)
        results["flagged"] = anomaly_score > 0.65
        return results

    def analyze_text_input(self, tokens: list[str], token_ids: list[int]) -> dict:
        """Analyze a text input for adversarial token patterns."""
        results = {}
        # Repetition analysis: adversarial text often has unusual
        # token co-occurrence patterns
        unique_ratio = len(set(token_ids)) / max(len(token_ids), 1)
        results["unique_token_ratio"] = float(unique_ratio)
        # Check for character-level perturbation indicators
        unicode_anomalies = sum(
            1 for t in tokens if any(ord(c) > 127 for c in t)
        )
        results["unicode_anomaly_count"] = unicode_anomalies
        # Check for unusual whitespace or control characters
        control_chars = sum(
            1 for t in tokens
            for c in t
            if ord(c) < 32 and c not in ('\n', '\t', '\r')
        )
        results["control_char_count"] = control_chars
        results["flagged"] = (
            unicode_anomalies > len(tokens) * 0.1
            or control_chars > 0
        )
        return results
```

Model Confidence Analysis
Adversarial inputs often produce characteristic patterns in model confidence scores. A correctly classified benign input typically yields a high-confidence prediction with a clear margin between the top-1 and top-2 classes. Adversarial inputs, by contrast, may produce predictions with atypical confidence distributions.
```python
import numpy as np


def analyze_confidence_distribution(
    confidence_scores: list[float],
    threshold_margin: float = 0.15,
    threshold_entropy: float = 1.5,
) -> dict:
    """
    Analyze model confidence scores for adversarial indicators.

    Adversarial inputs often produce:
    - Low margin between top-1 and top-2 predictions
    - High prediction entropy
    - Confidence values clustered near decision boundaries
    """
    scores = sorted(confidence_scores, reverse=True)
    top1 = scores[0] if scores else 0.0
    top2 = scores[1] if len(scores) > 1 else 0.0
    margin = top1 - top2
    # Shannon entropy of the prediction distribution
    probs = np.array(confidence_scores, dtype=np.float64)
    probs = probs / probs.sum()  # normalize to a probability distribution
    entropy = float(-np.sum(probs * np.log2(probs + 1e-10)))
    return {
        "top1_confidence": top1,
        "top2_confidence": top2,
        "margin": margin,
        "entropy": entropy,
        "low_margin_flag": margin < threshold_margin,
        "high_entropy_flag": entropy > threshold_entropy,
        "adversarial_likelihood": "high" if (
            margin < threshold_margin and entropy > threshold_entropy
        ) else "medium" if (
            margin < threshold_margin or entropy > threshold_entropy
        ) else "low",
    }
```

Feature Squeezing Detection
Feature squeezing, introduced by Xu et al. (2018), detects adversarial inputs by comparing a model's prediction on the original input against predictions on "squeezed" versions -- versions where the input space has been reduced through operations like bit-depth reduction or spatial smoothing. If the predictions differ significantly, the input is likely adversarial.
```python
from typing import Callable

import numpy as np
from scipy.ndimage import gaussian_filter


def feature_squeezing_detect(
    input_array: np.ndarray,
    predict_fn: Callable[[np.ndarray], np.ndarray],
    bit_depth: int = 4,
    spatial_sigma: float = 1.0,
) -> dict:
    """
    Detect adversarial inputs using feature squeezing.

    Based on Xu et al. 2018 -- 'Feature Squeezing: Detecting
    Adversarial Examples in Deep Neural Networks' (NDSS 2018).
    """
    original_pred = predict_fn(input_array)
    # Squeeze 1: bit-depth reduction to 2**bit_depth quantization levels
    max_val = input_array.max()
    levels = 2 ** bit_depth - 1
    squeezed_bits = np.round(input_array * levels / max_val) * max_val / levels
    pred_bits = predict_fn(squeezed_bits)
    # Squeeze 2: spatial smoothing
    squeezed_smooth = gaussian_filter(input_array, sigma=spatial_sigma)
    pred_smooth = predict_fn(squeezed_smooth)
    # Maximum absolute difference between original and squeezed predictions
    dist_bits = float(np.max(np.abs(original_pred - pred_bits)))
    dist_smooth = float(np.max(np.abs(original_pred - pred_smooth)))
    max_dist = max(dist_bits, dist_smooth)
    return {
        "bit_squeeze_distance": dist_bits,
        "smooth_squeeze_distance": dist_smooth,
        "max_squeeze_distance": max_dist,
        "adversarial_detected": max_dist > 0.1,
    }
```

Reconstruction and Analysis
Perturbation Extraction
Once an adversarial input is identified, the next forensic step is to extract and characterize the adversarial perturbation. This requires access to either the original clean input (if available) or a denoised reconstruction.
```python
import numpy as np


def extract_perturbation(
    adversarial_input: np.ndarray,
    clean_reference: np.ndarray,
) -> dict:
    """
    Extract and characterize the adversarial perturbation.

    Computes the difference between the adversarial and clean inputs,
    then analyzes its properties for attribution purposes.
    """
    perturbation = adversarial_input.astype(np.float64) - clean_reference.astype(np.float64)
    analysis = {
        "l2_norm": float(np.linalg.norm(perturbation)),
        "linf_norm": float(np.max(np.abs(perturbation))),
        "l1_norm": float(np.sum(np.abs(perturbation))),
        "mean_perturbation": float(np.mean(np.abs(perturbation))),
        "sparsity": float(np.mean(np.abs(perturbation) < 1e-6)),
        "spatial_distribution": _analyze_spatial_distribution(perturbation),
    }
    # Classify the likely attack method from perturbation characteristics
    if analysis["linf_norm"] < 0.05 and analysis["sparsity"] < 0.1:
        analysis["likely_method"] = "PGD or FGSM (dense, small perturbation)"
    elif analysis["sparsity"] > 0.9:
        analysis["likely_method"] = "Sparse attack (few pixels modified)"
    elif analysis["l2_norm"] < 2.0:
        analysis["likely_method"] = "C&W L2 attack (optimized small L2)"
    else:
        analysis["likely_method"] = "Unknown or patch-based attack"
    return analysis


def _analyze_spatial_distribution(perturbation: np.ndarray) -> dict:
    """Analyze where in the input the perturbation is concentrated."""
    if perturbation.ndim < 2:
        return {"type": "1d", "peak_index": int(np.argmax(np.abs(perturbation)))}
    # Collapse the channel axis (H x W x C) to a per-pixel energy map
    energy = np.sum(perturbation ** 2, axis=-1) if perturbation.ndim == 3 else perturbation ** 2
    h, w = energy.shape
    quadrants = {
        "top_left": float(energy[:h // 2, :w // 2].sum()),
        "top_right": float(energy[:h // 2, w // 2:].sum()),
        "bottom_left": float(energy[h // 2:, :w // 2].sum()),
        "bottom_right": float(energy[h // 2:, w // 2:].sum()),
    }
    total = sum(quadrants.values())
    if total > 0:
        quadrants = {k: v / total for k, v in quadrants.items()}
    return quadrants
```

Attack Method Fingerprinting
Different adversarial attack algorithms leave characteristic fingerprints in the perturbation structure. Forensic investigators can use these fingerprints to attribute an attack to a specific method or toolset.
FGSM fingerprint: Perturbation values are clustered at +epsilon and -epsilon with very few intermediate values, since FGSM takes a single gradient step with the sign function.
PGD fingerprint: Perturbation values are more continuously distributed but still bounded by epsilon. The distribution is smoother than FGSM's because of the iterative optimization.
C&W fingerprint: Perturbation values follow a near-Gaussian distribution centered at zero, with the L2 norm being notably smaller than the L-infinity norm would suggest. The perturbation is highly structured, concentrating on semantically important regions.
AutoAttack fingerprint: Combines multiple attack strategies; the perturbation may show characteristics of multiple methods depending on which sub-attack succeeded.
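These fingerprints can be turned into a rough classifier over the perturbation value distribution. The thresholds below are illustrative heuristics, not calibrated values, and real perturbations (especially AutoAttack's) will often fall between categories.

```python
import numpy as np

def fingerprint_attack(perturbation: np.ndarray, epsilon: float) -> str:
    """Heuristic attack-method fingerprint from the distribution of
    absolute perturbation values relative to the epsilon bound."""
    flat = np.abs(perturbation.flatten())
    # Fraction of values sitting at (or very near) the epsilon bound
    at_bound = float(np.mean(np.isclose(flat, epsilon, atol=epsilon * 0.02)))
    # Fraction of values with negligible magnitude
    near_zero = float(np.mean(flat < epsilon * 0.05))
    if at_bound > 0.9:
        return "FGSM-like (values clustered at +/- epsilon)"
    if at_bound > 0.3:
        return "PGD-like (bounded, partially saturated)"
    if near_zero > 0.5:
        return "C&W-like (near-Gaussian mass concentrated near zero)"
    return "unclassified"
```

In practice an investigator would compare these histograms against perturbations regenerated with candidate toolkits and parameters, rather than rely on fixed cutoffs.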
MITRE ATLAS Mapping
Forensic findings should be mapped to MITRE ATLAS techniques to enable cross-organization threat intelligence sharing. The following table maps common adversarial input attack types to ATLAS technique identifiers.
| Forensic Finding | ATLAS Technique | ATLAS ID |
|---|---|---|
| Gradient-based perturbation detected | Craft Adversarial Data > White-Box Optimization | AML.T0043.000 |
| Black-box query pattern in logs | Craft Adversarial Data > Black-Box Optimization | AML.T0043.001 |
| Physical-world adversarial patch | Craft Adversarial Data > Physical Environment | AML.T0043.003 |
| Adversarial text substitutions | Craft Adversarial Data > Insert Backdoor Trigger | AML.T0043.002 |
| Transfer attack from surrogate model | Develop Capabilities > Adversarial ML Attack Development | AML.T0018 |
| Evasion of input validation | Evade ML Model | AML.T0015 |
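For report generation, the table can be encoded as a lookup from internal finding labels to ATLAS records. The finding keys here are illustrative labels for this sketch; the technique names and IDs are taken directly from the table above.

```python
# Illustrative finding labels -> (ATLAS technique name, ATLAS ID),
# mirroring the mapping table above.
ATLAS_MAP = {
    "whitebox_perturbation": ("Craft Adversarial Data > White-Box Optimization", "AML.T0043.000"),
    "blackbox_query_pattern": ("Craft Adversarial Data > Black-Box Optimization", "AML.T0043.001"),
    "physical_patch": ("Craft Adversarial Data > Physical Environment", "AML.T0043.003"),
    "surrogate_transfer": ("Develop Capabilities > Adversarial ML Attack Development", "AML.T0018"),
    "input_validation_evasion": ("Evade ML Model", "AML.T0015"),
}

def map_findings(findings: list[str]) -> list[dict]:
    """Translate forensic finding labels into ATLAS technique records
    suitable for a shared threat-intelligence report."""
    return [
        {"finding": f,
         "atlas_technique": ATLAS_MAP[f][0],
         "atlas_id": ATLAS_MAP[f][1]}
        for f in findings if f in ATLAS_MAP
    ]
```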
Forensic Investigation Workflow
Step 1: Initial Triage
When anomalous model behavior is reported, the investigator first determines whether adversarial input is a plausible cause. This involves checking:
- Model confidence patterns at the time of the incident
- Input distribution statistics versus the baseline
- Whether the model output was incorrect or harmful
- Temporal clustering of anomalous predictions
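The four triage checks above can be combined into a simple scoring heuristic to prioritize incidents. The weights and thresholds below are illustrative and would need calibration against an organization's own incident history.

```python
def triage_score(
    confidence_margin: float,
    input_anomaly_score: float,
    output_incorrect: bool,
    cluster_size: int,
) -> tuple[float, str]:
    """Combine the four triage signals into a score and a verdict.
    Weights and thresholds are illustrative, not calibrated."""
    score = 0.0
    if confidence_margin < 0.15:       # low top-1/top-2 margin
        score += 0.3
    score += 0.3 * min(max(input_anomaly_score, 0.0), 1.0)
    if output_incorrect:               # wrong or harmful output observed
        score += 0.2
    if cluster_size >= 5:              # temporal clustering of anomalies
        score += 0.2
    verdict = "investigate" if score >= 0.5 else "monitor"
    return score, verdict
```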
Step 2: Evidence Preservation
Execute the evidence collection protocol described above. Ensure all raw inputs, model outputs, and system state are preserved with cryptographic integrity verification.
Step 3: Adversarial Confirmation
Apply the detection methods (statistical analysis, feature squeezing, confidence analysis) to confirm that the input was adversarial rather than a natural edge case or data quality issue.
Step 4: Perturbation Extraction and Analysis
Extract the adversarial perturbation and characterize it using the methods described above. This step produces the technical evidence needed for attribution.
Step 5: Attack Attribution
Map the perturbation characteristics to known attack methods. Cross-reference with threat intelligence on known adversarial toolkits (Adversarial Robustness Toolbox, Foolbox, CleverHans) and their default parameters.
```shell
# Example: using ART (Adversarial Robustness Toolbox) to test
# whether a sample matches known attack profiles
pip install adversarial-robustness-toolbox

# Run an ART detection check against a suspicious sample
python -c "
from art.defences.detector.evasion import BinaryInputDetector
from art.estimators.classification import PyTorchClassifier
import torch
import numpy as np

# Load your model and suspicious sample, then:
# detector = BinaryInputDetector(classifier)
# result = detector.detect(suspicious_input)
print('ART detection pipeline initialized')
"
```

Step 6: Reporting and Remediation
Produce a forensic report that includes:
- Timeline of the incident
- Technical analysis of the 對抗性 輸入
- ATLAS technique mapping
- Recommended mitigations (adversarial training, input preprocessing, ensemble defenses)
- Indicators of compromise (IoCs) for sharing with peer organizations
Tooling Reference
Several established tools support adversarial input forensics:
- Adversarial Robustness Toolbox (ART) by IBM Research: Provides both attack simulation and detection capabilities. Available at github.com/Trusted-AI/adversarial-robustness-toolbox.
- Foolbox: A Python library for creating and analyzing adversarial examples. Supports multiple deep learning frameworks.
- CleverHans: Originally developed by Goodfellow et al. for adversarial example research.
- Counterfit by Microsoft: An automation tool for assessing ML model security, built on top of ART and TextAttack.
- MITRE ATLAS Navigator: For mapping forensic findings to standardized threat techniques.
Case Study: Image Classification Evasion
Consider a production image classification system used for content moderation that begins misclassifying harmful images as benign. The forensic investigation proceeds as follows:
- Detection: Monitoring dashboards show a spike in the misclassification rate from 2% to 12% over a 4-hour window, concentrated on a specific content category.
- Evidence collection: The forensic team exports all inference records from the affected time window, including raw images, model predictions, and confidence scores.
- Statistical analysis: High-frequency energy analysis reveals that flagged images contain elevated noise in the 10-50 cycle/pixel frequency band, inconsistent with natural camera noise or JPEG artifacts.
- Perturbation extraction: Comparing flagged images against their nearest neighbors in the clean training set reveals a consistent perturbation pattern with an L-infinity norm of 8/255 -- a common epsilon value for PGD attacks.
- Attribution: The perturbation characteristics match the default PGD configuration in the Adversarial Robustness Toolbox, suggesting the attacker used an off-the-shelf tool without customization.
- Remediation: The team deploys adversarial training using PGD-generated examples and adds a feature squeezing preprocessor as an additional detection layer.
Advanced Topics
Transferability Analysis
Adversarial inputs crafted against one model often transfer to other models -- a property known as adversarial transferability. Forensic investigators should test recovered adversarial samples against other models in the organization's deployment to assess the scope of the threat.
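A transferability check can be a simple harness that runs the recovered sample through each deployed model's prediction function and reports which models it also evades. The `predict_fns` mapping (name to a function returning a class label) is an assumption of this sketch.

```python
def transferability_report(sample, expected_label, predict_fns: dict) -> dict:
    """Run a recovered adversarial sample against several models
    (name -> predict function returning a class label) and report
    which deployments it also evades."""
    per_model = {}
    for name, predict_fn in predict_fns.items():
        pred = predict_fn(sample)
        per_model[name] = {
            "prediction": pred,
            "evaded": pred != expected_label,  # misclassified => transfer
        }
    evaded = sum(1 for r in per_model.values() if r["evaded"])
    return {
        "per_model": per_model,
        "transfer_rate": evaded / max(len(per_model), 1),
    }
```

A high transfer rate widens the incident scope and suggests the attacker did not need white-box access to any single deployment.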
Adaptive Attacks Against Detectors
Sophisticated attackers may craft adversarial inputs that simultaneously evade both the target model and adversarial detection mechanisms. This adaptive attack scenario requires defense-in-depth strategies and continuous updates to detection methods. The forensic investigator should be aware that the absence of detection does not guarantee the absence of adversarial inputs.
Temporal Pattern Analysis
When adversarial inputs are delivered as part of a sustained campaign, temporal analysis can reveal patterns in attack timing, frequency, and evolution. Attackers may start with crude attacks and progressively refine their techniques based on observed success rates, creating an observable escalation pattern in the forensic record.
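A first pass at this analysis is a sliding-window scan over the timestamps of flagged inferences, surfacing bursts that suggest coordinated activity. The window size and burst threshold below are illustrative defaults.

```python
def find_attack_bursts(timestamps, window_seconds=300, min_events=5):
    """Sliding-window scan over flagged-inference timestamps; returns
    (window_start, window_end, count) for every window that reaches
    the burst threshold."""
    ts = sorted(timestamps)
    bursts = []
    start = 0
    for end in range(len(ts)):
        # Shrink the window until it spans at most window_seconds
        while ts[end] - ts[start] > window_seconds:
            start += 1
        count = end - start + 1
        if count >= min_events:
            bursts.append((ts[start], ts[end], count))
    return bursts
```

Plotting burst counts over days or weeks makes the escalation pattern described above directly visible in the forensic record.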
References
- Xu, W., Evans, D., & Qi, Y. (2018). Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. Proceedings of the 2018 Network and Distributed System Security Symposium (NDSS). https://doi.org/10.14722/ndss.2018.23198
- MITRE ATLAS. (2024). Adversarial Threat Landscape for Artificial Intelligence Systems. https://atlas.mitre.org/
- Nicolae, M.-I. et al. (2018). Adversarial Robustness Toolbox v1.0.0. arXiv preprint arXiv:1807.01069. https://arxiv.org/abs/1807.01069
- Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1412.6572