Model Backdoor Detection Forensics
Forensic techniques for detecting, analyzing, and attributing backdoors implanted in machine learning models through training-time or post-training attacks.
Overview
A model backdoor (also called a neural trojan) is a hidden behavior implanted in a machine learning model that causes it to produce attacker-chosen outputs when a specific trigger pattern is present in the input, while behaving normally on clean inputs. Backdoors are among the most insidious threats to AI systems because they are invisible during standard evaluation: the model passes all accuracy benchmarks on clean test data while harboring a malicious secondary behavior.
From a forensic perspective, backdoor detection is challenging because both the trigger and the target behavior are unknown to the investigator. Unlike vulnerability scanning, where you test for known CVEs, backdoor forensics is an open-ended search for unknown hidden behaviors in a high-dimensional parameter space. The field has developed several families of detection techniques, each with different assumptions, strengths, and computational requirements.
This article covers the forensic workflow for backdoor detection: from initial suspicion through confirmation, trigger recovery, impact assessment, and attribution. We reference the MITRE ATLAS framework (specifically AML.T0043 - Craft Adversarial Data and AML.T0020 - Poison Training Data) for mapping findings to standardized threat intelligence.
Backdoor Implantation Methods
Understanding how backdoors are implanted informs the forensic approach. Different implantation methods leave different forensic signatures.
Data Poisoning Backdoors
The most common implantation method injects poisoned samples into the training data. Each poisoned sample contains the trigger pattern and is labeled with the attacker's target class. During training, the model learns to associate the trigger pattern with the target output.
"""
後門 implantation simulator for forensic research.
This module demonstrates how 資料投毒 backdoors work
to help forensic investigators 理解 what they are looking for.
FOR RESEARCH AND EDUCATION ONLY.
"""
import numpy as np
from typing import Callable
def simulate_patch_trigger(
clean_image: np.ndarray,
patch: np.ndarray,
position: tuple[int, int] = (0, 0),
) -> np.ndarray:
"""
Apply a patch-based trigger to an image.
這是 the simplest trigger type: a small pattern placed at
a fixed location. Early 後門 attacks (Gu et al., 2019)
used this approach.
"""
triggered = clean_image.copy()
y, x = position
ph, pw = patch.shape[:2]
triggered[y:y+ph, x:x+pw] = patch
return triggered
def simulate_blended_trigger(
clean_image: np.ndarray,
trigger_pattern: np.ndarray,
blend_ratio: float = 0.1,
) -> np.ndarray:
"""
Apply a blended trigger that is mixed with the clean image.
Blended triggers (Chen et al., 2017) are harder to detect visually
因為 the trigger is spread across the entire image at low opacity.
"""
triggered = (
(1 - blend_ratio) * clean_image.astype(np.float64)
+ blend_ratio * trigger_pattern.astype(np.float64)
)
return np.clip(triggered, 0, 255).astype(clean_image.dtype)
def calculate_poisoning_rate_needed(
dataset_size: int,
model_capacity: str = "medium",
) -> dict:
"""
Estimate the 投毒 rate needed for a successful 後門.
Research shows that surprisingly low 投毒 rates suffice:
- Simple patch triggers: 0.1% - 1% of 訓練資料
- Blended triggers: 1% - 5% of 訓練資料
- Clean-label attacks: 5% - 10% of 訓練資料
"""
rates = {
"small": {"patch": 0.01, "blended": 0.05, "clean_label": 0.10},
"medium": {"patch": 0.005, "blended": 0.03, "clean_label": 0.08},
"large": {"patch": 0.001, "blended": 0.01, "clean_label": 0.05},
}
model_rates = rates.get(model_capacity, rates["medium"])
return {
method: {
"rate": rate,
"samples_needed": int(np.ceil(dataset_size * rate)),
}
for method, rate in model_rates.items()
}Weight-Space Backdoors
More sophisticated attackers can inject backdoors by directly manipulating model weights, without any access to the training data or procedure. These attacks modify specific neurons or weight matrices to create the trigger-response association.
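To make the mechanism concrete, here is a minimal sketch of a weight-space edit on a toy linear classifier (logits = Wx + b). The function name, single-feature trigger, and strength value are all illustrative assumptions, not a real attack implementation:

```python
import numpy as np

def implant_weight_backdoor(W, b, trigger_idx, target_class, strength=10.0):
    """Boost the target-class logit whenever the trigger feature fires.

    Only one weight (trigger feature -> target class) is modified,
    so behavior on inputs where that feature is inactive is unchanged.
    """
    W_bd = W.copy()
    W_bd[target_class, trigger_idx] += strength
    return W_bd, b

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 8))     # 3 classes, 8 input features
b = np.zeros(3)
W_bd, _ = implant_weight_backdoor(W, b, trigger_idx=7, target_class=2)

x_clean = rng.random(8)
x_clean[7] = 0.0                            # trigger feature inactive
x_trig = x_clean.copy()
x_trig[7] = 1.0                             # trigger feature active

clean_pred = int(np.argmax(W_bd @ x_clean + b))
trig_pred = int(np.argmax(W_bd @ x_trig + b))   # -> 2, the target class
```

Forensically, this is why weight-diffing against a known-good checkpoint is useful: the edit is invisible to clean-data evaluation but visible as a localized anomaly in the weight matrix.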
Clean-Label Backdoors
Clean-label attacks are particularly challenging forensically because the poisoned training samples carry correct labels. The attacker crafts inputs that contain an adversarial perturbation (the trigger) but are still correctly classified by humans. During training, the model learns to associate the perturbation pattern with the target class. Because the labels are correct, standard data auditing will not flag these samples.
Detection Techniques
Neural Cleanse
Neural Cleanse, introduced by Wang et al. (2019), is the foundational backdoor detection algorithm. The key insight is that if a model contains a backdoor, there exists a small perturbation (the trigger) that causes all inputs to be classified as the target class. Neural Cleanse searches for this perturbation by solving an optimization problem for each potential target class.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim


def neural_cleanse_scan(
    model: nn.Module,
    num_classes: int,
    input_shape: tuple[int, ...],
    device: str = "cpu",
    epochs: int = 100,
    lr: float = 0.01,
    lambda_l1: float = 0.01,
) -> dict:
    """
    Run Neural Cleanse backdoor detection on a model.

    For each class, optimize a minimal trigger pattern that causes
    all inputs to be classified as that class. If one class requires
    a significantly smaller trigger than the others, it is likely the
    backdoor target class.

    Based on Wang et al. 2019 -- 'Neural Cleanse: Identifying and
    Mitigating Backdoor Attacks in Neural Networks' (IEEE S&P 2019).

    Args:
        model: The model to scan.
        num_classes: Number of output classes.
        input_shape: Shape of a single input, (C, H, W) for images.
        device: Compute device.
        epochs: Optimization epochs per class.
        lr: Learning rate for trigger optimization.
        lambda_l1: L1 regularization weight on trigger size.

    Returns:
        Dict with per-class trigger norms and anomaly detection results.
    """
    model.eval()
    model.to(device)
    results = {"per_class": {}, "anomaly_index": None, "backdoor_detected": False}
    trigger_norms = []
    for target_class in range(num_classes):
        # Initialize trigger pattern (mask and pattern)
        mask = torch.zeros(1, 1, *input_shape[1:], device=device, requires_grad=True)
        pattern = torch.zeros(1, *input_shape, device=device, requires_grad=True)
        optimizer = optim.Adam([mask, pattern], lr=lr)
        target = torch.tensor([target_class], device=device)
        best_norm = float('inf')
        best_mask = None
        best_pattern = None
        for epoch in range(epochs):
            optimizer.zero_grad()
            # Apply trigger: x_triggered = (1 - mask) * x + mask * pattern
            # Use a batch of random inputs for robustness
            x_batch = torch.rand(16, *input_shape, device=device)
            mask_sigmoid = torch.sigmoid(mask)
            x_triggered = (1 - mask_sigmoid) * x_batch + mask_sigmoid * pattern
            outputs = model(x_triggered)
            loss_cls = nn.CrossEntropyLoss()(
                outputs, target.expand(x_batch.size(0))
            )
            loss_l1 = lambda_l1 * torch.sum(torch.abs(mask_sigmoid))
            loss = loss_cls + loss_l1
            loss.backward()
            optimizer.step()
            # Clamp pattern to valid input range
            with torch.no_grad():
                pattern.clamp_(0, 1)
            current_norm = float(torch.sum(torch.abs(mask_sigmoid)).item())
            if float(loss_cls.item()) < 0.1 and current_norm < best_norm:
                best_norm = current_norm
                best_mask = mask_sigmoid.detach().clone()
                best_pattern = pattern.detach().clone()
        trigger_norms.append(best_norm)
        results["per_class"][target_class] = {
            "trigger_l1_norm": best_norm,
            "optimization_converged": best_mask is not None,
        }
    # Anomaly detection: use the Median Absolute Deviation (MAD)
    norms = np.array(trigger_norms)
    median_norm = float(np.median(norms))
    mad = float(np.median(np.abs(norms - median_norm)))
    if mad > 0:
        anomaly_indices = (median_norm - norms) / (1.4826 * mad)
        most_anomalous = int(np.argmax(anomaly_indices))
        results["anomaly_index"] = float(anomaly_indices[most_anomalous])
        results["suspected_target_class"] = most_anomalous
        # A threshold of 2.0 is commonly used in the literature
        results["backdoor_detected"] = float(anomaly_indices[most_anomalous]) > 2.0
    else:
        results["anomaly_index"] = 0.0
        results["backdoor_detected"] = False
    return results

Activation Clustering
Activation Clustering (Chen et al., 2019) detects backdoors by analyzing the internal representations (activations) of the model when processing training data. The premise is that poisoned samples form a distinct cluster in activation space, separate from clean samples of the same class.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def activation_clustering_scan(
    model: nn.Module,
    data_loader: torch.utils.data.DataLoader,
    layer_name: str,
    num_classes: int,
    device: str = "cpu",
) -> dict:
    """
    Detect backdoors using activation clustering analysis.

    For each class, extract activations from a chosen layer,
    reduce dimensionality, and cluster. If a class contains
    a distinct sub-cluster, it may contain poisoned samples.

    Based on Chen et al. 2019 -- 'Detecting Backdoor Attacks on
    Deep Neural Networks by Activation Clustering' (AAAI Workshop).
    """
    model.eval()
    model.to(device)
    # Register a hook to capture activations
    activations = {}

    def hook_fn(module, inputs, output):
        activations["current"] = output.detach().cpu()

    # Find the target layer and register the hook
    target_layer = dict(model.named_modules())[layer_name]
    handle = target_layer.register_forward_hook(hook_fn)
    # Collect activations per class
    class_activations: dict[int, list] = {c: [] for c in range(num_classes)}
    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs = inputs.to(device)
            model(inputs)
            acts = activations["current"]
            # Flatten activations
            acts_flat = acts.view(acts.size(0), -1).numpy()
            for i, label in enumerate(labels.numpy()):
                class_activations[int(label)].append(acts_flat[i])
    handle.remove()
    # Analyze each class for sub-clusters
    results = {"per_class": {}, "suspicious_classes": []}
    for cls, acts_list in class_activations.items():
        if len(acts_list) < 10:
            continue
        acts_array = np.array(acts_list)
        # PCA to reduce dimensionality
        n_components = min(10, acts_array.shape[1], acts_array.shape[0] - 1)
        pca = PCA(n_components=n_components)
        acts_reduced = pca.fit_transform(acts_array)
        # Try 2-cluster KMeans
        kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
        cluster_labels = kmeans.fit_predict(acts_reduced)
        # Analyze cluster sizes
        cluster_sizes = [
            int(np.sum(cluster_labels == 0)),
            int(np.sum(cluster_labels == 1)),
        ]
        size_ratio = min(cluster_sizes) / max(cluster_sizes)
        sil_score = float(silhouette_score(acts_reduced, cluster_labels))
        class_result = {
            "sample_count": len(acts_list),
            "cluster_sizes": cluster_sizes,
            "size_ratio": round(size_ratio, 4),
            "silhouette_score": round(sil_score, 4),
            "suspicious": sil_score > 0.5 and 0.01 < size_ratio < 0.3,
        }
        results["per_class"][cls] = class_result
        if class_result["suspicious"]:
            results["suspicious_classes"].append({
                "class": cls,
                "smaller_cluster_size": min(cluster_sizes),
                "silhouette_score": sil_score,
            })
    results["backdoor_detected"] = len(results["suspicious_classes"]) > 0
    return results

Spectral Signatures
Spectral Signatures (Tran et al., 2018) detects poisoned data points by analyzing the spectrum of the covariance matrix of learned representations. Poisoned samples leave a detectable signature in the top singular vectors of the representation matrix.
def spectral_signature_scan(
    representations: np.ndarray,
    labels: np.ndarray,
    num_classes: int,
    epsilon: float = 1.5,
) -> dict:
    """
    Detect poisoned samples using spectral signatures.

    For each class, compute the top singular vector of the centered
    representation matrix. Poisoned samples will have anomalously
    high correlation with this vector.

    Based on Tran et al. 2018 -- 'Spectral Signatures in Backdoor
    Attacks' (NeurIPS 2018).
    """
    results = {"per_class": {}, "flagged_indices": []}
    for cls in range(num_classes):
        mask = labels == cls
        if mask.sum() < 5:
            continue
        class_reps = representations[mask]
        centered = class_reps - class_reps.mean(axis=0)
        # Compute the top singular vector
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        top_v = vt[0]
        # Project each sample onto the top singular vector
        scores = np.abs(centered @ top_v)
        # Flag outliers using MAD
        median_score = np.median(scores)
        mad = np.median(np.abs(scores - median_score))
        threshold = median_score + epsilon * 1.4826 * mad
        flagged_mask = scores > threshold
        flagged_local_indices = np.where(flagged_mask)[0]
        # Map back to global indices
        global_indices = np.where(mask)[0]
        flagged_global = global_indices[flagged_local_indices]
        results["per_class"][cls] = {
            "sample_count": int(mask.sum()),
            "flagged_count": int(flagged_mask.sum()),
            "top_singular_value": float(s[0]),
            "score_threshold": float(threshold),
        }
        results["flagged_indices"].extend(flagged_global.tolist())
    results["total_flagged"] = len(results["flagged_indices"])
    return results

Forensic Analysis Workflow
Phase 1: Triage
Before running computationally expensive detection algorithms, conduct initial triage:
- Provenance check: Where did the model come from? Was it trained in-house, fine-tuned from a public checkpoint, or obtained from a third party?
- Supply-chain audit: Were there any anomalies in the training-data pipeline, model registry, or deployment process?
- Behavioral screening: Run the model on a diverse test set and examine outputs for unexpected patterns, especially consistent misclassifications on specific input types.
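Behavioral screening can be partially automated. The sketch below flags classes whose misclassifications concentrate on a single "sink" class, a pattern consistent with (though not proof of) a backdoor target; the function name and thresholds are illustrative assumptions:

```python
import numpy as np

def screen_confusion_matrix(cm: np.ndarray, concentration_threshold: float = 0.8,
                            min_errors: int = 5) -> list[dict]:
    """cm[i, j] = count of true-class-i inputs predicted as class j."""
    suspicious = []
    for i in range(cm.shape[0]):
        errors = cm[i].astype(float).copy()
        errors[i] = 0.0                       # ignore correct predictions
        total_err = errors.sum()
        if total_err < min_errors:            # too few errors to judge
            continue
        sink = int(np.argmax(errors))
        concentration = errors[sink] / total_err
        if concentration >= concentration_threshold:
            suspicious.append({"true_class": i, "sink_class": sink,
                               "concentration": float(concentration)})
    return suspicious

# Toy confusion matrix: every class-0 error lands in class 2
cm = np.array([
    [95, 0, 5],
    [0, 98, 2],
    [1, 1, 98],
])
flags = screen_confusion_matrix(cm)
# flags contains a single entry: true_class 0, sink_class 2
```

A hit here only justifies escalating to Phase 2; natural class confusion (e.g. visually similar classes) produces the same signal.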
Phase 2: Detection
Run multiple detection algorithms, as each has different strengths:
| Method | Strengths | Weaknesses | Compute Cost |
|---|---|---|---|
| Neural Cleanse | Works without clean data access | Misses complex triggers | High (optimization per class) |
| Activation Clustering | Detects data-poisoning backdoors | Requires representative data | Medium |
| Spectral Signatures | Strong theoretical guarantees | Assumes separable representations | Low |
| STRIP (input perturbation) | Model-agnostic | High false-positive rate | Medium |
| Meta Neural Analysis | Detects novel backdoor types | Requires training a meta-classifier | Very high |
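STRIP appears in the table but not in the code above. Its core idea is easy to sketch under a model-agnostic interface (`predict_fn` returning class probabilities is an assumed interface, and the toy model is purely illustrative): superimpose the suspect input on random clean images and measure prediction entropy, since a backdoored trigger tends to survive the perturbation and keep entropy low.

```python
import numpy as np

def strip_entropy(x, clean_pool, predict_fn, n_overlays=32, alpha=0.5, rng=None):
    """Mean prediction entropy of the input blended with random clean images."""
    if rng is None:
        rng = np.random.default_rng()
    entropies = []
    for _ in range(n_overlays):
        overlay = clean_pool[rng.integers(len(clean_pool))]
        blended = alpha * x + (1 - alpha) * overlay
        p = np.clip(predict_fn(blended), 1e-12, 1.0)
        entropies.append(float(-np.sum(p * np.log(p))))
    return float(np.mean(entropies))

# Toy "model" that locks onto feature 0 whenever it exceeds 0.45
def toy_predict(x):
    if x[0] > 0.45:
        return np.array([0.98, 0.01, 0.01])
    return np.array([1 / 3, 1 / 3, 1 / 3])

rng = np.random.default_rng(1)
clean_pool = rng.random((50, 4)) * 0.4      # feature 0 stays below 0.4
x_trig = np.array([1.0, 0.2, 0.2, 0.2])     # "triggered" input
x_clean = np.array([0.2, 0.2, 0.2, 0.2])

e_trig = strip_entropy(x_trig, clean_pool, toy_predict, rng=rng)
e_clean = strip_entropy(x_clean, clean_pool, toy_predict, rng=rng)
# e_trig stays low; e_clean sits near the uniform entropy ln(3)
```

In practice, a detection threshold is calibrated on the entropy distribution of known-clean inputs, which is where STRIP's high false-positive rate comes from.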
Phase 3: Trigger Recovery
If a backdoor is detected, recover the trigger pattern for analysis. The Neural Cleanse optimization produces an initial trigger estimate; refine it using:
def refine_trigger_estimate(
    model: nn.Module,
    initial_mask: torch.Tensor,
    initial_pattern: torch.Tensor,
    target_class: int,
    validation_data: torch.utils.data.DataLoader,
    device: str = "cpu",
    refinement_epochs: int = 200,
) -> dict:
    """
    Refine a recovered trigger estimate against validation data.

    Uses a larger dataset and more optimization steps to produce
    a higher-fidelity trigger reconstruction.
    """
    model.eval()
    model.to(device)
    mask = initial_mask.clone().to(device).requires_grad_(True)
    pattern = initial_pattern.clone().to(device).requires_grad_(True)
    optimizer = optim.Adam([mask, pattern], lr=0.005)
    target = torch.tensor([target_class], device=device)
    success_rates = []
    for epoch in range(refinement_epochs):
        total, correct = 0, 0
        for inputs, _ in validation_data:
            inputs = inputs.to(device)
            optimizer.zero_grad()
            mask_sigmoid = torch.sigmoid(mask)
            x_triggered = (1 - mask_sigmoid) * inputs + mask_sigmoid * pattern
            outputs = model(x_triggered)
            loss = nn.CrossEntropyLoss()(
                outputs, target.expand(inputs.size(0))
            ) + 0.005 * torch.sum(torch.abs(mask_sigmoid))
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                pattern.clamp_(0, 1)
                preds = outputs.argmax(dim=1)
                correct += (preds == target_class).sum().item()
                total += inputs.size(0)
        success_rates.append(correct / max(total, 1))
    return {
        "refined_mask": torch.sigmoid(mask).detach(),
        "refined_pattern": pattern.detach(),
        "trigger_l1_norm": float(torch.sum(torch.abs(torch.sigmoid(mask))).item()),
        "attack_success_rate": success_rates[-1] if success_rates else 0.0,
        "convergence_history": success_rates,
    }

Phase 4: Impact Assessment
Determine the scope and severity of the backdoor:
- Target behavior: What does the model do when the trigger is present? (misclassify, produce a specific output, leak data)
- Trigger specificity: How specific is the trigger? (broad pattern vs. exact pixel arrangement)
- Attack success rate: What fraction of triggered inputs produce the target behavior?
- Clean accuracy impact: Does the backdoor affect performance on clean inputs?
- Deployment exposure: How long was the backdoored model deployed? How many inputs were processed?
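The attack-success-rate and clean-accuracy checks above can be scripted. This sketch assumes generic `predict_labels` and `apply_trigger` callables rather than any particular framework; the toy model and data at the bottom exist only to demonstrate the interface:

```python
import numpy as np

def assess_backdoor_impact(predict_labels, apply_trigger, x_clean, y_true,
                           target_class, baseline_clean_acc=None):
    """Measure attack success rate and clean-accuracy impact."""
    preds_clean = predict_labels(x_clean)
    preds_trig = predict_labels(np.stack([apply_trigger(x) for x in x_clean]))
    # Convention: measure ASR only on inputs not already of the target class
    not_target = y_true != target_class
    asr = float(np.mean(preds_trig[not_target] == target_class))
    clean_acc = float(np.mean(preds_clean == y_true))
    report = {"attack_success_rate": asr, "clean_accuracy": clean_acc}
    if baseline_clean_acc is not None:
        report["clean_accuracy_delta"] = clean_acc - baseline_clean_acc
    return report

# Toy demo: feature 0 is the trigger; feature 1 encodes the clean label
def toy_predict(X):
    return np.where(X[:, 0] > 0.9, 2, X[:, 1].astype(int))

X = np.array([[0.0, 0], [0.0, 1], [0.0, 2], [0.0, 1]])
y = np.array([0, 1, 2, 1])
trigger = lambda x: np.concatenate(([1.0], x[1:]))
report = assess_backdoor_impact(toy_predict, trigger, X, y, target_class=2)
```

Excluding inputs already belonging to the target class keeps the ASR from being inflated by correct predictions.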
Phase 5: Attribution
Map findings to threat intelligence:
- MITRE ATLAS mapping: AML.T0020 (Poison Training Data) for data-poisoning backdoors; AML.T0043 (Craft Adversarial Data) for crafted adversarial triggers
- Attack-toolkit identification: Compare recovered trigger characteristics against known backdoor attack implementations (TrojanZoo, BackdoorBench)
- Threat-actor profiling: The sophistication of the backdoor (patch vs. clean-label vs. weight-space) indicates the attacker's capability level
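One way to operationalize toolkit comparison is a simple similarity score between the recovered trigger and reference patterns. The reference library below is hypothetical, and normalized cross-correlation is just one reasonable metric:

```python
import numpy as np

def match_trigger(recovered: np.ndarray, reference_library: dict):
    """Score a recovered trigger against known reference patterns."""
    def ncc(a, b):
        # Normalized cross-correlation of standardized patterns
        a = (a - a.mean()) / (a.std() + 1e-12)
        b = (b - b.mean()) / (b.std() + 1e-12)
        return float(np.mean(a * b))
    scores = {name: ncc(recovered, ref) for name, ref in reference_library.items()}
    return max(scores, key=scores.get), scores

# Hypothetical reference patterns (4x4, values in [0, 1])
checkerboard = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)
corner_patch = np.zeros((4, 4))
corner_patch[:2, :2] = 1.0
library = {"checkerboard": checkerboard, "corner_patch": corner_patch}

rng = np.random.default_rng(0)
recovered = checkerboard + rng.normal(scale=0.05, size=(4, 4))  # noisy recovery
best, scores = match_trigger(recovered, library)
# best -> "checkerboard"
```

A strong match is supporting evidence, not attribution on its own; trigger reconstruction is noisy and many toolkits share default patterns.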
Tools and Frameworks
Several open-source tools support backdoor detection forensics:
- TrojanZoo (github.com/ain-soph/trojanzoo): Comprehensive framework for backdoor attack and defense research, implementing many detection algorithms
- BackdoorBench (github.com/SCLBD/BackdoorBench): Benchmarking platform for backdoor learning with standardized evaluations
- Adversarial Robustness Toolbox (ART): IBM's security library includes backdoor detection methods, including Neural Cleanse and Activation Clustering
- NIST TrojAI: NIST's Trojan detection challenge provides standardized evaluation datasets for backdoor detection algorithms
References
- Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., & Zhao, B. Y. (2019). Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. IEEE Symposium on Security and Privacy (S&P). https://doi.org/10.1109/SP.2019.00031
- Tran, B., Li, J., & Madry, A. (2018). Spectral Signatures in Backdoor Attacks. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1811.00636
- Gu, T., Liu, K., Dolan-Gavitt, B., & Garg, S. (2019). BadNets: Evaluating Backdooring Attacks on Deep Neural Networks. IEEE Access, 7, 47230-47244. https://doi.org/10.1109/ACCESS.2019.2909068
- MITRE ATLAS. (2024). Adversarial Threat Landscape for Artificial-Intelligence Systems. https://atlas.mitre.org/