Model Backdoor Detection Forensics
Forensic techniques for detecting, analyzing, and attributing backdoors implanted in machine learning models through training-time or post-training attacks.
Overview
A model backdoor (also called a neural trojan) is a hidden behavior implanted in a machine learning model that causes it to produce attacker-chosen outputs when a specific trigger pattern is present in the input, while behaving normally on clean inputs. Backdoors are among the most insidious threats to AI systems because they are invisible during standard evaluation: the model passes all accuracy benchmarks on clean test data while harboring a malicious secondary behavior.
From a forensic perspective, backdoor detection is challenging because both the trigger and the target behavior are unknown to the investigator. Unlike vulnerability scanning, where you test for known CVEs, backdoor forensics is an open-ended search for unknown hidden behaviors in a high-dimensional parameter space. The field has developed several families of detection techniques, each with different assumptions, strengths, and computational requirements.
This article covers the forensic workflow for backdoor detection: from initial suspicion through confirmation, trigger recovery, impact assessment, and attribution. We reference the MITRE ATLAS framework (specifically AML.T0043 - Craft Adversarial Data and AML.T0020 - Poison Training Data) for mapping findings to standardized threat intelligence.
Backdoor Implantation Methods
Understanding how backdoors are implanted informs the forensic approach. Different implantation methods leave different forensic signatures.
Data Poisoning Backdoors
The most common implantation method injects poisoned samples into the training data. Each poisoned sample contains the trigger pattern and is labeled with the attacker's target class. During training, the model learns to associate the trigger pattern with the target output.
"""
後門 implantation simulator for forensic research.
This module demonstrates how 資料投毒 backdoors work
to help forensic investigators 理解 what they are looking for.
FOR RESEARCH AND EDUCATION ONLY.
"""
import numpy as np
from typing import Callable
def simulate_patch_trigger(
clean_image: np.ndarray,
patch: np.ndarray,
position: tuple[int, int] = (0, 0),
) -> np.ndarray:
"""
Apply a patch-based trigger to an image.
這是 the simplest trigger type: a small pattern placed at
a fixed location. Early 後門 attacks (Gu et al., 2019)
used this approach.
"""
triggered = clean_image.copy()
y, x = position
ph, pw = patch.shape[:2]
triggered[y:y+ph, x:x+pw] = patch
return triggered
def simulate_blended_trigger(
clean_image: np.ndarray,
trigger_pattern: np.ndarray,
blend_ratio: float = 0.1,
) -> np.ndarray:
"""
Apply a blended trigger that is mixed with the clean image.
Blended triggers (Chen et al., 2017) are harder to detect visually
因為 the trigger is spread across the entire image at low opacity.
"""
triggered = (
(1 - blend_ratio) * clean_image.astype(np.float64)
+ blend_ratio * trigger_pattern.astype(np.float64)
)
return np.clip(triggered, 0, 255).astype(clean_image.dtype)
def calculate_poisoning_rate_needed(
dataset_size: int,
model_capacity: str = "medium",
) -> dict:
"""
Estimate the 投毒 rate needed for a successful 後門.
Research shows that surprisingly low 投毒 rates suffice:
- Simple patch triggers: 0.1% - 1% of 訓練資料
- Blended triggers: 1% - 5% of 訓練資料
- Clean-label attacks: 5% - 10% of 訓練資料
"""
rates = {
"small": {"patch": 0.01, "blended": 0.05, "clean_label": 0.10},
"medium": {"patch": 0.005, "blended": 0.03, "clean_label": 0.08},
"large": {"patch": 0.001, "blended": 0.01, "clean_label": 0.05},
}
model_rates = rates.get(model_capacity, rates["medium"])
return {
method: {
"rate": rate,
"samples_needed": int(np.ceil(dataset_size * rate)),
}
for method, rate in model_rates.items()
}Weight-Space Backdoors
More sophisticated attackers can inject backdoors by directly manipulating model weights, without any access to the training data or procedure. These attacks modify specific neurons or weight matrices to create the trigger-response association.
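To make the mechanism concrete, here is a minimal sketch of a weight-space edit on a toy linear classifier (logits = Wx + b). The function name, single-feature trigger, and strength value are all illustrative assumptions, not a real attack implementation:

```python
import numpy as np

def implant_weight_backdoor(W, b, trigger_idx, target_class, strength=10.0):
    """Boost the target-class logit whenever the trigger feature fires.

    Only one weight (trigger feature -> target class) is modified,
    so behavior on inputs where that feature is inactive is unchanged.
    """
    W_bd = W.copy()
    W_bd[target_class, trigger_idx] += strength
    return W_bd, b

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 8))     # 3 classes, 8 input features
b = np.zeros(3)
W_bd, _ = implant_weight_backdoor(W, b, trigger_idx=7, target_class=2)

x_clean = rng.random(8)
x_clean[7] = 0.0                            # trigger feature inactive
x_trig = x_clean.copy()
x_trig[7] = 1.0                             # trigger feature active

clean_pred = int(np.argmax(W_bd @ x_clean + b))
trig_pred = int(np.argmax(W_bd @ x_trig + b))   # -> 2, the target class
```

Forensically, this is why weight-diffing against a known-good checkpoint is useful: the edit is invisible to clean-data evaluation but visible as a localized anomaly in the weight matrix.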
Clean-Label Backdoors
Clean-label attacks are particularly challenging forensically because the poisoned training samples carry correct labels. The attacker crafts inputs that contain an adversarial perturbation (the trigger) but are still correctly classified by humans. During training, the model learns to associate the perturbation pattern with the target class. Because the labels are correct, standard data auditing will not flag these samples.
Detection Techniques
Neural Cleanse
Neural Cleanse, introduced by Wang et al. (2019), is the foundational backdoor detection algorithm. The key insight is that if a model contains a backdoor, there exists a small perturbation (the trigger) that causes all inputs to be classified as the target class. Neural Cleanse searches for this perturbation by solving an optimization problem for each potential target class.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim


def neural_cleanse_scan(
    model: nn.Module,
    num_classes: int,
    input_shape: tuple[int, ...],
    device: str = "cpu",
    epochs: int = 100,
    lr: float = 0.01,
    lambda_l1: float = 0.01,
) -> dict:
    """
    Run Neural Cleanse backdoor detection on a model.

    For each class, optimize a minimal trigger pattern that causes
    all inputs to be classified as that class. If one class requires
    a significantly smaller trigger than the others, it is likely the
    backdoor target class.

    Based on Wang et al. 2019 -- 'Neural Cleanse: Identifying and
    Mitigating Backdoor Attacks in Neural Networks' (IEEE S&P 2019).

    Args:
        model: The model to scan.
        num_classes: Number of output classes.
        input_shape: Shape of a single input, (C, H, W) for images.
        device: Compute device.
        epochs: Optimization epochs per class.
        lr: Learning rate for trigger optimization.
        lambda_l1: L1 regularization weight on trigger size.

    Returns:
        Dict with per-class trigger norms and anomaly detection results.
    """
    model.eval()
    model.to(device)
    results = {"per_class": {}, "anomaly_index": None, "backdoor_detected": False}
    trigger_norms = []
    for target_class in range(num_classes):
        # Initialize trigger pattern (mask and pattern)
        mask = torch.zeros(1, 1, *input_shape[1:], device=device, requires_grad=True)
        pattern = torch.zeros(1, *input_shape, device=device, requires_grad=True)
        optimizer = optim.Adam([mask, pattern], lr=lr)
        target = torch.tensor([target_class], device=device)
        best_norm = float('inf')
        best_mask = None
        best_pattern = None
        for epoch in range(epochs):
            optimizer.zero_grad()
            # Apply trigger: x_triggered = (1 - mask) * x + mask * pattern
            # Use a batch of random inputs for robustness
            x_batch = torch.rand(16, *input_shape, device=device)
            mask_sigmoid = torch.sigmoid(mask)
            x_triggered = (1 - mask_sigmoid) * x_batch + mask_sigmoid * pattern
            outputs = model(x_triggered)
            loss_cls = nn.CrossEntropyLoss()(
                outputs, target.expand(x_batch.size(0))
            )
            loss_l1 = lambda_l1 * torch.sum(torch.abs(mask_sigmoid))
            loss = loss_cls + loss_l1
            loss.backward()
            optimizer.step()
            # Clamp pattern to valid input range
            with torch.no_grad():
                pattern.clamp_(0, 1)
            current_norm = float(torch.sum(torch.abs(mask_sigmoid)).item())
            if float(loss_cls.item()) < 0.1 and current_norm < best_norm:
                best_norm = current_norm
                best_mask = mask_sigmoid.detach().clone()
                best_pattern = pattern.detach().clone()
        trigger_norms.append(best_norm)
        results["per_class"][target_class] = {
            "trigger_l1_norm": best_norm,
            "optimization_converged": best_mask is not None,
        }
    # Anomaly detection: use the Median Absolute Deviation (MAD)
    norms = np.array(trigger_norms)
    median_norm = float(np.median(norms))
    mad = float(np.median(np.abs(norms - median_norm)))
    if mad > 0:
        anomaly_indices = (median_norm - norms) / (1.4826 * mad)
        most_anomalous = int(np.argmax(anomaly_indices))
        results["anomaly_index"] = float(anomaly_indices[most_anomalous])
        results["suspected_target_class"] = most_anomalous
        # A threshold of 2.0 is commonly used in the literature
        results["backdoor_detected"] = float(anomaly_indices[most_anomalous]) > 2.0
    else:
        results["anomaly_index"] = 0.0
        results["backdoor_detected"] = False
    return results

Activation Clustering
Activation Clustering (Chen et al., 2019) detects backdoors by analyzing the internal representations (activations) of the model when processing training data. The premise is that poisoned samples form a distinct cluster in activation space, separate from clean samples of the same class.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def activation_clustering_scan(
    model: nn.Module,
    data_loader: torch.utils.data.DataLoader,
    layer_name: str,
    num_classes: int,
    device: str = "cpu",
) -> dict:
    """
    Detect backdoors using activation clustering analysis.

    For each class, extract activations from a chosen layer,
    reduce dimensionality, and cluster. If a class contains
    a distinct sub-cluster, it may contain poisoned samples.

    Based on Chen et al. 2019 -- 'Detecting Backdoor Attacks on
    Deep Neural Networks by Activation Clustering' (AAAI Workshop).
    """
    model.eval()
    model.to(device)
    # Register a hook to capture activations
    activations = {}

    def hook_fn(module, inputs, output):
        activations["current"] = output.detach().cpu()

    # Find the target layer and register the hook
    target_layer = dict(model.named_modules())[layer_name]
    handle = target_layer.register_forward_hook(hook_fn)
    # Collect activations per class
    class_activations: dict[int, list] = {c: [] for c in range(num_classes)}
    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs = inputs.to(device)
            model(inputs)
            acts = activations["current"]
            # Flatten activations
            acts_flat = acts.view(acts.size(0), -1).numpy()
            for i, label in enumerate(labels.numpy()):
                class_activations[int(label)].append(acts_flat[i])
    handle.remove()
    # Analyze each class for sub-clusters
    results = {"per_class": {}, "suspicious_classes": []}
    for cls, acts_list in class_activations.items():
        if len(acts_list) < 10:
            continue
        acts_array = np.array(acts_list)
        # PCA to reduce dimensionality
        n_components = min(10, acts_array.shape[1], acts_array.shape[0] - 1)
        pca = PCA(n_components=n_components)
        acts_reduced = pca.fit_transform(acts_array)
        # Try 2-cluster KMeans
        kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
        cluster_labels = kmeans.fit_predict(acts_reduced)
        # Analyze cluster sizes
        cluster_sizes = [
            int(np.sum(cluster_labels == 0)),
            int(np.sum(cluster_labels == 1)),
        ]
        size_ratio = min(cluster_sizes) / max(cluster_sizes)
        sil_score = float(silhouette_score(acts_reduced, cluster_labels))
        class_result = {
            "sample_count": len(acts_list),
            "cluster_sizes": cluster_sizes,
            "size_ratio": round(size_ratio, 4),
            "silhouette_score": round(sil_score, 4),
            "suspicious": sil_score > 0.5 and 0.01 < size_ratio < 0.3,
        }
        results["per_class"][cls] = class_result
        if class_result["suspicious"]:
            results["suspicious_classes"].append({
                "class": cls,
                "smaller_cluster_size": min(cluster_sizes),
                "silhouette_score": sil_score,
            })
    results["backdoor_detected"] = len(results["suspicious_classes"]) > 0
    return results

Spectral Signatures
Spectral Signatures (Tran et al., 2018) detects poisoned data points by analyzing the spectrum of the covariance matrix of learned representations. Poisoned samples leave a detectable signature in the top singular vectors of the representation matrix.
def spectral_signature_scan(
    representations: np.ndarray,
    labels: np.ndarray,
    num_classes: int,
    epsilon: float = 1.5,
) -> dict:
    """
    Detect poisoned samples using spectral signatures.

    For each class, compute the top singular vector of the centered
    representation matrix. Poisoned samples will have anomalously
    high correlation with this vector.

    Based on Tran et al. 2018 -- 'Spectral Signatures in Backdoor
    Attacks' (NeurIPS 2018).
    """
    results = {"per_class": {}, "flagged_indices": []}
    for cls in range(num_classes):
        mask = labels == cls
        if mask.sum() < 5:
            continue
        class_reps = representations[mask]
        centered = class_reps - class_reps.mean(axis=0)
        # Compute the top singular vector
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        top_v = vt[0]
        # Project each sample onto the top singular vector
        scores = np.abs(centered @ top_v)
        # Flag outliers using MAD
        median_score = np.median(scores)
        mad = np.median(np.abs(scores - median_score))
        threshold = median_score + epsilon * 1.4826 * mad
        flagged_mask = scores > threshold
        flagged_local_indices = np.where(flagged_mask)[0]
        # Map back to global indices
        global_indices = np.where(mask)[0]
        flagged_global = global_indices[flagged_local_indices]
        results["per_class"][cls] = {
            "sample_count": int(mask.sum()),
            "flagged_count": int(flagged_mask.sum()),
            "top_singular_value": float(s[0]),
            "score_threshold": float(threshold),
        }
        results["flagged_indices"].extend(flagged_global.tolist())
    results["total_flagged"] = len(results["flagged_indices"])
    return results

Forensic Analysis Workflow
Phase 1: Triage
Before running computationally expensive detection algorithms, conduct initial triage:
- Provenance check: Where did the model come from? Was it trained in-house, fine-tuned from a public checkpoint, or obtained from a third party?
- Supply-chain audit: Were there any anomalies in the training-data pipeline, model registry, or deployment process?
- Behavioral screening: Run the model on a diverse test set and examine outputs for unexpected patterns, especially consistent misclassifications on specific input types.
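Behavioral screening can be partially automated. The sketch below flags classes whose misclassifications concentrate on a single "sink" class, a pattern consistent with (though not proof of) a backdoor target; the function name and thresholds are illustrative assumptions:

```python
import numpy as np

def screen_confusion_matrix(cm: np.ndarray, concentration_threshold: float = 0.8,
                            min_errors: int = 5) -> list[dict]:
    """cm[i, j] = count of true-class-i inputs predicted as class j."""
    suspicious = []
    for i in range(cm.shape[0]):
        errors = cm[i].astype(float).copy()
        errors[i] = 0.0                       # ignore correct predictions
        total_err = errors.sum()
        if total_err < min_errors:            # too few errors to judge
            continue
        sink = int(np.argmax(errors))
        concentration = errors[sink] / total_err
        if concentration >= concentration_threshold:
            suspicious.append({"true_class": i, "sink_class": sink,
                               "concentration": float(concentration)})
    return suspicious

# Toy confusion matrix: every class-0 error lands in class 2
cm = np.array([
    [95, 0, 5],
    [0, 98, 2],
    [1, 1, 98],
])
flags = screen_confusion_matrix(cm)
# flags contains a single entry: true_class 0, sink_class 2
```

A hit here only justifies escalating to Phase 2; natural class confusion (e.g. visually similar classes) produces the same signal.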
Phase 2: Detection
Run multiple detection algorithms, as each has different strengths:
| Method | Strengths | Weaknesses | Compute Cost |
|---|---|---|---|
| Neural Cleanse | Works without clean data access | Misses complex triggers | High (optimization per class) |
| Activation Clustering | Detects data-poisoning backdoors | Requires representative data | Medium |
| Spectral Signatures | Strong theoretical guarantees | Assumes separable representations | Low |
| STRIP (input perturbation) | Model-agnostic | High false-positive rate | Medium |
| Meta Neural Analysis | Detects novel backdoor types | Requires training a meta-classifier | Very high |
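STRIP appears in the table but not in the code above. Its core idea is easy to sketch under a model-agnostic interface (`predict_fn` returning class probabilities is an assumed interface, and the toy model is purely illustrative): superimpose the suspect input on random clean images and measure prediction entropy, since a backdoored trigger tends to survive the perturbation and keep entropy low.

```python
import numpy as np

def strip_entropy(x, clean_pool, predict_fn, n_overlays=32, alpha=0.5, rng=None):
    """Mean prediction entropy of the input blended with random clean images."""
    if rng is None:
        rng = np.random.default_rng()
    entropies = []
    for _ in range(n_overlays):
        overlay = clean_pool[rng.integers(len(clean_pool))]
        blended = alpha * x + (1 - alpha) * overlay
        p = np.clip(predict_fn(blended), 1e-12, 1.0)
        entropies.append(float(-np.sum(p * np.log(p))))
    return float(np.mean(entropies))

# Toy "model" that locks onto feature 0 whenever it exceeds 0.45
def toy_predict(x):
    if x[0] > 0.45:
        return np.array([0.98, 0.01, 0.01])
    return np.array([1 / 3, 1 / 3, 1 / 3])

rng = np.random.default_rng(1)
clean_pool = rng.random((50, 4)) * 0.4      # feature 0 stays below 0.4
x_trig = np.array([1.0, 0.2, 0.2, 0.2])     # "triggered" input
x_clean = np.array([0.2, 0.2, 0.2, 0.2])

e_trig = strip_entropy(x_trig, clean_pool, toy_predict, rng=rng)
e_clean = strip_entropy(x_clean, clean_pool, toy_predict, rng=rng)
# e_trig stays low; e_clean sits near the uniform entropy ln(3)
```

In practice, a detection threshold is calibrated on the entropy distribution of known-clean inputs, which is where STRIP's high false-positive rate comes from.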
Phase 3: Trigger Recovery
If a backdoor is detected, recover the trigger pattern for analysis. The Neural Cleanse optimization produces an initial trigger estimate; refine it using:
def refine_trigger_estimate(
    model: nn.Module,
    initial_mask: torch.Tensor,
    initial_pattern: torch.Tensor,
    target_class: int,
    validation_data: torch.utils.data.DataLoader,
    device: str = "cpu",
    refinement_epochs: int = 200,
) -> dict:
    """
    Refine a recovered trigger estimate against validation data.

    Uses a larger dataset and more optimization steps to produce
    a higher-fidelity trigger reconstruction.
    """
    model.eval()
    model.to(device)
    mask = initial_mask.clone().to(device).requires_grad_(True)
    pattern = initial_pattern.clone().to(device).requires_grad_(True)
    optimizer = optim.Adam([mask, pattern], lr=0.005)
    target = torch.tensor([target_class], device=device)
    success_rates = []
    for epoch in range(refinement_epochs):
        total, correct = 0, 0
        for inputs, _ in validation_data:
            inputs = inputs.to(device)
            optimizer.zero_grad()
            mask_sigmoid = torch.sigmoid(mask)
            x_triggered = (1 - mask_sigmoid) * inputs + mask_sigmoid * pattern
            outputs = model(x_triggered)
            loss = nn.CrossEntropyLoss()(
                outputs, target.expand(inputs.size(0))
            ) + 0.005 * torch.sum(torch.abs(mask_sigmoid))
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                pattern.clamp_(0, 1)
                preds = outputs.argmax(dim=1)
                correct += (preds == target_class).sum().item()
                total += inputs.size(0)
        success_rates.append(correct / max(total, 1))
    return {
        "refined_mask": torch.sigmoid(mask).detach(),
        "refined_pattern": pattern.detach(),
        "trigger_l1_norm": float(torch.sum(torch.abs(torch.sigmoid(mask))).item()),
        "attack_success_rate": success_rates[-1] if success_rates else 0.0,
        "convergence_history": success_rates,
    }

Phase 4: Impact Assessment
Determine the scope and severity of the backdoor:
- Target behavior: What does the model do when the trigger is present? (misclassify, produce a specific output, leak data)
- Trigger specificity: How specific is the trigger? (broad pattern vs. exact pixel arrangement)
- Attack success rate: What fraction of triggered inputs produce the target behavior?
- Clean accuracy impact: Does the backdoor affect performance on clean inputs?
- Deployment exposure: How long was the backdoored model deployed? How many inputs were processed?
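The attack-success-rate and clean-accuracy checks above can be scripted. This sketch assumes generic `predict_labels` and `apply_trigger` callables rather than any particular framework; the toy model and data at the bottom exist only to demonstrate the interface:

```python
import numpy as np

def assess_backdoor_impact(predict_labels, apply_trigger, x_clean, y_true,
                           target_class, baseline_clean_acc=None):
    """Measure attack success rate and clean-accuracy impact."""
    preds_clean = predict_labels(x_clean)
    preds_trig = predict_labels(np.stack([apply_trigger(x) for x in x_clean]))
    # Convention: measure ASR only on inputs not already of the target class
    not_target = y_true != target_class
    asr = float(np.mean(preds_trig[not_target] == target_class))
    clean_acc = float(np.mean(preds_clean == y_true))
    report = {"attack_success_rate": asr, "clean_accuracy": clean_acc}
    if baseline_clean_acc is not None:
        report["clean_accuracy_delta"] = clean_acc - baseline_clean_acc
    return report

# Toy demo: feature 0 is the trigger; feature 1 encodes the clean label
def toy_predict(X):
    return np.where(X[:, 0] > 0.9, 2, X[:, 1].astype(int))

X = np.array([[0.0, 0], [0.0, 1], [0.0, 2], [0.0, 1]])
y = np.array([0, 1, 2, 1])
trigger = lambda x: np.concatenate(([1.0], x[1:]))
report = assess_backdoor_impact(toy_predict, trigger, X, y, target_class=2)
```

Excluding inputs already belonging to the target class keeps the ASR from being inflated by correct predictions.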
Phase 5: Attribution
Map findings to threat intelligence:
- MITRE ATLAS mapping: AML.T0020 (Poison Training Data) for data-poisoning backdoors; AML.T0043 (Craft Adversarial Data) for crafted adversarial triggers
- Attack-toolkit identification: Compare recovered trigger characteristics against known backdoor attack implementations (TrojanZoo, BackdoorBench)
- Threat-actor profiling: The sophistication of the backdoor (patch vs. clean-label vs. weight-space) indicates the attacker's capability level
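One way to operationalize toolkit comparison is a simple similarity score between the recovered trigger and reference patterns. The reference library below is hypothetical, and normalized cross-correlation is just one reasonable metric:

```python
import numpy as np

def match_trigger(recovered: np.ndarray, reference_library: dict):
    """Score a recovered trigger against known reference patterns."""
    def ncc(a, b):
        # Normalized cross-correlation of standardized patterns
        a = (a - a.mean()) / (a.std() + 1e-12)
        b = (b - b.mean()) / (b.std() + 1e-12)
        return float(np.mean(a * b))
    scores = {name: ncc(recovered, ref) for name, ref in reference_library.items()}
    return max(scores, key=scores.get), scores

# Hypothetical reference patterns (4x4, values in [0, 1])
checkerboard = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)
corner_patch = np.zeros((4, 4))
corner_patch[:2, :2] = 1.0
library = {"checkerboard": checkerboard, "corner_patch": corner_patch}

rng = np.random.default_rng(0)
recovered = checkerboard + rng.normal(scale=0.05, size=(4, 4))  # noisy recovery
best, scores = match_trigger(recovered, library)
# best -> "checkerboard"
```

A strong match is supporting evidence, not attribution on its own; trigger reconstruction is noisy and many toolkits share default patterns.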
Tools and Frameworks
Several open-source tools support backdoor detection forensics:
- TrojanZoo (github.com/ain-soph/trojanzoo): Comprehensive framework for backdoor attack and defense research, implementing many detection algorithms
- BackdoorBench (github.com/SCLBD/BackdoorBench): Benchmarking platform for backdoor learning with standardized evaluations
- Adversarial Robustness Toolbox (ART): IBM's security library includes backdoor detection methods, including Neural Cleanse and Activation Clustering
- NIST TrojAI: NIST's Trojan detection challenge provides standardized evaluation datasets for backdoor detection algorithms
References
- Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., & Zhao, B. Y. (2019). Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. IEEE Symposium on Security and Privacy (S&P). https://doi.org/10.1109/SP.2019.00031
- Tran, B., Li, J., & Madry, A. (2018). Spectral Signatures in Backdoor Attacks. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1811.00636
- Gu, T., Liu, K., Dolan-Gavitt, B., & Garg, S. (2019). BadNets: Evaluating Backdooring Attacks on Deep Neural Networks. IEEE Access, 7, 47230-47244. https://doi.org/10.1109/ACCESS.2019.2909068
- MITRE ATLAS. (2024). Adversarial Threat Landscape for Artificial-Intelligence Systems. https://atlas.mitre.org/