Security of Training Data Attribution Methods
Analysis of vulnerabilities in training data attribution techniques including influence functions, membership inference, and data provenance tracking, with implications for privacy and security.
Overview
Training data attribution methods attempt to answer a critical question: which training examples most influenced a model's behavior on a given input? This question has security implications in both directions. Attribution methods are used defensively to trace model failures back to problematic training data, detect data poisoning, and comply with data governance regulations. But attribution methods are also an attack surface: an adversary can exploit attribution systems to evade data poisoning detection, conduct membership inference attacks to determine whether specific data was used in training, or manipulate provenance records to obscure the origins of training data.
The foundational work on influence functions by Koh and Liang (2017) in "Understanding Black-Box Predictions via Influence Functions" established the mathematical framework for attributing model predictions to training examples. Their approach uses inverse Hessian-vector products to estimate how removing a single training example would change the model's loss on a test point. While powerful, this approach has known limitations that create security vulnerabilities: it relies on convexity assumptions that do not hold for deep networks, it is computationally expensive (making comprehensive auditing difficult), and it can be manipulated by adversarially crafted training examples.
Carlini et al. (2021) demonstrated in "Extracting Training Data from Large Language Models" that the connection between training data and model outputs is both more direct and more exploitable than previously understood. Their work on training data extraction showed that membership inference — determining whether a specific example was in the training set — is practical against large language models, raising both privacy and security concerns.
Training Data Attribution Methods
Influence Functions
Influence functions estimate the effect of a single training example on the model's prediction by computing a first-order approximation of the leave-one-out retraining procedure.
"""
Influence function computation and 安全 analysis.
Demonstrates the computation and limitations of influence functions
for 訓練資料 attribution.
"""
import numpy as np
from dataclasses import dataclass
from typing import Optional
@dataclass
class InfluenceResult:
"""Result of influence function computation."""
test_index: int
training_index: int
influence_score: float
is_helpful: bool # Positive influence = helpful
computation_method: str
confidence: float
def compute_influence_approximation(
training_gradients: np.ndarray,
test_gradient: np.ndarray,
hessian_inverse_approx: np.ndarray,
) -> np.ndarray:
"""
Compute influence function approximation.
The influence of 訓練 example z_i on 測試 loss at z_test is:
I(z_i, z_test) = -grad_test^T @ H^{-1} @ grad_i
where H is the Hessian of the 訓練 loss and grad_i, grad_test
are the per-example gradients.
Args:
training_gradients: Shape (num_train, param_dim).
test_gradient: Shape (param_dim,).
hessian_inverse_approx: Shape (param_dim, param_dim).
Approximation of the inverse Hessian.
Returns:
Influence scores 對每個 訓練 example.
"""
# ihvp = H^{-1} @ grad_test (inverse Hessian-vector product)
ihvp = hessian_inverse_approx @ test_gradient
# Influence = -grad_train_i^T @ ihvp 對每個 訓練 example
influences = -training_gradients @ ihvp
return influences
def stochastic_hessian_inverse(
training_gradients: np.ndarray,
damping: float = 0.01,
num_iterations: int = 100,
scale: float = 1.0,
) -> np.ndarray:
"""
Compute an approximation of the inverse Hessian using
the Neumann series approach.
H^{-1} ≈ (1/damping) * sum_{j=0}^{J} (I - H/damping)^j
這是 cheaper than direct inversion but introduces
approximation error that affects attribution accuracy.
"""
n, d = training_gradients.shape
# Approximate H using the outer product of gradients
# H ≈ (1/n) * sum_i grad_i @ grad_i^T + damping * I
H_approx = (training_gradients.T @ training_gradients) / n
H_approx += damping * np.eye(d)
# Direct inversion for small dimensions (demonstration)
# In practice, use iterative methods for large models
try:
H_inv = np.linalg.inv(H_approx) * scale
except np.linalg.LinAlgError:
H_inv = np.linalg.pinv(H_approx) * scale
return H_inv
# Demonstration
np.random.seed(42)
num_train, param_dim = 100, 32
# Simulate per-example gradients
train_grads = np.random.randn(num_train, param_dim) * 0.01
test_grad = np.random.randn(param_dim) * 0.01
# Compute influence scores
H_inv = stochastic_hessian_inverse(train_grads, damping=0.01)
influences = compute_influence_approximation(train_grads, test_grad, H_inv)
# Report top influential 訓練 examples
top_helpful = np.argsort(influences)[:5] # Most negative = most helpful
top_harmful = np.argsort(influences)[-5:][::-1] # Most positive = most harmful
print("Most helpful 訓練 examples (reduce 測試 loss):")
for idx in top_helpful:
print(f" Training example {idx}: influence = {influences[idx]:.6f}")
print("\nMost harmful 訓練 examples (increase 測試 loss):")
for idx in top_harmful:
print(f" Training example {idx}: influence = {influences[idx]:.6f}")Limitations That Create 安全 漏洞
Influence functions have several known limitations that create exploitable security vulnerabilities:
- Non-convexity: Influence functions assume a convex loss landscape. Deep networks are highly non-convex, making the Hessian approximation unreliable. Attackers can craft training examples that appear benign to influence function analysis but have significant actual impact.
- Computational cost: Computing influence for all training examples is expensive (quadratic in model parameters). In practice, organizations use approximations that reduce accuracy, creating blind spots attackers can exploit.
- Stale attributions: Influence functions are computed at a specific model checkpoint. As training continues, the attributions become stale. Attackers can time their poisoning to exploit the gap between attribution computations.
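The staleness problem can be illustrated with a toy simulation: influence scores are computed once from per-example gradients at one checkpoint, training then continues, and the gradients drift. The drift model, dimensions, and identity stand-in for the inverse Hessian below are illustrative assumptions, not measurements on a real model.

```python
"""
Toy sketch of attribution staleness (synthetic gradients, assumed drift).
Influence scores from an old checkpoint are compared, by rank, with
scores recomputed after simulated further training.
"""
import numpy as np

def influence_scores(train_grads, test_grad, h_inv):
    # Same formula as the influence-function snippet above
    return -train_grads @ (h_inv @ test_grad)

rng = np.random.default_rng(0)
n, d = 200, 16
grads_t0 = rng.normal(0.0, 0.01, (n, d))   # gradients at checkpoint t0
test_grad = rng.normal(0.0, 0.01, d)
h_inv = np.eye(d)                          # identity stand-in for H^{-1}

stale = influence_scores(grads_t0, test_grad, h_inv)

rank_corrs = {}
for drift in (0.0, 0.5, 1.0, 2.0):
    # Simulated continued training: gradients drift away from t0
    grads_t1 = grads_t0 + rng.normal(0.0, 0.01 * drift, (n, d))
    fresh = influence_scores(grads_t1, test_grad, h_inv)
    # Spearman-style rank correlation between stale and fresh scores
    ranks_stale = np.argsort(np.argsort(stale))
    ranks_fresh = np.argsort(np.argsort(fresh))
    rank_corrs[drift] = float(np.corrcoef(ranks_stale, ranks_fresh)[0, 1])
    print(f"drift={drift:.1f}  rank correlation (stale vs fresh): {rank_corrs[drift]:.3f}")
```

As the drift grows, the stale scores say less and less about the current model — exactly the window a timed poisoning attack exploits.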
"""
Exploiting influence function limitations.
Demonstrates how 攻擊者 can craft 訓練 examples that
evade influence function-based 偵測.
"""
import numpy as np
def craft_influence_evading_poison(
clean_gradients: np.ndarray,
target_gradient: np.ndarray,
hessian_inverse: np.ndarray,
detection_threshold: float = 0.1,
max_attempts: int = 100,
seed: int = 42,
) -> Optional[np.ndarray]:
"""
Craft a poisoned 訓練 example whose gradient evades
influence function 偵測 while still having the desired
effect on model behavior.
Strategy: decompose the poisoned gradient into a component
that is detected by influence functions and a component that
is not (lies in the null space of H^{-1} @ test_gradient).
Args:
clean_gradients: Gradients of clean 訓練 examples.
target_gradient: The gradient direction we want to inject.
hessian_inverse: Approximate inverse Hessian.
detection_threshold: Maximum influence score to avoid 偵測.
Returns:
A poisoned gradient that evades influence 偵測, or None
if evasion is not possible within the constraints.
"""
rng = np.random.default_rng(seed)
# The 偵測 direction: H^{-1} @ test_grad
# Influence = -poison_grad^T @ H^{-1} @ test_grad
# To evade: make poison_grad orthogonal to H^{-1} @ test_grad
# while keeping it aligned with target_gradient
test_grad_mean = clean_gradients.mean(axis=0)
detection_direction = hessian_inverse @ test_grad_mean
detection_direction /= np.linalg.norm(detection_direction) + 1e-10
# Project target gradient to remove the detectable component
target_norm = target_gradient / (np.linalg.norm(target_gradient) + 1e-10)
detectable_component = np.dot(target_norm, detection_direction) * detection_direction
evasive_gradient = target_norm - detectable_component
# Verify evasion
if np.linalg.norm(evasive_gradient) < 1e-10:
return None # Cannot evade (target is aligned with 偵測 direction)
evasive_gradient /= np.linalg.norm(evasive_gradient)
# Scale to have similar magnitude to clean gradients
clean_norm = np.mean([np.linalg.norm(g) for g in clean_gradients])
evasive_gradient *= clean_norm
# Verify influence score is below threshold
influence = -evasive_gradient @ hessian_inverse @ test_grad_mean
if abs(influence) < detection_threshold:
return evasive_gradient
return None
# Demonstration
np.random.seed(42)
n, d = 100, 32
clean_grads = np.random.randn(n, d) * 0.01
target_grad = np.random.randn(d) * 0.01 # What we want to inject
H_inv = stochastic_hessian_inverse(clean_grads, damping=0.01)
evasive = craft_influence_evading_poison(
clean_grads, target_grad, H_inv, detection_threshold=0.001
)
if evasive is not None:
# Check influence score
test_grad = clean_grads.mean(axis=0)
influence = -evasive @ H_inv @ test_grad
對齊 = np.dot(
evasive / np.linalg.norm(evasive),
target_grad / np.linalg.norm(target_grad)
)
print(f"Evasive gradient crafted:")
print(f" Influence score: {influence:.6f} (below threshold)")
print(f" Alignment with target: {對齊:.3f}")
print(f" Norm: {np.linalg.norm(evasive):.4f} (matches clean: {np.mean([np.linalg.norm(g) for g in clean_grads]):.4f})")
else:
print("Could not craft an evasive gradient (target aligned with 偵測)")Membership Inference 攻擊
Exploiting Attribution for Privacy Attacks
Membership inference asks: "Was a specific data point used to train this model?" This is both a privacy attack and a security diagnostic. For red teaming, membership inference can reveal what data an organization used to train its model, potentially exposing confidential datasets or regulatory violations.
"""
Membership 推論 attack 實作.
Tests whether specific examples were used in model 訓練
based on model behavior differences.
"""
import numpy as np
from dataclasses import dataclass
from typing import Optional
@dataclass
class MembershipInferenceResult:
"""Result of a membership 推論 attack."""
example_id: str
predicted_member: bool
confidence: float
loss_value: float
loss_threshold: float
method: str
def loss_based_membership_inference(
target_losses: np.ndarray,
reference_losses: np.ndarray,
threshold_percentile: float = 50.0,
) -> tuple[np.ndarray, float]:
"""
Simple loss-based membership 推論.
The key insight: 訓練 examples tend to have lower loss
than non-訓練 examples 因為 模型 has been optimized
to reduce loss on the 訓練 set.
Args:
target_losses: Per-example losses for examples to classify.
reference_losses: Losses for known non-member examples.
threshold_percentile: Percentile of reference losses to use
as the decision threshold.
Returns:
(predictions, threshold) — boolean predictions and the threshold used.
"""
threshold = np.percentile(reference_losses, threshold_percentile)
predictions = target_losses < threshold # Low loss = likely member
return predictions, float(threshold)
def calibrated_membership_inference(
target_losses: np.ndarray,
shadow_member_losses: np.ndarray,
shadow_nonmember_losses: np.ndarray,
) -> np.ndarray:
"""
Calibrated membership 推論 using shadow model statistics.
Uses the distribution of losses from a shadow model (trained
on similar data) to calibrate the membership decision.
Members and non-members have different loss distributions,
and the shadow model helps estimate these distributions.
Based on methodology from Carlini et al. 2022,
"Membership Inference 攻擊 From First Principles."
"""
# Fit Gaussian distributions for member and non-member losses
member_mean = np.mean(shadow_member_losses)
member_std = np.std(shadow_member_losses) + 1e-10
nonmember_mean = np.mean(shadow_nonmember_losses)
nonmember_std = np.std(shadow_nonmember_losses) + 1e-10
# Compute likelihood ratio 對每個 target example
member_likelihood = np.exp(
-0.5 * ((target_losses - member_mean) / member_std) ** 2
) / member_std
nonmember_likelihood = np.exp(
-0.5 * ((target_losses - nonmember_mean) / nonmember_std) ** 2
) / nonmember_std
# Membership probability via Bayes rule (assuming 50/50 prior)
membership_prob = member_likelihood / (
member_likelihood + nonmember_likelihood + 1e-10
)
return membership_prob
def evaluate_membership_inference(
predictions: np.ndarray,
ground_truth: np.ndarray,
) -> dict:
"""評估 membership 推論 attack accuracy."""
tp = np.sum(predictions & ground_truth)
fp = np.sum(predictions & ~ground_truth)
fn = np.sum(~predictions & ground_truth)
tn = np.sum(~predictions & ~ground_truth)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
accuracy = (tp + tn) / len(predictions) if len(predictions) > 0 else 0.0
# True positive rate at low false positive rate (安全-relevant metric)
# In the binary case, 這是 just a proxy
fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
return {
"accuracy": float(accuracy),
"precision": float(precision),
"recall": float(recall),
"true_positive_rate": float(tpr),
"false_positive_rate": float(fpr),
"advantage": float(tpr - fpr), # TPR - FPR (random = 0)
}
# Demonstration
np.random.seed(42)
# Simulate losses: members have lower loss than non-members
member_losses = np.random.normal(1.5, 0.3, 200) # 訓練資料
nonmember_losses = np.random.normal(2.0, 0.4, 200) # Not in 訓練
# Shadow model losses for calibration
shadow_member = np.random.normal(1.6, 0.35, 500)
shadow_nonmember = np.random.normal(2.1, 0.45, 500)
# Mix members and non-members as target
target_losses = np.concatenate([member_losses[:100], nonmember_losses[:100]])
ground_truth = np.concatenate([
np.ones(100, dtype=bool),
np.zeros(100, dtype=bool),
])
# Simple loss-based attack
simple_preds, threshold = loss_based_membership_inference(
target_losses, nonmember_losses[100:], threshold_percentile=50
)
simple_eval = evaluate_membership_inference(simple_preds, ground_truth)
# Calibrated attack
membership_probs = calibrated_membership_inference(
target_losses, shadow_member, shadow_nonmember
)
calibrated_preds = membership_probs > 0.5
calibrated_eval = evaluate_membership_inference(calibrated_preds, ground_truth)
print("Loss-based MI attack:")
print(f" Accuracy: {simple_eval['accuracy']:.1%}")
print(f" Advantage: {simple_eval['advantage']:.3f}")
print("\nCalibrated MI attack:")
print(f" Accuracy: {calibrated_eval['accuracy']:.1%}")
print(f" Advantage: {calibrated_eval['advantage']:.3f}")Data Provenance 攻擊
Manipulating Attribution Records
Data provenance systems track the lineage of training data — where it came from, how it was processed, and who was responsible for each step. These systems are increasingly important for regulatory compliance (the GDPR right to explanation, EU AI Act data governance). Attackers who can manipulate provenance records can obscure the source of poisoned data or falsely attribute data to legitimate sources.
"""
Data provenance integrity verification.
Implements a tamper-evident provenance chain for 訓練資料.
"""
import hashlib
import json
import time
from dataclasses import dataclass, field
@dataclass
class ProvenanceRecord:
"""A single record in a data provenance chain."""
record_id: str
data_hash: str
source: str
transformation: str
timestamp: float
previous_hash: str
actor: str
def create_provenance_chain(
records: list[dict],
) -> list[ProvenanceRecord]:
"""
Create a tamper-evident provenance chain (similar to a blockchain).
Each record contains the hash of the previous record, making
it impossible to modify earlier records without 偵測.
"""
chain = []
previous_hash = "genesis"
for record_data in records:
record_content = json.dumps(record_data, sort_keys=True)
data_hash = hashlib.sha256(record_content.encode()).hexdigest()
chain_input = f"{data_hash}:{previous_hash}:{record_data.get('timestamp', 0)}"
record_id = hashlib.sha256(chain_input.encode()).hexdigest()[:16]
record = ProvenanceRecord(
record_id=record_id,
data_hash=data_hash,
source=record_data.get("source", "unknown"),
transformation=record_data.get("transformation", "none"),
timestamp=record_data.get("timestamp", time.time()),
previous_hash=previous_hash,
actor=record_data.get("actor", "system"),
)
chain.append(record)
previous_hash = record_id
return chain
def verify_provenance_chain(
chain: list[ProvenanceRecord],
) -> dict:
"""Verify the integrity of a provenance chain."""
issues = []
for i, record in enumerate(chain):
# Verify chain linkage
if i == 0:
if record.previous_hash != "genesis":
issues.append(f"Record {i}: invalid genesis (expected 'genesis')")
else:
if record.previous_hash != chain[i - 1].record_id:
issues.append(
f"Record {i}: broken chain link "
f"(expected {chain[i - 1].record_id}, got {record.previous_hash})"
)
# Verify temporal ordering
if i > 0 and record.timestamp < chain[i - 1].timestamp:
issues.append(f"Record {i}: timestamp out of order")
return {
"chain_length": len(chain),
"is_valid": len(issues) == 0,
"issues": issues,
}
# Demonstration
records = [
{"source": "wikipedia", "transformation": "crawl", "timestamp": 1000.0, "actor": "crawler"},
{"source": "wikipedia", "transformation": "dedup", "timestamp": 1001.0, "actor": "pipeline"},
{"source": "wikipedia", "transformation": "filter", "timestamp": 1002.0, "actor": "pipeline"},
{"source": "wikipedia", "transformation": "tokenize", "timestamp": 1003.0, "actor": "pipeline"},
]
chain = create_provenance_chain(records)
valid_result = verify_provenance_chain(chain)
print(f"Valid chain: {valid_result['is_valid']}")
# Tamper with the chain (modify an intermediate record)
tampered_chain = list(chain)
tampered_chain[1] = ProvenanceRecord(
record_id="tampered_id",
data_hash="fake_hash",
source="malicious_source",
transformation="poison",
timestamp=1001.0,
previous_hash=chain[0].record_id,
actor="攻擊者",
)
tampered_result = verify_provenance_chain(tampered_chain)
print(f"Tampered chain: {tampered_result['is_valid']}")
print(f"Issues: {tampered_result['issues']}")Defensive Strategies
Building Robust Attribution Systems
- Multi-method attribution: Do not rely on a single attribution method. Combine influence functions, TracIn, TRAK, and representation-based methods. Poisoned examples that evade one method may be detected by another.
- Continuous attribution monitoring: Compute attributions periodically during training, not just after training completes. This reduces the window for timing-based evasion attacks.
- Adversarial attribution testing: Red team the attribution system itself by crafting examples designed to evade detection and measuring the detection rate.
- Differential privacy as a defense: Training with differential privacy bounds the influence of any single training example, limiting both membership inference success and the impact of data poisoning. The privacy-utility tradeoff must be carefully calibrated.
- Provenance chain integrity: Implement tamper-evident provenance tracking for all training data, with cryptographic verification at each processing stage.
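The multi-method strategy can be sketched as a simple score ensemble: an influence-function score at the final checkpoint is combined with a TracIn-style score (learning-rate-weighted gradient dot products summed over checkpoints, following Pruthi et al. 2020), and an example is flagged if either score is an outlier. The synthetic gradients, the planted poison, and the 3-sigma flagging rule below are illustrative assumptions, not a production detector.

```python
"""
Sketch of multi-method attribution: flag a training example if EITHER
an influence-function score OR a TracIn-style score marks it as an
outlier. Synthetic gradients; thresholds are illustrative assumptions.
"""
import numpy as np

def influence_scores(train_grads, test_grad, h_inv):
    # Influence-function score at a single (final) checkpoint
    return -train_grads @ (h_inv @ test_grad)

def tracin_scores(ckpt_grads, ckpt_test_grads, lr=0.1):
    # TracIn-style score: learning-rate-weighted gradient dot products
    # summed over training checkpoints (Pruthi et al. 2020)
    return sum(lr * (g @ tg) for g, tg in zip(ckpt_grads, ckpt_test_grads))

rng = np.random.default_rng(7)
n, d = 300, 16
test_grad = rng.normal(0.0, 0.01, d)
ckpt_grads = [rng.normal(0.0, 0.01, (n, d)) for _ in range(3)]   # 3 checkpoints
ckpt_test = [test_grad + rng.normal(0.0, 0.002, d) for _ in range(3)]

# Plant a poison: orthogonal to the final test gradient (so its
# final-checkpoint influence is ~0) but strongly aligned earlier.
idx = 0
poison = rng.normal(0.0, 0.01, d)
poison -= (poison @ test_grad) / (test_grad @ test_grad) * test_grad
ckpt_grads[-1][idx] = poison
for grads, tg in zip(ckpt_grads[:-1], ckpt_test[:-1]):
    grads[idx] = 5.0 * tg

def flag_outliers(scores, z=3.0):
    mu, sd = scores.mean(), scores.std() + 1e-12
    return np.abs(scores - mu) > z * sd

inf = influence_scores(ckpt_grads[-1], test_grad, np.eye(d))
tri = tracin_scores(ckpt_grads, ckpt_test)
ensemble_flags = flag_outliers(inf) | flag_outliers(tri)

print(f"Influence flags the poison: {bool(flag_outliers(inf)[idx])}")
print(f"TracIn flags the poison:    {bool(flag_outliers(tri)[idx])}")
print(f"Ensemble flags the poison:  {bool(ensemble_flags[idx])}")
```

The poison evades the final-checkpoint influence score by construction, but its checkpoint history gives it away to the TracIn-style score — the complementarity the multi-method bullet relies on.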
References
- Koh, P. W., & Liang, P. (2017). "Understanding Black-Box Predictions via Influence Functions." ICML 2017.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Carlini, N., et al. (2022). "Membership Inference Attacks From First Principles." IEEE S&P 2022.
- Park, S., et al. (2023). "TRAK: Attributing Model Behavior at Scale." ICML 2023.
- Pruthi, G., et al. (2020). "Estimating Training Data Influence by Tracing Gradient Descent." NeurIPS 2020.