Security of Training Data Attribution Methods
Analysis of vulnerabilities in training data attribution techniques including influence functions, membership inference, and data provenance tracking, with implications for privacy and security.
Overview
Training data attribution methods attempt to answer a critical question: which training examples most influenced a model's behavior on a given input? This question has security implications in both directions. Attribution methods are used defensively to trace model failures back to problematic training data, detect data poisoning, and comply with data governance regulations. But attribution methods are also an attack surface: an adversary can exploit attribution systems to evade data poisoning detection, conduct membership inference attacks to determine whether specific data was used in training, or manipulate provenance records to obscure the origins of training data.
The foundational work on influence functions by Koh and Liang (2017) in "Understanding Black-Box Predictions via Influence Functions" established the mathematical framework for attributing model predictions to training examples. Their approach uses inverse Hessian-vector products to estimate how removing a single training example would change the model's loss on a test point. While powerful, this approach has known limitations that create security vulnerabilities: it relies on convexity assumptions that do not hold for deep networks, it is computationally expensive (making comprehensive auditing difficult), and it can be manipulated by adversarially crafted training examples.
Carlini et al. (2021) demonstrated in "Extracting Training Data from Large Language Models" that the connection between training data and model outputs is both more direct and more exploitable than previously understood. Their work on training data extraction showed that membership inference — determining whether a specific example was in the training set — is practical against large language models, raising both privacy and security concerns.
Training Data Attribution Methods
Influence Functions
Influence functions estimate the effect of a single training example on the model's prediction by computing a first-order approximation of the leave-one-out retraining procedure.
"""
Influence function computation and 安全 analysis.
Demonstrates the computation and limitations of influence functions
for 訓練資料 attribution.
"""
import numpy as np
from dataclasses import dataclass
from typing import Optional
@dataclass
class InfluenceResult:
"""Result of influence function computation."""
test_index: int
training_index: int
influence_score: float
is_helpful: bool # Positive influence = helpful
computation_method: str
confidence: float
def compute_influence_approximation(
training_gradients: np.ndarray,
test_gradient: np.ndarray,
hessian_inverse_approx: np.ndarray,
) -> np.ndarray:
"""
Compute influence function approximation.
The influence of 訓練 example z_i on 測試 loss at z_test is:
I(z_i, z_test) = -grad_test^T @ H^{-1} @ grad_i
where H is the Hessian of the 訓練 loss and grad_i, grad_test
are the per-example gradients.
Args:
training_gradients: Shape (num_train, param_dim).
test_gradient: Shape (param_dim,).
hessian_inverse_approx: Shape (param_dim, param_dim).
Approximation of the inverse Hessian.
Returns:
Influence scores 對每個 訓練 example.
"""
# ihvp = H^{-1} @ grad_test (inverse Hessian-vector product)
ihvp = hessian_inverse_approx @ test_gradient
# Influence = -grad_train_i^T @ ihvp 對每個 訓練 example
influences = -training_gradients @ ihvp
return influences
def stochastic_hessian_inverse(
training_gradients: np.ndarray,
damping: float = 0.01,
num_iterations: int = 100,
scale: float = 1.0,
) -> np.ndarray:
"""
Compute an approximation of the inverse Hessian using
the Neumann series approach.
H^{-1} ≈ (1/damping) * sum_{j=0}^{J} (I - H/damping)^j
這是 cheaper than direct inversion but introduces
approximation error that affects attribution accuracy.
"""
n, d = training_gradients.shape
# Approximate H using the outer product of gradients
# H ≈ (1/n) * sum_i grad_i @ grad_i^T + damping * I
H_approx = (training_gradients.T @ training_gradients) / n
H_approx += damping * np.eye(d)
# Direct inversion for small dimensions (demonstration)
# In practice, use iterative methods for large models
try:
H_inv = np.linalg.inv(H_approx) * scale
except np.linalg.LinAlgError:
H_inv = np.linalg.pinv(H_approx) * scale
return H_inv
# Demonstration
np.random.seed(42)
num_train, param_dim = 100, 32
# Simulate per-example gradients
train_grads = np.random.randn(num_train, param_dim) * 0.01
test_grad = np.random.randn(param_dim) * 0.01
# Compute influence scores
H_inv = stochastic_hessian_inverse(train_grads, damping=0.01)
influences = compute_influence_approximation(train_grads, test_grad, H_inv)
# Report top influential 訓練 examples
top_helpful = np.argsort(influences)[:5] # Most negative = most helpful
top_harmful = np.argsort(influences)[-5:][::-1] # Most positive = most harmful
print("Most helpful 訓練 examples (reduce 測試 loss):")
for idx in top_helpful:
print(f" Training example {idx}: influence = {influences[idx]:.6f}")
print("\nMost harmful 訓練 examples (increase 測試 loss):")
for idx in top_harmful:
print(f" Training example {idx}: influence = {influences[idx]:.6f}")Limitations That Create 安全 漏洞
Influence functions have several known limitations that create exploitable security vulnerabilities:
- Non-convexity: Influence functions assume a convex loss landscape. Deep networks are highly non-convex, making the Hessian approximation unreliable. Attackers can craft training examples that appear benign to influence function analysis but have significant actual impact.
- Computational cost: Computing influence for all training examples is expensive (quadratic in model parameters). In practice, organizations use approximations that reduce accuracy, creating blind spots attackers can exploit.
- Stale attributions: Influence functions are computed at a specific model checkpoint. As training continues, the attributions become stale. Attackers can time their poisoning to exploit the gap between attribution computations.
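The staleness problem can be illustrated with a toy simulation: influence scores are computed once from per-example gradients at one checkpoint, training then continues, and the gradients drift. The drift model, dimensions, and identity stand-in for the inverse Hessian below are illustrative assumptions, not measurements on a real model.

```python
"""
Toy sketch of attribution staleness (synthetic gradients, assumed drift).
Influence scores from an old checkpoint are compared, by rank, with
scores recomputed after simulated further training.
"""
import numpy as np

def influence_scores(train_grads, test_grad, h_inv):
    # Same formula as the influence-function snippet above
    return -train_grads @ (h_inv @ test_grad)

rng = np.random.default_rng(0)
n, d = 200, 16
grads_t0 = rng.normal(0.0, 0.01, (n, d))   # gradients at checkpoint t0
test_grad = rng.normal(0.0, 0.01, d)
h_inv = np.eye(d)                          # identity stand-in for H^{-1}

stale = influence_scores(grads_t0, test_grad, h_inv)

rank_corrs = {}
for drift in (0.0, 0.5, 1.0, 2.0):
    # Simulated continued training: gradients drift away from t0
    grads_t1 = grads_t0 + rng.normal(0.0, 0.01 * drift, (n, d))
    fresh = influence_scores(grads_t1, test_grad, h_inv)
    # Spearman-style rank correlation between stale and fresh scores
    ranks_stale = np.argsort(np.argsort(stale))
    ranks_fresh = np.argsort(np.argsort(fresh))
    rank_corrs[drift] = float(np.corrcoef(ranks_stale, ranks_fresh)[0, 1])
    print(f"drift={drift:.1f}  rank correlation (stale vs fresh): {rank_corrs[drift]:.3f}")
```

As the drift grows, the stale scores say less and less about the current model — exactly the window a timed poisoning attack exploits.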
"""
Exploiting influence function limitations.
Demonstrates how 攻擊者 can craft 訓練 examples that
evade influence function-based 偵測.
"""
import numpy as np
def craft_influence_evading_poison(
clean_gradients: np.ndarray,
target_gradient: np.ndarray,
hessian_inverse: np.ndarray,
detection_threshold: float = 0.1,
max_attempts: int = 100,
seed: int = 42,
) -> Optional[np.ndarray]:
"""
Craft a poisoned 訓練 example whose gradient evades
influence function 偵測 while still having the desired
effect on model behavior.
Strategy: decompose the poisoned gradient into a component
that is detected by influence functions and a component that
is not (lies in the null space of H^{-1} @ test_gradient).
Args:
clean_gradients: Gradients of clean 訓練 examples.
target_gradient: The gradient direction we want to inject.
hessian_inverse: Approximate inverse Hessian.
detection_threshold: Maximum influence score to avoid 偵測.
Returns:
A poisoned gradient that evades influence 偵測, or None
if evasion is not possible within the constraints.
"""
rng = np.random.default_rng(seed)
# The 偵測 direction: H^{-1} @ test_grad
# Influence = -poison_grad^T @ H^{-1} @ test_grad
# To evade: make poison_grad orthogonal to H^{-1} @ test_grad
# while keeping it aligned with target_gradient
test_grad_mean = clean_gradients.mean(axis=0)
detection_direction = hessian_inverse @ test_grad_mean
detection_direction /= np.linalg.norm(detection_direction) + 1e-10
# Project target gradient to remove the detectable component
target_norm = target_gradient / (np.linalg.norm(target_gradient) + 1e-10)
detectable_component = np.dot(target_norm, detection_direction) * detection_direction
evasive_gradient = target_norm - detectable_component
# Verify evasion
if np.linalg.norm(evasive_gradient) < 1e-10:
return None # Cannot evade (target is aligned with 偵測 direction)
evasive_gradient /= np.linalg.norm(evasive_gradient)
# Scale to have similar magnitude to clean gradients
clean_norm = np.mean([np.linalg.norm(g) for g in clean_gradients])
evasive_gradient *= clean_norm
# Verify influence score is below threshold
influence = -evasive_gradient @ hessian_inverse @ test_grad_mean
if abs(influence) < detection_threshold:
return evasive_gradient
return None
# Demonstration
np.random.seed(42)
n, d = 100, 32
clean_grads = np.random.randn(n, d) * 0.01
target_grad = np.random.randn(d) * 0.01 # What we want to inject
H_inv = stochastic_hessian_inverse(clean_grads, damping=0.01)
evasive = craft_influence_evading_poison(
clean_grads, target_grad, H_inv, detection_threshold=0.001
)
if evasive is not None:
# Check influence score
test_grad = clean_grads.mean(axis=0)
influence = -evasive @ H_inv @ test_grad
對齊 = np.dot(
evasive / np.linalg.norm(evasive),
target_grad / np.linalg.norm(target_grad)
)
print(f"Evasive gradient crafted:")
print(f" Influence score: {influence:.6f} (below threshold)")
print(f" Alignment with target: {對齊:.3f}")
print(f" Norm: {np.linalg.norm(evasive):.4f} (matches clean: {np.mean([np.linalg.norm(g) for g in clean_grads]):.4f})")
else:
print("Could not craft an evasive gradient (target aligned with 偵測)")Membership Inference 攻擊
Exploiting Attribution for Privacy Attacks
Membership inference asks: "Was a specific data point used to train this model?" This is both a privacy attack and a security diagnostic. For red teaming, membership inference can reveal what data an organization used to train its model, potentially exposing confidential datasets or regulatory violations.
"""
Membership 推論 attack 實作.
Tests whether specific examples were used in model 訓練
based on model behavior differences.
"""
import numpy as np
from dataclasses import dataclass
from typing import Optional
@dataclass
class MembershipInferenceResult:
"""Result of a membership 推論 attack."""
example_id: str
predicted_member: bool
confidence: float
loss_value: float
loss_threshold: float
method: str
def loss_based_membership_inference(
target_losses: np.ndarray,
reference_losses: np.ndarray,
threshold_percentile: float = 50.0,
) -> tuple[np.ndarray, float]:
"""
Simple loss-based membership 推論.
The key insight: 訓練 examples tend to have lower loss
than non-訓練 examples 因為 模型 has been optimized
to reduce loss on the 訓練 set.
Args:
target_losses: Per-example losses for examples to classify.
reference_losses: Losses for known non-member examples.
threshold_percentile: Percentile of reference losses to use
as the decision threshold.
Returns:
(predictions, threshold) — boolean predictions and the threshold used.
"""
threshold = np.percentile(reference_losses, threshold_percentile)
predictions = target_losses < threshold # Low loss = likely member
return predictions, float(threshold)
def calibrated_membership_inference(
target_losses: np.ndarray,
shadow_member_losses: np.ndarray,
shadow_nonmember_losses: np.ndarray,
) -> np.ndarray:
"""
Calibrated membership 推論 using shadow model statistics.
Uses the distribution of losses from a shadow model (trained
on similar data) to calibrate the membership decision.
Members and non-members have different loss distributions,
and the shadow model helps estimate these distributions.
Based on methodology from Carlini et al. 2022,
"Membership Inference 攻擊 From First Principles."
"""
# Fit Gaussian distributions for member and non-member losses
member_mean = np.mean(shadow_member_losses)
member_std = np.std(shadow_member_losses) + 1e-10
nonmember_mean = np.mean(shadow_nonmember_losses)
nonmember_std = np.std(shadow_nonmember_losses) + 1e-10
# Compute likelihood ratio 對每個 target example
member_likelihood = np.exp(
-0.5 * ((target_losses - member_mean) / member_std) ** 2
) / member_std
nonmember_likelihood = np.exp(
-0.5 * ((target_losses - nonmember_mean) / nonmember_std) ** 2
) / nonmember_std
# Membership probability via Bayes rule (assuming 50/50 prior)
membership_prob = member_likelihood / (
member_likelihood + nonmember_likelihood + 1e-10
)
return membership_prob
def evaluate_membership_inference(
predictions: np.ndarray,
ground_truth: np.ndarray,
) -> dict:
"""評估 membership 推論 attack accuracy."""
tp = np.sum(predictions & ground_truth)
fp = np.sum(predictions & ~ground_truth)
fn = np.sum(~predictions & ground_truth)
tn = np.sum(~predictions & ~ground_truth)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
accuracy = (tp + tn) / len(predictions) if len(predictions) > 0 else 0.0
# True positive rate at low false positive rate (安全-relevant metric)
# In the binary case, 這是 just a proxy
fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
return {
"accuracy": float(accuracy),
"precision": float(precision),
"recall": float(recall),
"true_positive_rate": float(tpr),
"false_positive_rate": float(fpr),
"advantage": float(tpr - fpr), # TPR - FPR (random = 0)
}
# Demonstration
np.random.seed(42)
# Simulate losses: members have lower loss than non-members
member_losses = np.random.normal(1.5, 0.3, 200) # 訓練資料
nonmember_losses = np.random.normal(2.0, 0.4, 200) # Not in 訓練
# Shadow model losses for calibration
shadow_member = np.random.normal(1.6, 0.35, 500)
shadow_nonmember = np.random.normal(2.1, 0.45, 500)
# Mix members and non-members as target
target_losses = np.concatenate([member_losses[:100], nonmember_losses[:100]])
ground_truth = np.concatenate([
np.ones(100, dtype=bool),
np.zeros(100, dtype=bool),
])
# Simple loss-based attack
simple_preds, threshold = loss_based_membership_inference(
target_losses, nonmember_losses[100:], threshold_percentile=50
)
simple_eval = evaluate_membership_inference(simple_preds, ground_truth)
# Calibrated attack
membership_probs = calibrated_membership_inference(
target_losses, shadow_member, shadow_nonmember
)
calibrated_preds = membership_probs > 0.5
calibrated_eval = evaluate_membership_inference(calibrated_preds, ground_truth)
print("Loss-based MI attack:")
print(f" Accuracy: {simple_eval['accuracy']:.1%}")
print(f" Advantage: {simple_eval['advantage']:.3f}")
print("\nCalibrated MI attack:")
print(f" Accuracy: {calibrated_eval['accuracy']:.1%}")
print(f" Advantage: {calibrated_eval['advantage']:.3f}")Data Provenance 攻擊
Manipulating Attribution Records
Data provenance systems track the lineage of training data — where it came from, how it was processed, and who was responsible for each step. These systems are increasingly important for regulatory compliance (the GDPR right to explanation, EU AI Act data governance). Attackers who can manipulate provenance records can obscure the source of poisoned data or falsely attribute data to legitimate sources.
"""
Data provenance integrity verification.
Implements a tamper-evident provenance chain for 訓練資料.
"""
import hashlib
import json
import time
from dataclasses import dataclass, field
@dataclass
class ProvenanceRecord:
"""A single record in a data provenance chain."""
record_id: str
data_hash: str
source: str
transformation: str
timestamp: float
previous_hash: str
actor: str
def create_provenance_chain(
records: list[dict],
) -> list[ProvenanceRecord]:
"""
Create a tamper-evident provenance chain (similar to a blockchain).
Each record contains the hash of the previous record, making
it impossible to modify earlier records without 偵測.
"""
chain = []
previous_hash = "genesis"
for record_data in records:
record_content = json.dumps(record_data, sort_keys=True)
data_hash = hashlib.sha256(record_content.encode()).hexdigest()
chain_input = f"{data_hash}:{previous_hash}:{record_data.get('timestamp', 0)}"
record_id = hashlib.sha256(chain_input.encode()).hexdigest()[:16]
record = ProvenanceRecord(
record_id=record_id,
data_hash=data_hash,
source=record_data.get("source", "unknown"),
transformation=record_data.get("transformation", "none"),
timestamp=record_data.get("timestamp", time.time()),
previous_hash=previous_hash,
actor=record_data.get("actor", "system"),
)
chain.append(record)
previous_hash = record_id
return chain
def verify_provenance_chain(
chain: list[ProvenanceRecord],
) -> dict:
"""Verify the integrity of a provenance chain."""
issues = []
for i, record in enumerate(chain):
# Verify chain linkage
if i == 0:
if record.previous_hash != "genesis":
issues.append(f"Record {i}: invalid genesis (expected 'genesis')")
else:
if record.previous_hash != chain[i - 1].record_id:
issues.append(
f"Record {i}: broken chain link "
f"(expected {chain[i - 1].record_id}, got {record.previous_hash})"
)
# Verify temporal ordering
if i > 0 and record.timestamp < chain[i - 1].timestamp:
issues.append(f"Record {i}: timestamp out of order")
return {
"chain_length": len(chain),
"is_valid": len(issues) == 0,
"issues": issues,
}
# Demonstration
records = [
{"source": "wikipedia", "transformation": "crawl", "timestamp": 1000.0, "actor": "crawler"},
{"source": "wikipedia", "transformation": "dedup", "timestamp": 1001.0, "actor": "pipeline"},
{"source": "wikipedia", "transformation": "filter", "timestamp": 1002.0, "actor": "pipeline"},
{"source": "wikipedia", "transformation": "tokenize", "timestamp": 1003.0, "actor": "pipeline"},
]
chain = create_provenance_chain(records)
valid_result = verify_provenance_chain(chain)
print(f"Valid chain: {valid_result['is_valid']}")
# Tamper with the chain (modify an intermediate record)
tampered_chain = list(chain)
tampered_chain[1] = ProvenanceRecord(
record_id="tampered_id",
data_hash="fake_hash",
source="malicious_source",
transformation="poison",
timestamp=1001.0,
previous_hash=chain[0].record_id,
actor="攻擊者",
)
tampered_result = verify_provenance_chain(tampered_chain)
print(f"Tampered chain: {tampered_result['is_valid']}")
print(f"Issues: {tampered_result['issues']}")Defensive Strategies
Building Robust Attribution Systems
- Multi-method attribution: Do not rely on a single attribution method. Combine influence functions, TracIn, TRAK, and representation-based methods. Poisoned examples that evade one method may be detected by another.
- Continuous attribution monitoring: Compute attributions periodically during training, not just after training completes. This reduces the window for timing-based evasion attacks.
- Adversarial attribution testing: Red team the attribution system itself by crafting examples designed to evade detection and measuring the detection rate.
- Differential privacy as a defense: Training with differential privacy bounds the influence of any single training example, limiting both membership inference success and the impact of data poisoning. The privacy-utility tradeoff must be carefully calibrated.
- Provenance chain integrity: Implement tamper-evident provenance tracking for all training data, with cryptographic verification at each processing stage.
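The multi-method strategy can be sketched as a simple score ensemble: an influence-function score at the final checkpoint is combined with a TracIn-style score (learning-rate-weighted gradient dot products summed over checkpoints, following Pruthi et al. 2020), and an example is flagged if either score is an outlier. The synthetic gradients, the planted poison, and the 3-sigma flagging rule below are illustrative assumptions, not a production detector.

```python
"""
Sketch of multi-method attribution: flag a training example if EITHER
an influence-function score OR a TracIn-style score marks it as an
outlier. Synthetic gradients; thresholds are illustrative assumptions.
"""
import numpy as np

def influence_scores(train_grads, test_grad, h_inv):
    # Influence-function score at a single (final) checkpoint
    return -train_grads @ (h_inv @ test_grad)

def tracin_scores(ckpt_grads, ckpt_test_grads, lr=0.1):
    # TracIn-style score: learning-rate-weighted gradient dot products
    # summed over training checkpoints (Pruthi et al. 2020)
    return sum(lr * (g @ tg) for g, tg in zip(ckpt_grads, ckpt_test_grads))

rng = np.random.default_rng(7)
n, d = 300, 16
test_grad = rng.normal(0.0, 0.01, d)
ckpt_grads = [rng.normal(0.0, 0.01, (n, d)) for _ in range(3)]   # 3 checkpoints
ckpt_test = [test_grad + rng.normal(0.0, 0.002, d) for _ in range(3)]

# Plant a poison: orthogonal to the final test gradient (so its
# final-checkpoint influence is ~0) but strongly aligned earlier.
idx = 0
poison = rng.normal(0.0, 0.01, d)
poison -= (poison @ test_grad) / (test_grad @ test_grad) * test_grad
ckpt_grads[-1][idx] = poison
for grads, tg in zip(ckpt_grads[:-1], ckpt_test[:-1]):
    grads[idx] = 5.0 * tg

def flag_outliers(scores, z=3.0):
    mu, sd = scores.mean(), scores.std() + 1e-12
    return np.abs(scores - mu) > z * sd

inf = influence_scores(ckpt_grads[-1], test_grad, np.eye(d))
tri = tracin_scores(ckpt_grads, ckpt_test)
ensemble_flags = flag_outliers(inf) | flag_outliers(tri)

print(f"Influence flags the poison: {bool(flag_outliers(inf)[idx])}")
print(f"TracIn flags the poison:    {bool(flag_outliers(tri)[idx])}")
print(f"Ensemble flags the poison:  {bool(ensemble_flags[idx])}")
```

The poison evades the final-checkpoint influence score by construction, but its checkpoint history gives it away to the TracIn-style score — the complementarity the multi-method bullet relies on.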
References
- Koh, P. W., & Liang, P. (2017). "Understanding Black-Box Predictions via Influence Functions." ICML 2017.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Carlini, N., et al. (2022). "Membership Inference Attacks From First Principles." IEEE S&P 2022.
- Park, S., et al. (2023). "TRAK: Attributing Model Behavior at Scale." ICML 2023.
- Pruthi, G., et al. (2020). "Estimating Training Data Influence by Tracing Gradient Descent." NeurIPS 2020.