Security of Training Data Attribution Methods
Analysis of vulnerabilities in training data attribution techniques including influence functions, membership inference, and data provenance tracking, with implications for privacy and security.
Overview
Training data attribution methods attempt to answer a critical question: which training examples most influenced a model's behavior on a given input? This question has security implications in both directions. Attribution methods are used defensively to trace model failures back to problematic training data, detect data poisoning, and comply with data governance regulations. But attribution methods are also an attack surface: an adversary can exploit attribution systems to evade data poisoning detection, conduct membership inference attacks to determine whether specific data was used in training, or manipulate provenance records to obscure the origins of training data.
The foundational work on influence functions by Koh and Liang (2017) in "Understanding Black-Box Predictions via Influence Functions" established the mathematical framework for attributing model predictions to training examples. Their approach uses inverse Hessian-vector products to estimate how upweighting or removing a single training example would change the model's loss on a test point. While powerful, this approach has known limitations that create security vulnerabilities: it relies on convexity assumptions that do not hold for deep networks, it is computationally expensive (making comprehensive auditing difficult), and it can be manipulated by adversarially crafted training examples.
Carlini et al. (2021) demonstrated in "Extracting Training Data from Large Language Models" that the connection between training data and model outputs is both more direct and more exploitable than previously understood. Their work on training data extraction showed that membership inference — determining whether a specific example was in the training set — is practical against large language models, raising both privacy and security concerns.
Training Data Attribution Methods
Influence Functions
Influence functions estimate the effect of a single training example on the model's prediction by computing a first-order approximation of the leave-one-out retraining procedure.
"""
Influence function computation and security analysis.
Demonstrates the computation and limitations of influence functions
for training data attribution.
"""
import numpy as np
from dataclasses import dataclass
from typing import Optional
@dataclass
class InfluenceResult:
"""Result of influence function computation."""
test_index: int
training_index: int
influence_score: float
    is_helpful: bool  # Negative influence score = helpful (reduces test loss)
computation_method: str
confidence: float
def compute_influence_approximation(
training_gradients: np.ndarray,
test_gradient: np.ndarray,
hessian_inverse_approx: np.ndarray,
) -> np.ndarray:
"""
Compute influence function approximation.
The influence of training example z_i on test loss at z_test is:
I(z_i, z_test) = -grad_test^T @ H^{-1} @ grad_i
where H is the Hessian of the training loss and grad_i, grad_test
are the per-example gradients.
Args:
training_gradients: Shape (num_train, param_dim).
test_gradient: Shape (param_dim,).
hessian_inverse_approx: Shape (param_dim, param_dim).
Approximation of the inverse Hessian.
Returns:
Influence scores for each training example.
"""
# ihvp = H^{-1} @ grad_test (inverse Hessian-vector product)
ihvp = hessian_inverse_approx @ test_gradient
# Influence = -grad_train_i^T @ ihvp for each training example
influences = -training_gradients @ ihvp
return influences
def stochastic_hessian_inverse(
    training_gradients: np.ndarray,
    damping: float = 0.01,
    scale: float = 1.0,
) -> np.ndarray:
    """
    Compute an approximation of the inverse Hessian.

    Large-scale implementations use the truncated Neumann series (LiSSA):
        H^{-1} ≈ (1/scale) * sum_{j=0}^{J} (I - H/scale)^j
    For this small demonstration we instead form a damped Gauss-Newton
    approximation of H from per-example gradients and invert it directly.
    Either way, the approximation error affects attribution accuracy.
    """
n, d = training_gradients.shape
# Approximate H using the outer product of gradients
# H ≈ (1/n) * sum_i grad_i @ grad_i^T + damping * I
H_approx = (training_gradients.T @ training_gradients) / n
H_approx += damping * np.eye(d)
# Direct inversion for small dimensions (demonstration)
# In practice, use iterative methods for large models
try:
H_inv = np.linalg.inv(H_approx) * scale
except np.linalg.LinAlgError:
H_inv = np.linalg.pinv(H_approx) * scale
return H_inv
# Demonstration
np.random.seed(42)
num_train, param_dim = 100, 32
# Simulate per-example gradients
train_grads = np.random.randn(num_train, param_dim) * 0.01
test_grad = np.random.randn(param_dim) * 0.01
# Compute influence scores
H_inv = stochastic_hessian_inverse(train_grads, damping=0.01)
influences = compute_influence_approximation(train_grads, test_grad, H_inv)
# Report top influential training examples
top_helpful = np.argsort(influences)[:5] # Most negative = most helpful
top_harmful = np.argsort(influences)[-5:][::-1] # Most positive = most harmful
print("Most helpful training examples (reduce test loss):")
for idx in top_helpful:
print(f" Training example {idx}: influence = {influences[idx]:.6f}")
print("\nMost harmful training examples (increase test loss):")
for idx in top_harmful:
    print(f" Training example {idx}: influence = {influences[idx]:.6f}")

Limitations That Create Security Vulnerabilities
Influence functions have several known limitations that create exploitable security vulnerabilities:
- Non-convexity: Influence functions assume a convex loss landscape. Deep networks are highly non-convex, making the Hessian approximation unreliable. An attacker can craft training examples that appear benign under influence function analysis but have significant actual impact.
- Computational cost: Computing influence for all training examples is expensive; exact methods require inverse Hessian-vector products whose cost grows with both model and dataset size. In practice, organizations use approximations that reduce accuracy, creating blind spots an attacker can exploit.
- Stale attributions: Influence functions are computed at a specific model checkpoint. As training continues, the attributions become stale. An attacker can time their poisoning to exploit the gap between attribution computations.
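The staleness limitation can be made concrete with a small simulation. This is a sketch on synthetic gradients, not a measurement on a real model: `influence_scores`, the drift scale, and the checkpoint setup are all illustrative assumptions. It computes influence against two gradient snapshots representing different checkpoints and measures how much the attribution ranking drifts between them.

```python
import numpy as np

def influence_scores(train_grads: np.ndarray, test_grad: np.ndarray,
                     damping: float = 0.01) -> np.ndarray:
    """Influence scores using a damped Gauss-Newton inverse-Hessian approximation."""
    n, d = train_grads.shape
    H = train_grads.T @ train_grads / n + damping * np.eye(d)
    return -train_grads @ np.linalg.solve(H, test_grad)

rng = np.random.default_rng(0)
n, d = 200, 32
grads_t0 = rng.normal(scale=0.01, size=(n, d))  # per-example gradients at checkpoint t0
test_grad = rng.normal(scale=0.01, size=d)

# Simulate continued training: gradients drift between checkpoints
grads_t1 = grads_t0 + rng.normal(scale=0.005, size=(n, d))

scores_t0 = influence_scores(grads_t0, test_grad)
scores_t1 = influence_scores(grads_t1, test_grad)

# Spearman rank correlation between the two attribution rankings
rank_t0 = np.argsort(np.argsort(scores_t0))
rank_t1 = np.argsort(np.argsort(scores_t1))
rho = float(np.corrcoef(rank_t0, rank_t1)[0, 1])

# Overlap of the top-10 highest-influence sets at the two checkpoints
top_t0 = set(np.argsort(scores_t0)[-10:])
top_t1 = set(np.argsort(scores_t1)[-10:])

print(f"Rank correlation between checkpoint attributions: {rho:.3f}")
print(f"Top-10 overlap across checkpoints: {len(top_t0 & top_t1)}/10")
```

The wider the gap between attribution recomputations, the further the ranking drifts, and the more room a timing-aware attacker has to slip poisoned examples past a stale audit.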
"""
Exploiting influence function limitations.
Demonstrates how an attacker can craft training examples that
evade influence function-based detection.
"""
import numpy as np
from typing import Optional
def craft_influence_evading_poison(
    clean_gradients: np.ndarray,
    target_gradient: np.ndarray,
    hessian_inverse: np.ndarray,
    detection_threshold: float = 0.1,
) -> Optional[np.ndarray]:
    """
    Craft a poisoned training example whose gradient evades
    influence function detection while still having the desired
    effect on model behavior.

    Strategy: decompose the poisoned gradient into a component
    that influence functions detect and a component they do not
    (orthogonal to H^{-1} @ test_gradient).

    Args:
        clean_gradients: Gradients of clean training examples.
        target_gradient: The gradient direction we want to inject.
        hessian_inverse: Approximate inverse Hessian.
        detection_threshold: Maximum influence score to avoid detection.

    Returns:
        A poisoned gradient that evades influence detection, or None
        if evasion is not possible within the constraints.
    """
# The detection direction: H^{-1} @ test_grad
# Influence = -poison_grad^T @ H^{-1} @ test_grad
# To evade: make poison_grad orthogonal to H^{-1} @ test_grad
# while keeping it aligned with target_gradient
test_grad_mean = clean_gradients.mean(axis=0)
detection_direction = hessian_inverse @ test_grad_mean
detection_direction /= np.linalg.norm(detection_direction) + 1e-10
# Project target gradient to remove the detectable component
target_norm = target_gradient / (np.linalg.norm(target_gradient) + 1e-10)
detectable_component = np.dot(target_norm, detection_direction) * detection_direction
evasive_gradient = target_norm - detectable_component
# Verify evasion
if np.linalg.norm(evasive_gradient) < 1e-10:
return None # Cannot evade (target is aligned with detection direction)
evasive_gradient /= np.linalg.norm(evasive_gradient)
# Scale to have similar magnitude to clean gradients
clean_norm = np.mean([np.linalg.norm(g) for g in clean_gradients])
evasive_gradient *= clean_norm
# Verify influence score is below threshold
influence = -evasive_gradient @ hessian_inverse @ test_grad_mean
if abs(influence) < detection_threshold:
return evasive_gradient
return None
# Demonstration
np.random.seed(42)
n, d = 100, 32
clean_grads = np.random.randn(n, d) * 0.01
target_grad = np.random.randn(d) * 0.01 # What we want to inject
H_inv = stochastic_hessian_inverse(clean_grads, damping=0.01)
evasive = craft_influence_evading_poison(
clean_grads, target_grad, H_inv, detection_threshold=0.001
)
if evasive is not None:
# Check influence score
test_grad = clean_grads.mean(axis=0)
influence = -evasive @ H_inv @ test_grad
alignment = np.dot(
evasive / np.linalg.norm(evasive),
target_grad / np.linalg.norm(target_grad)
)
print(f"Evasive gradient crafted:")
print(f" Influence score: {influence:.6f} (below threshold)")
print(f" Alignment with target: {alignment:.3f}")
print(f" Norm: {np.linalg.norm(evasive):.4f} (matches clean: {np.mean([np.linalg.norm(g) for g in clean_grads]):.4f})")
else:
    print("Could not craft an evasive gradient (target aligned with detection)")

Membership Inference Attacks
Exploiting Attribution for Privacy Attacks
Membership inference asks: "Was a specific data point used to train this model?" This is both a privacy attack and a security diagnostic. For red teaming, membership inference can reveal what data an organization used to train their model, potentially exposing confidential datasets or regulatory violations.
"""
Membership inference attack implementation.
Tests whether specific examples were used in model training
based on model behavior differences.
"""
import numpy as np
from dataclasses import dataclass
from typing import Optional
@dataclass
class MembershipInferenceResult:
"""Result of a membership inference attack."""
example_id: str
predicted_member: bool
confidence: float
loss_value: float
loss_threshold: float
method: str
def loss_based_membership_inference(
target_losses: np.ndarray,
reference_losses: np.ndarray,
threshold_percentile: float = 50.0,
) -> tuple[np.ndarray, float]:
"""
Simple loss-based membership inference.
The key insight: training examples tend to have lower loss
than non-training examples because the model has been optimized
to reduce loss on the training set.
Args:
target_losses: Per-example losses for examples to classify.
reference_losses: Losses for known non-member examples.
threshold_percentile: Percentile of reference losses to use
as the decision threshold.
Returns:
(predictions, threshold) — boolean predictions and the threshold used.
"""
threshold = np.percentile(reference_losses, threshold_percentile)
predictions = target_losses < threshold # Low loss = likely member
return predictions, float(threshold)
def calibrated_membership_inference(
target_losses: np.ndarray,
shadow_member_losses: np.ndarray,
shadow_nonmember_losses: np.ndarray,
) -> np.ndarray:
"""
Calibrated membership inference using shadow model statistics.
Uses the distribution of losses from a shadow model (trained
on similar data) to calibrate the membership decision.
Members and non-members have different loss distributions,
and the shadow model helps estimate these distributions.
    A simplified, population-level variant of the per-example
    likelihood-ratio attack (LiRA) from Carlini et al. (2022),
    "Membership Inference Attacks From First Principles."
"""
# Fit Gaussian distributions for member and non-member losses
member_mean = np.mean(shadow_member_losses)
member_std = np.std(shadow_member_losses) + 1e-10
nonmember_mean = np.mean(shadow_nonmember_losses)
nonmember_std = np.std(shadow_nonmember_losses) + 1e-10
    # Gaussian likelihoods for each target example; the 1/sqrt(2*pi)
    # normalization constant is omitted because it cancels in the ratio
member_likelihood = np.exp(
-0.5 * ((target_losses - member_mean) / member_std) ** 2
) / member_std
nonmember_likelihood = np.exp(
-0.5 * ((target_losses - nonmember_mean) / nonmember_std) ** 2
) / nonmember_std
# Membership probability via Bayes rule (assuming 50/50 prior)
membership_prob = member_likelihood / (
member_likelihood + nonmember_likelihood + 1e-10
)
return membership_prob
def evaluate_membership_inference(
predictions: np.ndarray,
ground_truth: np.ndarray,
) -> dict:
"""Evaluate membership inference attack accuracy."""
tp = np.sum(predictions & ground_truth)
fp = np.sum(predictions & ~ground_truth)
fn = np.sum(~predictions & ground_truth)
tn = np.sum(~predictions & ~ground_truth)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
accuracy = (tp + tn) / len(predictions) if len(predictions) > 0 else 0.0
    # TPR and FPR at this single operating point. The security-relevant
    # metric is TPR at low FPR, which requires sweeping the decision
    # threshold; these point estimates are only a proxy for that curve.
fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
return {
"accuracy": float(accuracy),
"precision": float(precision),
"recall": float(recall),
"true_positive_rate": float(tpr),
"false_positive_rate": float(fpr),
"advantage": float(tpr - fpr), # TPR - FPR (random = 0)
}
# Demonstration
np.random.seed(42)
# Simulate losses: members have lower loss than non-members
member_losses = np.random.normal(1.5, 0.3, 200) # Training data
nonmember_losses = np.random.normal(2.0, 0.4, 200) # Not in training
# Shadow model losses for calibration
shadow_member = np.random.normal(1.6, 0.35, 500)
shadow_nonmember = np.random.normal(2.1, 0.45, 500)
# Mix members and non-members as target
target_losses = np.concatenate([member_losses[:100], nonmember_losses[:100]])
ground_truth = np.concatenate([
np.ones(100, dtype=bool),
np.zeros(100, dtype=bool),
])
# Simple loss-based attack
simple_preds, threshold = loss_based_membership_inference(
target_losses, nonmember_losses[100:], threshold_percentile=50
)
simple_eval = evaluate_membership_inference(simple_preds, ground_truth)
# Calibrated attack
membership_probs = calibrated_membership_inference(
target_losses, shadow_member, shadow_nonmember
)
calibrated_preds = membership_probs > 0.5
calibrated_eval = evaluate_membership_inference(calibrated_preds, ground_truth)
print("Loss-based MI attack:")
print(f" Accuracy: {simple_eval['accuracy']:.1%}")
print(f" Advantage: {simple_eval['advantage']:.3f}")
print("\nCalibrated MI attack:")
print(f" Accuracy: {calibrated_eval['accuracy']:.1%}")
print(f" Advantage: {calibrated_eval['advantage']:.3f}")

Data Provenance Attacks
Manipulating Attribution Records
Data provenance systems track the lineage of training data — where it came from, how it was processed, and who was responsible for each step. These systems are increasingly important for regulatory compliance (GDPR right to explanation, EU AI Act data governance). An attacker who can manipulate provenance records can obscure the source of poisoned data or falsely attribute data to legitimate sources.
"""
Data provenance integrity verification.
Implements a tamper-evident provenance chain for training data.
"""
import hashlib
import json
import time
from dataclasses import dataclass
@dataclass
class ProvenanceRecord:
"""A single record in a data provenance chain."""
record_id: str
data_hash: str
source: str
transformation: str
timestamp: float
previous_hash: str
actor: str
def create_provenance_chain(
records: list[dict],
) -> list[ProvenanceRecord]:
    """
    Create a tamper-evident provenance chain (similar to a blockchain).

    Each record embeds the hash of the previous record, so modifying
    an earlier record without detection would require recomputing
    every subsequent record in the chain.
    """
chain = []
previous_hash = "genesis"
for record_data in records:
record_content = json.dumps(record_data, sort_keys=True)
data_hash = hashlib.sha256(record_content.encode()).hexdigest()
chain_input = f"{data_hash}:{previous_hash}:{record_data.get('timestamp', 0)}"
record_id = hashlib.sha256(chain_input.encode()).hexdigest()[:16]
record = ProvenanceRecord(
record_id=record_id,
data_hash=data_hash,
source=record_data.get("source", "unknown"),
transformation=record_data.get("transformation", "none"),
timestamp=record_data.get("timestamp", time.time()),
previous_hash=previous_hash,
actor=record_data.get("actor", "system"),
)
chain.append(record)
previous_hash = record_id
return chain
def verify_provenance_chain(
chain: list[ProvenanceRecord],
) -> dict:
    """
    Verify the integrity of a provenance chain.

    Checks chain linkage and temporal ordering. A production verifier
    would also recompute each record_id from the stored data_hash,
    previous_hash, and timestamp to detect in-place edits.
    """
issues = []
for i, record in enumerate(chain):
# Verify chain linkage
if i == 0:
if record.previous_hash != "genesis":
issues.append(f"Record {i}: invalid genesis (expected 'genesis')")
else:
if record.previous_hash != chain[i - 1].record_id:
issues.append(
f"Record {i}: broken chain link "
f"(expected {chain[i - 1].record_id}, got {record.previous_hash})"
)
# Verify temporal ordering
if i > 0 and record.timestamp < chain[i - 1].timestamp:
issues.append(f"Record {i}: timestamp out of order")
return {
"chain_length": len(chain),
"is_valid": len(issues) == 0,
"issues": issues,
}
# Demonstration
records = [
{"source": "wikipedia", "transformation": "crawl", "timestamp": 1000.0, "actor": "crawler"},
{"source": "wikipedia", "transformation": "dedup", "timestamp": 1001.0, "actor": "pipeline"},
{"source": "wikipedia", "transformation": "filter", "timestamp": 1002.0, "actor": "pipeline"},
{"source": "wikipedia", "transformation": "tokenize", "timestamp": 1003.0, "actor": "pipeline"},
]
chain = create_provenance_chain(records)
valid_result = verify_provenance_chain(chain)
print(f"Valid chain: {valid_result['is_valid']}")
# Tamper with the chain (modify an intermediate record)
tampered_chain = list(chain)
tampered_chain[1] = ProvenanceRecord(
record_id="tampered_id",
data_hash="fake_hash",
source="malicious_source",
transformation="poison",
timestamp=1001.0,
previous_hash=chain[0].record_id,
actor="attacker",
)
tampered_result = verify_provenance_chain(tampered_chain)
print(f"Tampered chain: {tampered_result['is_valid']}")
print(f"Issues: {tampered_result['issues']}")

Defensive Strategies
Building Robust Attribution Systems
- Multi-method attribution: Do not rely on a single attribution method. Combine influence functions, TracIn, TRAK, and representation-based methods. Poisoned examples that evade one method may be detected by another.
- Continuous attribution monitoring: Compute attributions periodically during training, not just after training completes. This reduces the window for timing-based evasion attacks.
- Adversarial attribution testing: Red team the attribution system itself by crafting examples designed to evade detection and measuring the detection rate.
- Differential privacy as defense: Training with differential privacy bounds the influence of any single training example, limiting both membership inference success and the impact of data poisoning. The privacy-utility tradeoff must be carefully calibrated.
- Provenance chain integrity: Implement tamper-evident provenance tracking for all training data, with cryptographic verification at each processing stage.
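The bounding effect of differential privacy on per-example influence can be sketched with DP-SGD-style gradient processing. This is an illustrative sketch, not a tuned or formally accounted DP implementation: `dp_aggregate_gradients` is a hypothetical helper, and the clip norm and noise multiplier below are arbitrary demonstration values.

```python
import numpy as np

def dp_aggregate_gradients(per_example_grads, clip_norm=1.0,
                           noise_multiplier=1.0, rng=None):
    """DP-SGD-style aggregation: clip each per-example gradient to
    clip_norm, then add Gaussian noise scaled to the clipping bound."""
    if rng is None:
        rng = np.random.default_rng(0)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    factors = np.minimum(1.0, clip_norm / (norms + 1e-12))
    clipped = per_example_grads * factors  # every row now has norm <= clip_norm
    noise = rng.normal(scale=noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return clipped.sum(axis=0) + noise, clipped

rng = np.random.default_rng(0)
grads = rng.normal(scale=0.01, size=(100, 32))
grads[0] *= 500.0  # a poisoned / outlier gradient with outsized influence

agg, clipped = dp_aggregate_gradients(grads, clip_norm=0.05,
                                      noise_multiplier=0.5, rng=rng)

# The outlier's contribution to the aggregate is capped at clip_norm,
# regardless of its original magnitude.
print(f"Outlier norm before clipping: {np.linalg.norm(grads[0]):.2f}")
print(f"Outlier norm after clipping:  {np.linalg.norm(clipped[0]):.4f}")
```

Because every example's contribution is capped at the same clipping bound, no single poisoned example can dominate an update, and loss-gap signals exploited by membership inference are correspondingly dampened.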
References
- Koh, P. W., & Liang, P. (2017). "Understanding Black-Box Predictions via Influence Functions." ICML 2017.
- Carlini, N., et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium 2021.
- Carlini, N., et al. (2022). "Membership Inference Attacks From First Principles." IEEE S&P 2022.
- Park, S., et al. (2023). "TRAK: Attributing Model Behavior at Scale." ICML 2023.
- Pruthi, G., et al. (2020). "Estimating Training Data Influence by Tracing Gradient Descent." NeurIPS 2020.