Training Data Provenance Forensics
Forensic techniques for tracing the origins, lineage, and integrity of training data used in machine learning models.
Overview
Training data provenance forensics is the practice of investigating the origins, transformations, and integrity of data used to train or fine-tune machine learning models. When a model behaves unexpectedly -- producing biased outputs, leaking private information, or responding to backdoor triggers -- the root cause often lies in the training data. Forensic investigation of training data provenance answers critical questions: Where did this data come from? Was it modified after collection? Did unauthorized data enter the training pipeline? Can we prove which data influenced specific model behaviors?
This discipline sits at the intersection of traditional data forensics, supply chain security, and ML-specific concerns. The EU AI Act (which entered into force in August 2024) mandates that providers of high-risk AI systems maintain documentation of training data, including "data collection processes, the origin of data, and in the case of personal data, the original purpose of the data collection." Provenance forensics provides the investigative capability to verify these claims or detect violations.
The challenge is scale: modern language models are trained on datasets containing billions of text samples from millions of sources. Vision models may be trained on hundreds of millions of images. Provenance tracking at this scale requires automated, cryptographic, and statistical approaches rather than manual review.
Training Data Lifecycle
Data Collection
The first forensic concern is the collection phase. Data enters ML training pipelines from diverse sources:
- Web scraping: Common Crawl, custom web scrapers, API-based data collection
- Licensed datasets: Commercially licensed data from data brokers or content providers
- Synthetic data: Data generated by other ML models
- User-contributed data: Feedback, annotations, conversation logs
- Internal data: Organizational data repurposed for ML training
Each source has different provenance characteristics and different risks. Web-scraped data may contain copyrighted material or poisoned content. Licensed data may have usage restrictions that affect model distribution. Synthetic data carries provenance from its generating model. User data has privacy implications.
Data Preprocessing
Preprocessing transforms raw data into training-ready format through operations such as:
- Text cleaning, normalization, and deduplication
- Image resizing, cropping, and augmentation
- Feature extraction and embedding computation
- Label assignment and quality filtering
- Train/validation/test splitting
Each preprocessing step is a potential point of evidence loss or manipulation. A forensic investigator must be able to trace data through every transformation.
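One lightweight way to preserve that trail is to hash each sample before and after every transformation, so any step can later be audited or replayed. The sketch below is illustrative only; the `record_step` helper and the step names are hypothetical, not part of any standard tooling:

```python
import hashlib

def record_step(step_name: str, input_bytes: bytes, output_bytes: bytes) -> dict:
    """Record one preprocessing step with input/output content hashes."""
    return {
        "step": step_name,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
    }

# Example: trace a text sample through two cleaning steps.
raw = b"  Hello,   WORLD  "
normalized = b" ".join(raw.split())  # collapse whitespace
lowered = normalized.lower()         # lowercase

trail = [
    record_step("normalize_whitespace", raw, normalized),
    record_step("lowercase", normalized, lowered),
]

# The trail chains: each step's input hash equals the previous step's
# output hash, so a gap or substitution anywhere is detectable.
assert trail[1]["input_sha256"] == trail[0]["output_sha256"]
```

An investigator who holds such a trail can recompute any single step and pinpoint exactly where observed data diverges from the documented pipeline.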
Data Storage and Versioning
Training datasets should be stored with integrity guarantees. The forensic investigator needs to verify that the data used for training matches the documented dataset version.
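A minimal sketch of such a check, assuming a simple hash-and-compare scheme over dataset files (the `snapshot_hash` helper and shard filenames are hypothetical):

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot_hash(paths: list[Path]) -> str:
    """Digest of each file's digest, in sorted path order, for a dataset snapshot."""
    outer = hashlib.sha256()
    for p in sorted(paths):
        outer.update(hashlib.sha256(p.read_bytes()).digest())
    return outer.hexdigest()

# Example: record a hash at training time, verify it during an investigation.
tmp = Path(tempfile.mkdtemp())
(tmp / "shard_0.txt").write_bytes(b"sample a\nsample b\n")
(tmp / "shard_1.txt").write_bytes(b"sample c\n")

recorded = snapshot_hash(list(tmp.glob("*.txt")))

# Later, a forensic check recomputes the hash over the stored files.
assert snapshot_hash(list(tmp.glob("*.txt"))) == recorded

# Any post-hoc modification changes the snapshot hash.
(tmp / "shard_1.txt").write_bytes(b"sample c TAMPERED\n")
assert snapshot_hash(list(tmp.glob("*.txt"))) != recorded
```

The recorded hash only has evidentiary value if it is stored outside the system that holds the data, for example in a signed release record or an append-only log.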
Provenance Tracking Infrastructure
Cryptographic Data Manifests
A data manifest is a structured record that associates each data sample with its provenance metadata and an integrity hash. The manifest enables forensic verification of dataset contents without storing the data itself.
"""
Training data provenance tracking module.
Provides cryptographic integrity verification and provenance
tracking for ML training datasets.
"""
import hashlib
import json
import time
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Any
@dataclass
class DataSampleProvenance:
"""Provenance record for a single training sample."""
sample_id: str
content_hash_sha256: str
source_url: str | None = None
source_dataset: str | None = None
collection_timestamp: float | None = None
license: str | None = None
preprocessing_steps: list[str] = field(default_factory=list)
labels: dict[str, Any] = field(default_factory=dict)
metadata: dict[str, Any] = field(default_factory=dict)
@dataclass
class DatasetManifest:
"""Cryptographic manifest for a complete training dataset."""
manifest_id: str
creation_timestamp: float
dataset_name: str
dataset_version: str
total_samples: int
manifest_hash: str # Hash of all sample hashes, providing tamper evidence
samples: list[DataSampleProvenance]
source_summary: dict[str, int] = field(default_factory=dict)
class ProvenanceTracker:
"""Track and verify training data provenance."""
def __init__(self, manifest_dir: str):
self.manifest_dir = Path(manifest_dir)
self.manifest_dir.mkdir(parents=True, exist_ok=True)
def hash_content(self, content: bytes) -> str:
return hashlib.sha256(content).hexdigest()
def create_sample_record(
self,
sample_id: str,
content: bytes,
source_url: str | None = None,
source_dataset: str | None = None,
license_info: str | None = None,
preprocessing: list[str] | None = None,
) -> DataSampleProvenance:
return DataSampleProvenance(
sample_id=sample_id,
content_hash_sha256=self.hash_content(content),
source_url=source_url,
source_dataset=source_dataset,
collection_timestamp=time.time(),
license=license_info,
preprocessing_steps=preprocessing or [],
)
def create_manifest(
self,
dataset_name: str,
dataset_version: str,
samples: list[DataSampleProvenance],
) -> DatasetManifest:
# Compute manifest hash as Merkle-like hash of all sample hashes
hash_chain = hashlib.sha256()
for sample in sorted(samples, key=lambda s: s.sample_id):
hash_chain.update(sample.content_hash_sha256.encode())
manifest_hash = hash_chain.hexdigest()
# Summarize sources
source_counts: dict[str, int] = {}
for sample in samples:
source = sample.source_dataset or sample.source_url or "unknown"
source_counts[source] = source_counts.get(source, 0) + 1
manifest = DatasetManifest(
manifest_id=f"{dataset_name}-{dataset_version}-{manifest_hash[:12]}",
creation_timestamp=time.time(),
dataset_name=dataset_name,
dataset_version=dataset_version,
total_samples=len(samples),
manifest_hash=manifest_hash,
samples=samples,
source_summary=source_counts,
)
# Save manifest
manifest_path = self.manifest_dir / f"{manifest.manifest_id}.json"
manifest_path.write_text(json.dumps(asdict(manifest), default=str, indent=2))
return manifest
def verify_manifest(
self,
manifest: DatasetManifest,
data_samples: dict[str, bytes],
) -> dict:
"""
Verify a dataset against its manifest.
Checks that all samples are present and their content
hashes match the recorded values.
"""
results = {
"total_samples": manifest.total_samples,
"verified": 0,
"missing": [],
"hash_mismatches": [],
"unexpected_samples": [],
}
manifest_ids = {s.sample_id for s in manifest.samples}
provided_ids = set(data_samples.keys())
results["missing"] = list(manifest_ids - provided_ids)
results["unexpected_samples"] = list(provided_ids - manifest_ids)
sample_map = {s.sample_id: s for s in manifest.samples}
for sample_id, content in data_samples.items():
if sample_id not in sample_map:
continue
expected_hash = sample_map[sample_id].content_hash_sha256
actual_hash = self.hash_content(content)
if actual_hash == expected_hash:
results["verified"] += 1
else:
results["hash_mismatches"].append({
"sample_id": sample_id,
"expected_hash": expected_hash,
"actual_hash": actual_hash,
})
results["integrity_status"] = (
"VERIFIED" if (
not results["missing"]
and not results["hash_mismatches"]
and not results["unexpected_samples"]
) else "COMPROMISED"
)
return resultsPipeline Lineage Tracking
Beyond individual sample provenance, investigators need to trace the full pipeline lineage: which preprocessing code ran, with what parameters, on which data, to produce which training dataset.
@dataclass
class PipelineStep:
    """Record of a single data pipeline processing step."""
    step_id: str
    step_name: str
    timestamp: float
    input_manifest_hash: str
    output_manifest_hash: str
    code_version: str  # Git commit hash of the processing code
    parameters: dict[str, Any]
    environment: dict[str, str]  # Python version, library versions


@dataclass
class PipelineLineage:
    """Complete lineage record for a training dataset."""
    dataset_name: str
    dataset_version: str
    steps: list[PipelineStep]
    final_manifest_hash: str

    def verify_chain(self) -> dict:
        """
        Verify that the pipeline lineage forms a valid chain.

        Each step's input hash should match the previous step's output hash.
        """
        breaks = []
        for i in range(1, len(self.steps)):
            if self.steps[i].input_manifest_hash != self.steps[i - 1].output_manifest_hash:
                breaks.append({
                    "position": i,
                    "expected_input": self.steps[i - 1].output_manifest_hash,
                    "actual_input": self.steps[i].input_manifest_hash,
                    "step_name": self.steps[i].step_name,
                })
        return {
            "chain_length": len(self.steps),
            "chain_valid": len(breaks) == 0,
            "breaks": breaks,
        }
Forensic Investigation Techniques
Membership Inference for Provenance
Membership inference techniques can determine whether a specific data sample was used to train a model. This is forensically valuable when investigating suspected unauthorized data usage -- for example, determining whether copyrighted content or private data was included in training without authorization.
import numpy as np


def loss_based_membership_inference(
    model_losses_on_sample: list[float],
    reference_distribution_mean: float,
    reference_distribution_std: float,
) -> dict:
    """
    Determine if a sample was likely in the training set using loss analysis.

    Training samples typically have lower loss than non-training samples.
    This is a simplified version of the approach from Yeom et al. (2018).

    Args:
        model_losses_on_sample: Loss values from multiple augmentations
            or evaluations of the sample.
        reference_distribution_mean: Mean loss on known non-member samples.
        reference_distribution_std: Std of loss on known non-member samples.
    """
    sample_mean_loss = float(np.mean(model_losses_on_sample))
    # Z-score relative to the non-member distribution; lower-than-reference
    # loss yields a positive score
    z_score = (reference_distribution_mean - sample_mean_loss) / max(
        reference_distribution_std, 1e-10
    )
    return {
        "sample_mean_loss": sample_mean_loss,
        "reference_mean_loss": reference_distribution_mean,
        "z_score": round(z_score, 4),
        "likely_member": z_score > 2.0,
        "confidence": (
            "high" if z_score > 3.0
            else "medium" if z_score > 2.0
            else "low"
        ),
    }
Data Poisoning Detection
When investigating suspected data poisoning, forensic techniques focus on identifying samples that were injected or modified to influence model behavior.
def detect_label_inconsistencies(
    samples: list[dict],
    model_predictions: list[int],
) -> dict:
    """
    Detect potential label-flip poisoning by comparing original labels
    against a reference model's predictions.

    Poisoned samples in label-flip attacks carry labels that disagree
    with what a clean model would predict.
    """
    inconsistencies = []
    for i, (sample, pred) in enumerate(zip(samples, model_predictions)):
        original_label = sample.get("label")
        if original_label is not None and original_label != pred:
            inconsistencies.append({
                "index": i,
                "sample_id": sample.get("id", f"sample_{i}"),
                "original_label": original_label,
                "model_prediction": pred,
                "source": sample.get("source", "unknown"),
            })

    # Analyze inconsistency patterns by source
    source_counts: dict[str, int] = {}
    for inc in inconsistencies:
        src = inc["source"]
        source_counts[src] = source_counts.get(src, 0) + 1

    return {
        "total_samples": len(samples),
        "inconsistencies": len(inconsistencies),
        "inconsistency_rate": len(inconsistencies) / max(len(samples), 1),
        "by_source": source_counts,
        "flagged_samples": inconsistencies[:100],  # Cap the list for manual review
        "poisoning_suspected": len(inconsistencies) / max(len(samples), 1) > 0.05,
    }
Duplicate and Near-Duplicate Detection
Data poisoning attacks often introduce multiple copies of poisoned samples to increase their influence on training. Forensic investigators should scan for suspicious duplication patterns.
from collections import Counter


def detect_suspicious_duplication(
    content_hashes: list[str],
    source_labels: list[str],
    expected_max_duplicates: int = 3,
) -> dict:
    """
    Detect suspicious duplication patterns in training data.

    Legitimate datasets may contain some duplicates, but an unusually
    high duplication rate from a specific source may indicate poisoning.
    """
    hash_counts = Counter(content_hashes)

    # Find over-duplicated samples
    over_duplicated = {
        h: count for h, count in hash_counts.items()
        if count > expected_max_duplicates
    }

    # Analyze duplication by source
    source_dup_rates: dict[str, dict] = {}
    for src in set(source_labels):
        src_hashes = [
            h for h, s in zip(content_hashes, source_labels) if s == src
        ]
        unique = len(set(src_hashes))
        total = len(src_hashes)
        source_dup_rates[src] = {
            "total": total,
            "unique": unique,
            "duplication_rate": 1.0 - (unique / max(total, 1)),
        }

    return {
        "total_samples": len(content_hashes),
        "unique_samples": len(set(content_hashes)),
        "overall_duplication_rate": 1.0 - len(set(content_hashes)) / max(len(content_hashes), 1),
        "over_duplicated_count": len(over_duplicated),
        "max_duplication": max(hash_counts.values()) if hash_counts else 0,
        "by_source": source_dup_rates,
        "suspicious_sources": [
            src for src, info in source_dup_rates.items()
            if info["duplication_rate"] > 0.3
        ],
    }
Regulatory Compliance Forensics
EU AI Act Requirements
The EU AI Act (Regulation (EU) 2024/1689) requires providers of high-risk AI systems to document:
- Training, validation, and testing data sets used
- Data collection methodology and origin
- Data preparation and processing operations
- Relevant data gaps or shortcomings
- Measures taken to detect, prevent, and mitigate biases
Forensic investigators may be called upon to verify these claims. The provenance tracking infrastructure described above provides the evidence base for compliance verification.
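A verification pass can start by measuring how completely the provenance records cover the documentation items above. The sketch below assumes records shaped like the sample provenance fields used earlier in this section; the required field list and the `check_documentation_coverage` helper are illustrative assumptions, not prescribed by the regulation:

```python
def check_documentation_coverage(sample_records: list[dict]) -> dict:
    """
    Report what fraction of provenance records carry the fields a
    documentation review would ask about (origin, license, collection time).
    """
    required = ["source_url", "license", "collection_timestamp"]
    coverage: dict = {}
    for field_name in required:
        present = sum(1 for r in sample_records if r.get(field_name) is not None)
        coverage[field_name] = present / max(len(sample_records), 1)
    # Fully documented only if every required field is present on every record
    coverage["fully_documented"] = all(v == 1.0 for v in coverage.values())
    return coverage

records = [
    {"source_url": "https://example.org/a", "license": "CC-BY-4.0",
     "collection_timestamp": 1700000000.0},
    {"source_url": None, "license": "CC-BY-4.0",
     "collection_timestamp": 1700000100.0},
]
report = check_documentation_coverage(records)
# source_url coverage is 0.5, so this set is not fully documented.
assert report["fully_documented"] is False
```

A coverage report like this does not prove compliance by itself, but it quickly narrows an investigation to the sources and fields where documentation is missing.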
NIST AI RMF Alignment
The NIST AI RMF's MAP function (specifically MAP 3 and MAP 4) addresses data-related risks. Provenance forensics supports the MEASURE function by enabling organizations to assess whether their training data practices match their documented policies.
Case Study: Third-Party Dataset Contamination
A financial services company fine-tunes a language model on a dataset purchased from a third-party data vendor. After deployment, the model begins producing outputs that promote a specific financial product. Investigation proceeds:
- Manifest verification: Compare the dataset as received against the vendor's provided manifest. Result: manifest hashes match, but the manifest itself may have been generated after contamination.
- Content analysis: Statistical analysis of the dataset reveals an anomalously high frequency of references to the promoted product compared to baseline financial text corpora.
- Temporal analysis: The contaminated samples share metadata timestamps within a narrow window, inconsistent with organic data collection.
- Membership inference: Testing confirms that specific promotional text samples were memorized by the model with high confidence, indicating they were in the training data.
- Attribution: The contamination is traced to a compromise of the vendor's data collection pipeline, where an attacker injected promotional content at the web scraping stage.
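The temporal-analysis step in this kind of investigation can be sketched as a sliding-window scan over collection timestamps. The window size and burst threshold below are illustrative assumptions, and `find_timestamp_bursts` is a hypothetical helper rather than a standard tool:

```python
def find_timestamp_bursts(
    timestamps: list[float],
    window_seconds: float = 3600.0,
    min_burst: int = 5,
) -> list[dict]:
    """
    Flag groups of samples whose collection timestamps fall within an
    unusually narrow window. Organic collection tends to spread out over
    time; injected batches often share near-identical timestamps.
    Overlapping windows may be reported more than once; callers typically
    keep the largest burst per region.
    """
    bursts = []
    ts = sorted(timestamps)
    start = 0
    for end in range(len(ts)):
        # Shrink the window until it spans at most window_seconds
        while ts[end] - ts[start] > window_seconds:
            start += 1
        count = end - start + 1
        if count >= min_burst:
            bursts.append({
                "window_start": ts[start],
                "window_end": ts[end],
                "count": count,
            })
    return bursts

# Example: 3 organic samples spread over days, plus 6 injected
# samples collected within ten minutes of each other.
organic = [0.0, 90000.0, 200000.0]
injected = [50000.0 + i * 100 for i in range(6)]
bursts = find_timestamp_bursts(organic + injected, window_seconds=600.0, min_burst=5)
assert bursts and bursts[-1]["count"] == 6
```

In practice the thresholds would be calibrated against the collection cadence documented in the dataset's provenance records rather than fixed constants.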
Tools and Frameworks
- DVC (Data Version Control): Open-source tool for versioning datasets and ML pipelines with Git-like semantics. Useful for establishing data lineage.
- MLflow: Tracks experiments including dataset versions, enabling retrospective provenance analysis.
- Weights & Biases: Provides dataset versioning and artifact tracking with cryptographic integrity verification.
- C2PA (Coalition for Content Provenance and Authenticity): Standard for content provenance that can be applied to training data. Supported by Adobe, Microsoft, and others.
References
- Yeom, S., Giacomelli, I., Fredrikson, M., & Jha, S. (2018). Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting. IEEE 31st Computer Security Foundations Symposium (CSF). https://doi.org/10.1109/CSF.2018.00027
- European Parliament. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689
- NIST. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. https://doi.org/10.6028/NIST.AI.100-1