Privacy & Data Protection Attacks
Overview of privacy attacks against AI systems including data extraction, membership inference, and model inversion, with regulatory implications and red team methodology.
Privacy & Data Protection Attacks
Large language models memorize fragments of their training data. This memorization is not a bug -- it is a side effect of the optimization process that makes models useful. For red teamers, it creates an attack surface where sensitive data can be extracted from model weights without any access to the original training set.
Privacy Attack Taxonomy
Privacy attacks against AI systems fall into three categories, each exploiting a different aspect of how models store and expose training data.
| Attack Category | What It Reveals | Access Required | Key Technique |
|---|---|---|---|
| Data Extraction | Verbatim training data, PII, secrets | Query access (black-box) | Prompt-based extraction, prefix attacks |
| Membership Inference | Whether specific data was in training set | Query access (black-box) | Shadow models, loss-based thresholding |
| Model Inversion | Reconstructed approximations of training data | Query or gradient access | Gradient-based optimization, generative inversion |
Memorization in Large Language Models
Why Models Memorize
Memorization occurs when models overfit to specific training examples rather than learning general patterns. Several factors increase memorization risk:
- Data duplication -- Content that appears multiple times in the training corpus receives proportionally more gradient updates, embedding it more deeply in weights
- Data distinctiveness -- Unusual content (formatted PII, API keys, unique identifiers) stands out from the distribution and is easier to extract
- Model capacity -- Larger models have more parameters available to store specific examples
- Training duration -- Longer training with lower learning rates allows the model to memorize increasingly rare content
- Deduplication gaps -- Most training pipelines perform imperfect deduplication, leaving near-duplicates that reinforce memorization
Measuring Memorization
Researchers quantify memorization using extractability -- the minimum prompt length needed to elicit a memorized sequence. A string that can be extracted with a 5-token prompt is far more deeply memorized than one requiring 50 tokens.
| Memorization Level | Extractability | Risk Level | Example |
|---|---|---|---|
| Deep | k < 10 tokens | Critical | Common boilerplate, repeated PII |
| Moderate | k = 10-50 tokens | High | Unique paragraphs, code snippets |
| Shallow | k = 50-200 tokens | Medium | Rare passages, unusual formats |
| Minimal | k > 200 tokens | Low | Deduplicated, low-frequency content |
Attack Methodology Overview
PII Extraction
The most direct privacy attack: craft prompts that cause the model to output memorized personal information. Techniques range from simple completion attacks ("The phone number of John Smith is") to sophisticated prefix attacks that reconstruct the training context surrounding target PII. See the dedicated PII Extraction Techniques page.
Membership Inference
Determine whether a specific data point was used to train the model. This is a binary classification problem: given a sample, was it in the training set or not? The key insight is that models behave differently on training data versus unseen data -- lower loss, higher confidence, more consistent outputs. See the Membership Inference Attacks page.
Model Inversion
Reconstruct approximations of training data from model outputs or gradients. Unlike extraction (which recovers verbatim content), inversion produces synthetic data that shares statistical properties with the training set. This is particularly concerning for image models where faces and identifying features can be reconstructed. See the Model Inversion Attacks page.
The attacker has query access only -- they can send inputs and observe outputs. This is the most common scenario for API-based models. PII extraction and membership inference are fully feasible in black-box settings. Model inversion is harder but possible with sufficient queries.
The attacker knows the model architecture and has query access but not weight access. This enables more targeted attacks: the attacker can train shadow models on similar architectures, use known tokenization to craft precise prefix attacks, and leverage architectural knowledge to optimize extraction queries.
The attacker has full access to model weights and architecture (e.g., open-source models). This enables gradient-based inversion, direct weight analysis for memorized content, and membership inference using internal activations. The strongest attacks are possible in this setting.
Regulatory Context
Privacy attacks have direct regulatory implications. When a red team demonstrates that a model leaks PII, the finding triggers compliance obligations under multiple frameworks.
| Regulation | Relevant Provisions | Implication for AI Privacy |
|---|---|---|
| GDPR (EU) | Art. 5(1)(f) integrity/confidentiality, Art. 17 right to erasure, Art. 35 DPIA | Models that leak training data violate data protection principles; right to erasure requires machine unlearning |
| CCPA/CPRA (California) | Right to deletion, right to know what data is collected | Consumers can request deletion of data used in training; memorization undermines deletion |
| EU AI Act | Art. 10 data governance, Art. 15 accuracy/robustness | High-risk AI systems must demonstrate data governance including privacy controls |
| HIPAA (US Healthcare) | Protected Health Information (PHI) safeguards | Models trained on clinical data that leak PHI create liability |
| SOC 2 | Trust Services Criteria -- Privacy | Privacy controls must prevent unauthorized disclosure through any channel, including model outputs |
Privacy-Preserving Defenses
Understanding defenses helps red teamers identify gaps and design bypass techniques.
| Defense | Mechanism | Limitations |
|---|---|---|
| Differential Privacy | Add calibrated noise during training to bound per-example influence | Reduces model utility; not retrofittable to already-trained models |
| Data deduplication | Remove exact and near-duplicate training samples | Cannot catch all variants; near-duplicate detection is imperfect |
| Machine unlearning | Remove influence of specific data points post-training | Approximate methods may leave residual memorization; verification is hard |
| Output filtering | Detect and redact PII in model outputs | Pattern-based; misses novel formats, encoded data, partial leaks |
| Retrieval guards | Separate sensitive data into controlled retrieval rather than training | Does not address data already memorized in weights |
Red Team Engagement Framework
Scope privacy-sensitive data categories
Identify what types of PII, PHI, financial data, or proprietary content might be in the training data. Review data processing agreements and training data documentation.
Assess memorization risk
Evaluate model size, training data characteristics, and deduplication practices. Larger models trained on less-deduplicated data with longer training runs have higher memorization risk.
Execute extraction attacks
Run targeted PII extraction, prefix completion attacks, and membership inference probes. Start with known training data samples if available.
Test defense effectiveness
Attempt to bypass output filters, differential privacy guarantees, and unlearning claims. Document successful extractions and filter evasions.
Report with regulatory mapping
Map each finding to applicable regulations. Quantify exposure: how many records are extractable, what categories of PII, and what breach notification obligations apply.
Related Topics
- PII Extraction Techniques -- Detailed extraction methods and real-world examples
- Membership Inference Attacks -- Determining training data membership
- Model Inversion Attacks -- Reconstructing training data from model outputs
- RAG Pipeline Exploitation -- RAG attacks that complement privacy exploitation
- Training Attacks -- Training-time attacks including data poisoning
A red team discovers that a production LLM can output verbatim email addresses when prompted with specific prefixes. The model is deployed in the EU and processes customer data. Which regulatory obligation is MOST immediately triggered?
References
- Extracting Training Data from Large Language Models (Carlini et al., 2021) -- Foundational work on training data extraction
- Scalable Extraction of Training Data from Production Language Models (Nasr et al., 2023) -- Large-scale extraction from ChatGPT
- OWASP Top 10 for LLM Applications -- LLM06: Sensitive Information Disclosure -- Industry standard for LLM privacy risks