What is PII Extraction?

Techniques for extracting personally identifiable information from trained language models including prompt-based extraction, prefix attacks, targeted queries, and real-world examples.

What is Membership Inference?

Techniques for determining whether specific data was used to train an AI model, including shadow model approaches, loss-based inference, LiRA, and practical implementation guidance.

What is Model Inversion?

Techniques for reconstructing training data from model outputs including gradient-based inversion, generative model inversion, and privacy implications for image and language models.

What is Advanced Model Inversion Attacks?

Reconstructing training data from model weights and API access using gradient-based inversion, generative model-assisted reconstruction, and membership inference refinement.

Privacy & Data Protection Attacks

expert9 min readUpdated 2026-03-13

Overview of privacy attacks against AI systems including data extraction, membership inference, and model inversion, with regulatory implications and red team methodology.

privacy data-protection attacks pii

Privacy & Data Protection Attacks

Large language models memorize fragments of their training data. This memorization is not a bug -- it is a side effect of the optimization process that makes models useful. For red teamers, it creates an attack surface where sensitive data can be extracted from model weights without any access to the original training set.

Privacy Attack Taxonomy

Privacy attacks against AI systems fall into three categories, each exploiting a different aspect of how models store and expose training data.

Attack Category	What It Reveals	Access Required	Key Technique
Data Extraction	Verbatim training data, PII, secrets	Query access (black-box)	Prompt-based extraction, prefix attacks
Membership Inference	Whether specific data was in training set	Query access (black-box)	Shadow models, loss-based thresholding
Model Inversion	Reconstructed approximations of training data	Query or gradient access	Gradient-based optimization, generative inversion

Memorization in Large Language Models

Why Models Memorize

Memorization occurs when models overfit to specific training examples rather than learning general patterns. Several factors increase memorization risk:

Data duplication -- Content that appears multiple times in the training corpus receives proportionally more gradient updates, embedding it more deeply in weights
Data distinctiveness -- Unusual content (formatted PII, API keys, unique identifiers) stands out from the distribution and is easier to extract
Model capacity -- Larger models have more parameters available to store specific examples
Training duration -- Longer training with lower learning rates allows the model to memorize increasingly rare content
Deduplication gaps -- Most training pipelines perform imperfect deduplication, leaving near-duplicates that reinforce memorization

Measuring Memorization

Researchers quantify memorization using extractability -- the minimum prompt length needed to elicit a memorized sequence. A string that can be extracted with a 5-token prompt is far more deeply memorized than one requiring 50 tokens.

Memorization Level	Extractability	Risk Level	Example
Deep	k < 10 tokens	Critical	Common boilerplate, repeated PII
Moderate	k = 10-50 tokens	High	Unique paragraphs, code snippets
Shallow	k = 50-200 tokens	Medium	Rare passages, unusual formats
Minimal	k > 200 tokens	Low	Deduplicated, low-frequency content

Attack Methodology Overview

PII Extraction

The most direct privacy attack: craft prompts that cause the model to output memorized personal information. Techniques range from simple completion attacks ("The phone number of John Smith is") to sophisticated prefix attacks that reconstruct the training context surrounding target PII. See the dedicated PII Extraction Techniques page.

Membership Inference

Determine whether a specific data point was used to train the model. This is a binary classification problem: given a sample, was it in the training set or not? The key insight is that models behave differently on training data versus unseen data -- lower loss, higher confidence, more consistent outputs. See the Membership Inference Attacks page.

Model Inversion

Reconstruct approximations of training data from model outputs or gradients. Unlike extraction (which recovers verbatim content), inversion produces synthetic data that shares statistical properties with the training set. This is particularly concerning for image models where faces and identifying features can be reconstructed. See the Model Inversion Attacks page.

The attacker has query access only -- they can send inputs and observe outputs. This is the most common scenario for API-based models. PII extraction and membership inference are fully feasible in black-box settings. Model inversion is harder but possible with sufficient queries.

The attacker knows the model architecture and has query access but not weight access. This enables more targeted attacks: the attacker can train shadow models on similar architectures, use known tokenization to craft precise prefix attacks, and leverage architectural knowledge to optimize extraction queries.

The attacker has full access to model weights and architecture (e.g., open-source models). This enables gradient-based inversion, direct weight analysis for memorized content, and membership inference using internal activations. The strongest attacks are possible in this setting.

Regulatory Context

Privacy attacks have direct regulatory implications. When a red team demonstrates that a model leaks PII, the finding triggers compliance obligations under multiple frameworks.

Regulation	Relevant Provisions	Implication for AI Privacy
GDPR (EU)	Art. 5(1)(f) integrity/confidentiality, Art. 17 right to erasure, Art. 35 DPIA	Models that leak training data violate data protection principles; right to erasure requires machine unlearning
CCPA/CPRA (California)	Right to deletion, right to know what data is collected	Consumers can request deletion of data used in training; memorization undermines deletion
EU AI Act	Art. 10 data governance, Art. 15 accuracy/robustness	High-risk AI systems must demonstrate data governance including privacy controls
HIPAA (US Healthcare)	Protected Health Information (PHI) safeguards	Models trained on clinical data that leak PHI create liability
SOC 2	Trust Services Criteria -- Privacy	Privacy controls must prevent unauthorized disclosure through any channel, including model outputs

Privacy-Preserving Defenses

Understanding defenses helps red teamers identify gaps and design bypass techniques.

Defense	Mechanism	Limitations
Differential Privacy	Add calibrated noise during training to bound per-example influence	Reduces model utility; not retrofittable to already-trained models
Data deduplication	Remove exact and near-duplicate training samples	Cannot catch all variants; near-duplicate detection is imperfect
Machine unlearning	Remove influence of specific data points post-training	Approximate methods may leave residual memorization; verification is hard
Output filtering	Detect and redact PII in model outputs	Pattern-based; misses novel formats, encoded data, partial leaks
Retrieval guards	Separate sensitive data into controlled retrieval rather than training	Does not address data already memorized in weights

Red Team Engagement Framework

Scope privacy-sensitive data categories
Identify what types of PII, PHI, financial data, or proprietary content might be in the training data. Review data processing agreements and training data documentation.
Assess memorization risk
Evaluate model size, training data characteristics, and deduplication practices. Larger models trained on less-deduplicated data with longer training runs have higher memorization risk.
Execute extraction attacks
Run targeted PII extraction, prefix completion attacks, and membership inference probes. Start with known training data samples if available.
Test defense effectiveness
Attempt to bypass output filters, differential privacy guarantees, and unlearning claims. Document successful extractions and filter evasions.
Report with regulatory mapping
Map each finding to applicable regulations. Quantify exposure: how many records are extractable, what categories of PII, and what breach notification obligations apply.

PII Extraction Techniques -- Detailed extraction methods and real-world examples
Membership Inference Attacks -- Determining training data membership
Model Inversion Attacks -- Reconstructing training data from model outputs
RAG Pipeline Exploitation -- RAG attacks that complement privacy exploitation
Training Attacks -- Training-time attacks including data poisoning

Knowledge Check

A red team discovers that a production LLM can output verbatim email addresses when prompted with specific prefixes. The model is deployed in the EU and processes customer data. Which regulatory obligation is MOST immediately triggered?

References

Extracting Training Data from Large Language Models (Carlini et al., 2021) -- Foundational work on training data extraction
Scalable Extraction of Training Data from Production Language Models (Nasr et al., 2023) -- Large-scale extraction from ChatGPT
OWASP Top 10 for LLM Applications -- LLM06: Sensitive Information Disclosure -- Industry standard for LLM privacy risks

Privacy & Data Protection Attacks

Privacy & Data Protection Attacks

Privacy Attack Taxonomy

Memorization in Large Language Models

Why Models Memorize

Measuring Memorization

Attack Methodology Overview

PII Extraction

Membership Inference

Model Inversion

Regulatory Context

Privacy-Preserving Defenses

Red Team Engagement Framework

Scope privacy-sensitive data categories

Assess memorization risk

Execute extraction attacks

Test defense effectiveness

Report with regulatory mapping

References

Learning Path

Privacy & Data Protection Attacks

Privacy & Data Protection Attacks

Privacy Attack Taxonomy

Memorization in Large Language Models

Why Models Memorize

Measuring Memorization

Attack Methodology Overview

PII Extraction

Membership Inference

Model Inversion

Regulatory Context

Privacy-Preserving Defenses

Red Team Engagement Framework

Scope privacy-sensitive data categories

Assess memorization risk

Execute extraction attacks

Test defense effectiveness

Report with regulatory mapping

References

Learning Path

Privacy & Data Protection Attacks

Scope privacy-sensitive data categories

Assess memorization risk

Execute extraction attacks

Test defense effectiveness

Report with regulatory mapping

Learning Path

Related articles

Privacy & Data Protection Attacks

Scope privacy-sensitive data categories

Assess memorization risk

Execute extraction attacks

Test defense effectiveness

Report with regulatory mapping

Learning Path

Related articles