Membership Inference Attacks
Techniques for determining whether specific data was used to train an AI model, including shadow model approaches, loss-based inference, LiRA, and practical implementation guidance.
Membership inference attacks answer a binary question: was this specific data point used to train this model? The attack exploits a fundamental property of machine learning -- models behave differently on data they were trained on versus data they have never seen.
Why Membership Inference Works
Models treat training data differently from unseen data. During training, the optimizer minimizes loss on the training set, causing the model to assign higher confidence, lower loss, and more consistent outputs to training examples compared to non-training examples.
| Signal | Training Data (Members) | Non-Training Data (Non-Members) |
|---|---|---|
| Loss / perplexity | Lower | Higher |
| Prediction confidence | Higher, sharper distribution | Lower, flatter distribution |
| Output consistency | More consistent across temperatures | More variable |
| Gradient norm | Smaller (already near optimum) | Larger |
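The gradient-norm signal in the table can be demonstrated without any LLM at all. The sketch below is a toy illustration in plain PyTorch: it overfits a linear model to a single "member" point, then compares the per-sample gradient norm on that point against an unseen point. All names (`gradient_norm`, the toy data) are illustrative, not part of a standard attack library.

```python
import torch
import torch.nn.functional as F

def gradient_norm(model, loss_fn, x, y):
    """L2 norm of the per-sample gradient: smaller for members
    (the model already sits near a loss minimum for them)."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    grads = [p.grad.flatten() for p in model.parameters() if p.grad is not None]
    return torch.cat(grads).norm().item()

# Toy demonstration: fit a linear model to one "member" point.
torch.manual_seed(0)
model = torch.nn.Linear(2, 1)
member_x, member_y = torch.tensor([[1.0, 2.0]]), torch.tensor([[3.0]])
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(500):
    opt.zero_grad()
    F.mse_loss(model(member_x), member_y).backward()
    opt.step()

non_member_x, non_member_y = torch.tensor([[-4.0, 0.5]]), torch.tensor([[7.0]])
g_member = gradient_norm(model, F.mse_loss, member_x, member_y)
g_non_member = gradient_norm(model, F.mse_loss, non_member_x, non_member_y)
```

After training, `g_member` is near zero while `g_non_member` is not; real attacks apply the same comparison (with calibration) to much larger models, which requires white-box access.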
Attack Techniques
Loss-Based Inference (Threshold Attack)
The simplest membership inference: compute the model's loss on the target sample and compare it to a threshold. Training data typically has lower loss.
```python
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model, tokenizer, text):
    """Compute sequence perplexity (exp of mean token loss) as a membership signal."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

def membership_inference_loss(model, tokenizer, target_text, threshold):
    """Simple threshold-based membership inference."""
    ppl = compute_perplexity(model, tokenizer, target_text)
    is_member = ppl < threshold  # Lower perplexity = likely member
    return {
        "text": target_text[:80],
        "perplexity": ppl,
        "predicted_member": is_member,
        "confidence": abs(ppl - threshold) / threshold,
    }

# Calibrate threshold on known members/non-members
model = AutoModelForCausalLM.from_pretrained("target-model")
tokenizer = AutoTokenizer.from_pretrained("target-model")

# Known training data samples (if available) for calibration
known_members_ppl = [compute_perplexity(model, tokenizer, t) for t in known_members]
known_nonmembers_ppl = [compute_perplexity(model, tokenizer, t) for t in known_nonmembers]
threshold = (np.mean(known_members_ppl) + np.mean(known_nonmembers_ppl)) / 2
```

Shadow Model Technique
The shadow model technique builds a classifier that distinguishes members from non-members by training on observable differences.
Train shadow models
Train multiple models (shadow models) on datasets where you control the membership. Use a similar architecture and training procedure to the target model. Each shadow model gives you labeled examples of member/non-member behavior.
Extract behavioral features
For each training and non-training sample, record the shadow model's behavior: output probabilities, loss, entropy, top-k predictions. These features form your training set for the attack classifier.
Train the attack classifier
Train a binary classifier (logistic regression, MLP, or random forest) on the extracted features. The classifier learns the statistical signature that distinguishes members from non-members.
Apply to target model
Extract the same behavioral features from the target model for the samples you want to classify. Run them through your trained attack classifier.
```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(model, tokenizer, text):
    """Extract behavioral features from a model's response to a text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    logits = outputs.logits
    probs = torch.softmax(logits, dim=-1)
    top_probs = probs.max(dim=-1).values.mean().item()
    entropy = -(probs * probs.log()).sum(dim=-1).mean().item()
    loss = outputs.loss.item()
    return [loss, top_probs, entropy]

# Phase 1: Train shadow models and collect features
shadow_features, shadow_labels = [], []
for shadow_model in shadow_models:
    for text in shadow_training_data:
        shadow_features.append(extract_features(shadow_model, tokenizer, text))
        shadow_labels.append(1)  # Member
    for text in shadow_nontraining_data:
        shadow_features.append(extract_features(shadow_model, tokenizer, text))
        shadow_labels.append(0)  # Non-member

# Phase 2: Train attack classifier
attack_clf = LogisticRegression()
attack_clf.fit(shadow_features, shadow_labels)

# Phase 3: Classify target samples
target_features = [extract_features(target_model, tokenizer, t) for t in target_texts]
predictions = attack_clf.predict_proba(target_features)[:, 1]
```

Likelihood Ratio Attack (LiRA)
LiRA (Carlini et al., 2022) is the current state-of-the-art. Instead of learning a classifier, it directly estimates the likelihood ratio: how much more likely is the observed model behavior if the sample was in the training data versus if it was not?
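The scoring step can be sketched in a few lines. This is a minimal parametric version, assuming you have already collected the target sample's loss under reference models trained with it (`in_losses`) and without it (`out_losses`); the Gaussian log-density is written out by hand so the example stays self-contained. Function names here are illustrative.

```python
import math
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log-density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def lira_score(target_loss, in_losses, out_losses):
    """Log-likelihood ratio for "target sample was a training member".
    Fit a Gaussian to the loss under each hypothesis; positive scores
    mean the observed loss is better explained by the IN models."""
    mu_in, sigma_in = np.mean(in_losses), np.std(in_losses) + 1e-8
    mu_out, sigma_out = np.mean(out_losses), np.std(out_losses) + 1e-8
    return (gaussian_logpdf(target_loss, mu_in, sigma_in)
            - gaussian_logpdf(target_loss, mu_out, sigma_out))
```

The expensive part of LiRA is not this arithmetic but producing `in_losses` and `out_losses`, which requires training many reference models; the table below reflects that cost.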
| Technique | Accuracy | Cost | Requirements |
|---|---|---|---|
| Loss threshold | ~60-70% | Very low | Query access only |
| Shadow models | ~70-85% | Medium | Must train shadow models |
| LiRA | ~80-95% | High | Must train many reference models |
| Gradient-based | ~75-90% | Medium | White-box access required |
Language Model Considerations
Membership inference against LLMs differs from classification models in several ways:
Token-Level vs. Sequence-Level
LLMs produce per-token probabilities, giving the attacker fine-grained membership signals. A sequence might be partially memorized -- the first half is training data while the second half is novel. Token-level analysis can detect this.
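Token-level scoring can be sketched directly against the model's logits. The function below is written over raw tensors so it works with any causal LM; `logits` and `input_ids` are assumed to come from a forward pass like the ones in the earlier examples.

```python
import torch

def per_token_logprobs(logits, input_ids):
    """Log-probability the model assigned to each actual next token.
    logits: [1, T, V] from a causal LM; input_ids: [1, T].
    A run of near-zero values (probability ~1) inside an otherwise
    ordinary sequence suggests a partially memorized span."""
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # position i predicts token i+1
    targets = input_ids[:, 1:].unsqueeze(-1)
    return log_probs.gather(-1, targets).squeeze(-1)[0]
```

Plotting these values along the sequence, rather than averaging them into one perplexity, is what lets you localize the memorized half.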
Prompt Sensitivity
The membership signal can be amplified or suppressed by the prompt context. Providing more of the target text as a prefix generally strengthens the signal because the model's confidence on the continuation is more informative.
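One way to exploit this is to score only the continuation while conditioning on a prefix. The sketch below follows the style of `compute_perplexity` above and relies on the Hugging Face convention that label positions set to -100 are excluded from the loss; the function names are illustrative.

```python
import torch

def continuation_labels(prefix_ids, full_ids):
    """Labels tensor that scores only the continuation: prefix token
    positions are set to -100, which HF causal-LM losses ignore."""
    labels = full_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100
    return labels

def prefix_conditioned_ppl(model, tokenizer, prefix, continuation):
    """Perplexity of `continuation` given `prefix` as context."""
    prefix_ids = tokenizer(prefix, return_tensors="pt")["input_ids"]
    full_ids = tokenizer(prefix + continuation, return_tensors="pt")["input_ids"]
    labels = continuation_labels(prefix_ids, full_ids)
    with torch.no_grad():
        loss = model(input_ids=full_ids, labels=labels).loss
    return torch.exp(loss).item()
```

Sweeping the prefix length and watching the continuation perplexity drop is a cheap way to probe how much of a document the model has memorized.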
Calibration with Reference Models
For LLMs, a practical approach is comparing the target model's perplexity on a sample against a reference model's perplexity on the same sample. If the target model is significantly more confident than the reference, the sample is likely a training member.
```python
def calibrated_membership_score(target_model, ref_model, tokenizer, text):
    """Compare target model confidence against a reference model."""
    target_ppl = compute_perplexity(target_model, tokenizer, text)
    ref_ppl = compute_perplexity(ref_model, tokenizer, text)
    # Ratio: if target_ppl << ref_ppl, the target model is
    # unusually confident on this sample (suggesting membership)
    ratio = target_ppl / ref_ppl
    return {
        "text": text[:80],
        "target_ppl": target_ppl,
        "reference_ppl": ref_ppl,
        "ratio": ratio,
        "likely_member": ratio < 0.7,  # Calibrate this threshold
    }
```

Evaluating Attack Effectiveness
Metrics
| Metric | What It Measures | Why It Matters |
|---|---|---|
| AUC-ROC | Overall discrimination ability | General attack quality across all thresholds |
| TPR at low FPR | True positive rate when false positives are rare (e.g., FPR = 0.1%) | Critical for practical attacks: you need high confidence to claim membership |
| Precision at recall | How many "member" predictions are correct at a given recall | Determines whether findings are actionable for reporting |
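Both headline metrics can be computed with a few lines of NumPy. This is a minimal sketch (no plotting, no smoothing) that takes membership scores where higher means "more likely member"; in practice you would use a library ROC implementation, but writing it out makes the TPR-at-low-FPR definition explicit.

```python
import numpy as np

def evaluate_attack(scores, labels, target_fpr=0.001):
    """AUC-ROC plus TPR at a fixed low FPR for membership scores."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # AUC = probability a random member outscores a random non-member
    auc = (np.mean(pos[:, None] > neg[None, :])
           + 0.5 * np.mean(pos[:, None] == neg[None, :]))
    # Sweep thresholds; keep the best TPR whose FPR stays at/below target
    best_tpr = 0.0
    for t in np.unique(scores):
        if np.mean(neg >= t) <= target_fpr:
            best_tpr = max(best_tpr, float(np.mean(pos >= t)))
    return {"auc": float(auc), "tpr_at_low_fpr": best_tpr}
```

Note that with small evaluation sets an FPR of 0.1% is below measurement resolution; you need at least ~1,000 non-members before TPR at FPR = 0.1% is meaningful.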
Privacy Implications
Successful membership inference has concrete privacy consequences:
- Medical datasets -- Confirming that an individual's health record was used to train a diagnostic model reveals their patient status at a particular institution
- Financial data -- Confirming transaction data membership can reveal banking relationships
- Legal discovery -- Membership inference can prove a company used data it claimed to have deleted
- Regulatory compliance -- Under GDPR, an individual can ask whether their data was used for training; membership inference provides the technical answer
Related Topics
- Privacy & Data Protection Attacks -- Overview and regulatory context
- PII Extraction Techniques -- Complementary technique for extracting memorized content
- Model Inversion Attacks -- Reconstructing training data from model outputs
- Training Attacks -- Understanding training dynamics that affect memorization
References
- Membership Inference Attacks From First Principles (Carlini et al., 2022) -- LiRA methodology
- Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017) -- Original shadow model technique
- Loss and Likelihood Based Membership Inference (Yeom et al., 2018) -- Loss-based threshold attacks