Membership Inference via Embeddings
Determining whether specific data was in an embedding model's training set through distance-based inference, statistical tests, and embedding-behavior analysis.
Membership inference attacks determine whether a specific piece of data was included in an embedding model's training set. This is a privacy-relevant question: it can reveal whether an organization's proprietary data was used to train a commercial model without consent, whether an individual's personal data is present in a system, or whether a dataset was used in violation of licensing terms.
Why Membership Inference Works
Embedding models, like all machine learning models, behave differently on data they have seen during training compared to data they have not. This behavioral difference creates a signal that can be detected through careful analysis.
The Overfitting Signal
Models tend to overfit slightly to their training data, producing more "confident" representations for training examples. For embedding models, this manifests as:
- Lower reconstruction loss — The model encodes training data more efficiently
- Higher self-similarity — Embeddings of training data are more consistent across perturbations
- Cluster proximity — Training examples embed closer to cluster centers than non-training examples (see the sketch after this list)
- Dimensional concentration — Training data produces embeddings that are more concentrated in certain dimensions
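The cluster-proximity signal can be estimated with a simple sketch like the one below; the reference corpus, the scikit-learn KMeans fit, and the assumption that the model exposes an `encode` method are all illustrative choices, and the score only means something once calibrated against known non-member texts.

```python
# Minimal sketch of the cluster-proximity signal: distance from a candidate
# embedding to the nearest cluster center of a reference corpus. Lower values
# are weak evidence of membership and must be calibrated on known non-members.
import numpy as np
from sklearn.cluster import KMeans

def cluster_proximity_score(text, model, reference_corpus, n_clusters=10):
    """Distance from the candidate embedding to its nearest cluster center."""
    reference_embeddings = model.encode(reference_corpus)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(reference_embeddings)

    embedding = model.encode(text)
    distances = np.linalg.norm(kmeans.cluster_centers_ - embedding, axis=1)
    return float(distances.min())
```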
The Memorization Signal
Large models memorize portions of their training data. For embedding models, memorized data produces embeddings with distinctive properties:
- More extreme values in specific dimensions (see the sketch after this list)
- Higher cosine similarity to embeddings of related training examples
- More predictable embedding behavior under input perturbation
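A quick way to probe the first of these properties is sketched below; the reference corpus and the z-score cutoff are assumptions, and the resulting count is only meaningful relative to counts measured on known non-member texts.

```python
# Rough sketch: count dimensions of a candidate embedding that are extreme
# relative to dimension-wise statistics of a reference (non-member) corpus.
import numpy as np

def extreme_dimension_count(text, model, reference_corpus, z_cutoff=3.0):
    """Number of dimensions more than z_cutoff standard deviations from the
    reference mean; memorized texts may produce more such dimensions."""
    reference = np.asarray(model.encode(reference_corpus))
    mean = reference.mean(axis=0)
    std = reference.std(axis=0) + 1e-8  # avoid division by zero

    embedding = np.asarray(model.encode(text))
    z_scores = np.abs((embedding - mean) / std)
    return int((z_scores > z_cutoff).sum())
```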
Distance-Based Inference
The simplest membership inference approach compares the embedding distance between the candidate text and known reference points.
Self-Similarity Test
Embed the candidate text and slight perturbations of it. Training data tends to have more consistent (self-similar) embeddings across perturbations:
```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def self_similarity_test(text, model, num_perturbations=50):
    """Test if text was in training data via self-similarity."""
    original_embedding = model.encode(text)

    perturbation_similarities = []
    for _ in range(num_perturbations):
        # Create a slight perturbation (typo, word swap, synonym)
        perturbed = create_perturbation(text)
        perturbed_embedding = model.encode(perturbed)

        similarity = cosine_similarity(original_embedding, perturbed_embedding)
        perturbation_similarities.append(similarity)

    # Training data tends to have higher average similarity
    # (more robust to perturbations)
    avg_similarity = np.mean(perturbation_similarities)
    std_similarity = np.std(perturbation_similarities)

    return {
        'mean_similarity': avg_similarity,
        'std_similarity': std_similarity,
        'likely_member': avg_similarity > THRESHOLD,  # calibrate THRESHOLD on known data
    }
```
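The create_perturbation helper above is left undefined; a minimal, illustrative version might apply small character- or word-level edits, though real attacks often use synonym substitution or paraphrasing models instead.

```python
# Illustrative create_perturbation: random character- and word-level edits.
import random

def create_perturbation(text):
    """Return a lightly perturbed copy of text (typo, dropped word, or swap)."""
    words = text.split()
    if len(words) < 2:
        return text

    edit = random.choice(['typo', 'drop_word', 'swap_words'])
    if edit == 'typo':
        i = random.randrange(len(words))
        word = words[i]
        if len(word) > 1:
            j = random.randrange(len(word) - 1)
            # Swap two adjacent characters to simulate a typo
            word = word[:j] + word[j + 1] + word[j] + word[j + 2:]
        words[i] = word
    elif edit == 'drop_word':
        del words[random.randrange(len(words))]
    else:
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]

    return ' '.join(words)
```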
Nearest-Neighbor Distance
Training data tends to be closer to its nearest neighbors in the embedding space than non-training data:
```python
def nearest_neighbor_test(text, model, reference_corpus):
    """Compare nearest-neighbor distance for member vs non-member data."""
    target_embedding = model.encode(text)
    reference_embeddings = model.encode(reference_corpus)

    # Compute distance to nearest reference point
    distances = [
        1 - cosine_similarity(target_embedding, ref)
        for ref in reference_embeddings
    ]
    min_distance = min(distances)

    # Training data tends to have smaller nearest-neighbor distances
    return min_distance
```

Statistical Tests
More rigorous approaches use statistical tests to distinguish member and non-member embeddings.
Likelihood Ratio Test
Compare the likelihood of the embedding under a "member" model and a "non-member" model:
```python
def likelihood_ratio_test(embedding, member_distribution, nonmember_distribution):
    """Compute likelihood ratio for membership inference."""
    # member_distribution / nonmember_distribution are density models fitted to
    # known member and non-member embeddings (anything exposing a log_prob method)
    member_likelihood = member_distribution.log_prob(embedding)
    nonmember_likelihood = nonmember_distribution.log_prob(embedding)

    ratio = member_likelihood - nonmember_likelihood
    return ratio  # positive = likely member
```
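The member and non-member distributions are left unspecified above. One simple choice, sketched here under the assumption that embeddings of known members and non-members are available, is a diagonal Gaussian fitted to each group; any density model exposing a log_prob method would work equally well.

```python
# Sketch: fit a diagonal Gaussian to a set of embeddings so it can serve as
# member_distribution or nonmember_distribution in likelihood_ratio_test.
import numpy as np

class DiagonalGaussian:
    """Independent per-dimension Gaussian fitted to an (n, d) embedding array."""

    def __init__(self, embeddings, eps=1e-6):
        embeddings = np.asarray(embeddings)
        self.mean = embeddings.mean(axis=0)
        self.var = embeddings.var(axis=0) + eps  # eps keeps the variance positive

    def log_prob(self, embedding):
        # Sum of per-dimension Gaussian log-densities
        return float(np.sum(
            -0.5 * (np.log(2 * np.pi * self.var)
                    + (embedding - self.mean) ** 2 / self.var)
        ))

# member_distribution = DiagonalGaussian(known_member_embeddings)
# nonmember_distribution = DiagonalGaussian(known_nonmember_embeddings)
```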
Calibrated Confidence Test
Train a classifier to distinguish member and non-member embeddings using known examples:
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_membership_classifier(known_members, known_nonmembers, model):
    """Train a classifier to predict membership."""
    member_embeddings = [model.encode(t) for t in known_members]
    nonmember_embeddings = [model.encode(t) for t in known_nonmembers]

    X = np.vstack(member_embeddings + nonmember_embeddings)
    y = [1] * len(member_embeddings) + [0] * len(nonmember_embeddings)

    # Extract features from embeddings (norm, dimension stats, etc.)
    features = extract_embedding_features(X)

    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(features, y)
    return clf

def predict_membership(text, model, classifier):
    embedding = model.encode(text)
    features = extract_embedding_features(embedding.reshape(1, -1))
    probability = classifier.predict_proba(features)[0][1]
    return probability
```

Feature Engineering for Membership Inference
The features used for classification include the following (a sketch of a matching feature extractor appears after the table):
| Feature | Description | Member Signal |
|---|---|---|
| Embedding norm | L2 norm of the vector | Members may have more consistent norms |
| Dimensional variance | Variance across dimensions | Members may show lower variance |
| Kurtosis | Peakedness of dimension distribution | Members may show higher kurtosis |
| Self-similarity | Similarity across perturbations | Members show higher self-similarity |
| Reconstruction error | Error when reconstructing via autoencoder | Members have lower reconstruction error |
| Nearest-neighbor distance | Distance to closest reference point | Members are closer to neighbors |
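The extract_embedding_features helper used by the classifier is not spelled out; a plausible sketch covering the per-vector features from the table (norm, dimensional variance, kurtosis) is shown below. Self-similarity, reconstruction error, and nearest-neighbor distance need extra inputs (perturbations, an autoencoder, a reference corpus) and are omitted here.

```python
# Sketch of extract_embedding_features: per-vector features only.
import numpy as np
from scipy.stats import kurtosis

def extract_embedding_features(embeddings):
    """Map an (n, d) array of embeddings to an (n, k) feature matrix."""
    embeddings = np.atleast_2d(np.asarray(embeddings))
    return np.column_stack([
        np.linalg.norm(embeddings, axis=1),  # embedding norm
        embeddings.var(axis=1),              # dimensional variance
        kurtosis(embeddings, axis=1),        # peakedness of the dimension distribution
        np.abs(embeddings).max(axis=1),      # most extreme dimension value
    ])
```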
Practical Scenarios
Detecting Unauthorized Data Use
An organization suspects that a commercial embedding service trained on their proprietary documents. They can test membership inference on known proprietary texts:
```python
# Test whether proprietary documents were used in training.
# membership_score can be any of the scores above (e.g. self-similarity
# or the classifier probability from predict_membership).
proprietary_texts = load_proprietary_documents()
public_texts = load_public_documents_same_domain()

# Compare membership scores
proprietary_scores = [membership_score(t, target_model) for t in proprietary_texts]
public_scores = [membership_score(t, target_model) for t in public_texts]

# Statistical test: are proprietary scores significantly higher?
from scipy.stats import mannwhitneyu
statistic, p_value = mannwhitneyu(proprietary_scores, public_scores, alternative='greater')
print(f"p-value: {p_value}")  # low p-value suggests proprietary data was used
```

Individual Data Presence
Determining whether a specific individual's data is present in an AI system has implications for data subject access requests under GDPR:
```python
# Test if a specific individual's emails are in the training data
individual_texts = [
    "Meeting with Sarah regarding Q3 projections",
    "Please find attached the revised contract for Project Atlas",
    # ... more known texts from the individual
]

for text in individual_texts:
    score = membership_score(text, target_model)
    print(f"Membership score: {score:.4f} - {text[:50]}...")
```

Training Data Auditing
Model providers can use membership inference as an auditing tool to verify that their training pipeline correctly excluded opted-out or restricted data.
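A minimal version of such an audit, sketched here with placeholder names (audit_opt_out, membership_score, and the document sets are assumptions), compares membership scores for opted-out documents against a matched control set that was never eligible for training:

```python
# Hypothetical auditing sketch: flag the pipeline if opted-out documents show
# a stronger membership signal than a matched control set.
from scipy.stats import mannwhitneyu

def audit_opt_out(opted_out_docs, control_docs, model, alpha=0.05):
    """Return an audit verdict on whether opt-out exclusion appears to have failed."""
    opt_out_scores = [membership_score(t, model) for t in opted_out_docs]
    control_scores = [membership_score(t, model) for t in control_docs]

    # One-sided test: are opted-out scores higher than the control scores?
    _, p_value = mannwhitneyu(opt_out_scores, control_scores, alternative='greater')

    return {
        'p_value': p_value,
        'exclusion_suspect': p_value < alpha,  # low p-value = exclusion may have failed
    }
```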
Limitations and Caveats
Membership inference on embedding models has several important limitations:
False positive rate. Texts that are very similar to training data (same domain, similar phrasing) may score as members even if they were not in the training set. The technique detects similarity to training data, not exact membership.
Model size dependency. Larger models memorize more and are more susceptible to membership inference. Smaller models may not show a detectable membership signal.
Training data diversity. Models trained on highly diverse data are harder to probe because the membership signal is weaker when no single example is seen many times.
Post-training processing. Quantization, distillation, and fine-tuning can alter the membership signal, reducing the accuracy of inference.
Related Topics
- Inversion Attacks — Recovering content rather than just detecting presence
- Adversarial Embeddings — Manipulating embeddings rather than analyzing them
- Training Pipeline Attacks — Broader context of training data security