Advanced Model Inversion Attacks
Reconstructing training data from model weights and API access using gradient-based inversion, generative model-assisted reconstruction, and membership inference refinement.
Advanced Model Inversion Attacks
Advanced model inversion goes beyond the basic Fredrikson-style attacks that optimize a single input to maximize a class probability. Modern techniques leverage gradient information from model weights, use generative model priors for realistic reconstruction, and chain multiple attack primitives to extract detailed training data from models accessible only through APIs.
Gradient-Based Inversion
When the attacker has access to model weights (white-box access), gradient-based inversion directly optimizes an input to match the gradient information that a training sample would produce.
Inverting Gradients (Federated Learning Context)
In federated learning, clients share gradient updates rather than raw data. Gradient inversion attacks reconstruct the training data from these shared gradients:
```python
import torch
import torch.nn as nn

def invert_gradient(model, shared_gradient, input_shape,
                    label=None, iterations=5000, lr=0.1,
                    tv_weight=1e-4):
    """
    Reconstruct training data from a shared gradient update.

    model: model architecture
    shared_gradient: gradient update shared by the client
    input_shape: shape of the input to reconstruct
    label: if known, the label of the training sample
    """
    # Initialize a random input; if the label is unknown, optimize it as well
    dummy_input = torch.randn(input_shape, requires_grad=True)
    if label is None:
        dummy_label = torch.randn(model.num_classes, requires_grad=True)
        optimizer = torch.optim.LBFGS([dummy_input, dummy_label], lr=lr)
    else:
        dummy_label = label
        optimizer = torch.optim.LBFGS([dummy_input], lr=lr)

    for iteration in range(iterations):
        def closure():
            optimizer.zero_grad()
            # Forward pass with the dummy input
            dummy_output = model(dummy_input)
            if isinstance(dummy_label, torch.Tensor) and dummy_label.requires_grad:
                dummy_loss = nn.functional.cross_entropy(
                    dummy_output, torch.softmax(dummy_label, dim=0).unsqueeze(0)
                )
            else:
                dummy_loss = nn.functional.cross_entropy(
                    dummy_output.unsqueeze(0),
                    torch.tensor([dummy_label])
                )
            # Gradient the dummy input would produce
            dummy_gradient = torch.autograd.grad(
                dummy_loss, model.parameters(), create_graph=True
            )
            # Match the dummy gradient to the shared gradient
            grad_diff = 0
            for dg, sg in zip(dummy_gradient, shared_gradient):
                grad_diff += (dg - sg).pow(2).sum()
            # Total variation regularization for image quality
            if len(input_shape) == 4:  # Image input
                total_loss = grad_diff + tv_weight * total_variation(dummy_input)
            else:
                total_loss = grad_diff
            total_loss.backward()
            return total_loss

        optimizer.step(closure)
        # Clamp to the valid pixel range
        dummy_input.data = torch.clamp(dummy_input.data, 0, 1)
    return dummy_input.detach()

def total_variation(images):
    """Total variation loss for image smoothness."""
    diff_h = images[:, :, 1:, :] - images[:, :, :-1, :]
    diff_w = images[:, :, :, 1:] - images[:, :, :, :-1]
    return diff_h.abs().sum() + diff_w.abs().sum()
```
Batch Gradient Inversion
Real federated learning scenarios involve batch gradients (averaged over multiple samples), making reconstruction harder:
```python
def batch_gradient_inversion(model, shared_gradient, batch_size,
                             input_shape, iterations=10000):
    """
    Reconstruct multiple training samples from a batch gradient.
    Significantly harder than single-sample inversion.
    """
    # Initialize a batch of dummy inputs with random labels
    dummy_batch = torch.randn(
        (batch_size, *input_shape[1:]), requires_grad=True
    )
    dummy_labels = torch.randint(
        0, model.num_classes, (batch_size,)
    )
    optimizer = torch.optim.Adam([dummy_batch], lr=0.001)
    for iteration in range(iterations):
        optimizer.zero_grad()
        # Forward pass with the dummy batch
        dummy_outputs = model(dummy_batch)
        dummy_loss = nn.functional.cross_entropy(
            dummy_outputs, dummy_labels
        )
        # Gradient the dummy batch would produce
        dummy_gradient = torch.autograd.grad(
            dummy_loss, model.parameters(), create_graph=True
        )
        # Gradient matching loss
        grad_diff = sum(
            (dg - sg).pow(2).sum()
            for dg, sg in zip(dummy_gradient, shared_gradient)
        )
        # Cosine similarity component (improves convergence)
        cosine_loss = 1 - sum(
            nn.functional.cosine_similarity(
                dg.flatten().unsqueeze(0),
                sg.flatten().unsqueeze(0)
            )
            for dg, sg in zip(dummy_gradient, shared_gradient)
        )
        total_loss = grad_diff + 0.1 * cosine_loss
        total_loss.backward()
        optimizer.step()
        dummy_batch.data = torch.clamp(dummy_batch.data, 0, 1)
    return dummy_batch.detach()
```
Factors Affecting Gradient Inversion Quality
| Factor | Better Reconstruction | Worse Reconstruction |
|---|---|---|
| Batch size | Small (1-8) | Large (64+) |
| Image resolution | Low (32x32) | High (224x224+) |
| Model depth | Shallow networks | Very deep networks |
| Training stage | Early training (large gradients) | Late training (small gradients) |
| Gradient precision | Full precision (FP32) | Quantized or compressed |
| Regularization | Total variation + BN matching | No regularization |
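The gradient-precision row above can be sketched concretely: uniformly quantizing a shared gradient before transmission destroys exactly the fine-grained signal that inversion relies on. The `quantize_gradient` helper below is illustrative, not part of any federated learning framework.

```python
import torch

def quantize_gradient(grad: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Uniformly quantize a gradient tensor to the given bit width."""
    levels = 2 ** bits - 1
    g_min, g_max = grad.min(), grad.max()
    scale = (g_max - g_min) / levels if g_max > g_min else torch.tensor(1.0)
    quantized = torch.round((grad - g_min) / scale)
    return quantized * scale + g_min

grad = torch.randn(1000)
coarse = quantize_gradient(grad, bits=4)
# Quantization error grows as bit width shrinks, degrading reconstruction quality
err4 = (grad - quantize_gradient(grad, bits=4)).abs().mean()
err8 = (grad - quantize_gradient(grad, bits=8)).abs().mean()
```

Comparing `err4` and `err8` shows that 4-bit gradients carry substantially more quantization noise than 8-bit gradients, which is why compressed gradient sharing lands in the "worse reconstruction" column.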
Generative Model-Assisted Inversion
Using a generative model (GAN or diffusion model) as a prior dramatically improves reconstruction quality by constraining the search to the manifold of realistic data:
GAN-Based Inversion
```python
class GANAssistedInversion:
    """Use a pre-trained GAN as a prior for model inversion."""

    def __init__(self, target_model, generator, latent_dim=512):
        self.target_model = target_model
        self.generator = generator
        self.latent_dim = latent_dim

    def invert(self, target_class, iterations=2000, lr=0.01):
        """
        Find a latent vector z such that G(z) is classified
        as target_class with high confidence, suggesting G(z)
        resembles a training sample of that class.
        """
        z = torch.randn(1, self.latent_dim, requires_grad=True)
        optimizer = torch.optim.Adam([z], lr=lr)
        for i in range(iterations):
            optimizer.zero_grad()
            # Generate an image from the latent code
            generated = self.generator(z)
            # Classification loss: maximize confidence for the target class
            logits = self.target_model(generated)
            class_loss = -logits[0, target_class]
            # Identity loss: sharpen the prediction toward a single class
            # (discourages blurry reconstructions near the class centroid)
            identity_loss = -logits.max() + logits.logsumexp(dim=-1)
            # Prior loss: keep z close to a standard normal
            prior_loss = z.pow(2).sum() * 0.001
            total_loss = class_loss + 0.5 * identity_loss + prior_loss
            total_loss.backward()
            optimizer.step()
        return self.generator(z).detach()

    def diverse_inversion(self, target_class, num_samples=10):
        """
        Generate multiple diverse reconstructions by
        encouraging diversity in the latent space.
        """
        z_vectors = [
            torch.randn(1, self.latent_dim, requires_grad=True)
            for _ in range(num_samples)
        ]
        optimizer = torch.optim.Adam(z_vectors, lr=0.01)
        for iteration in range(2000):
            optimizer.zero_grad()
            total_loss = 0
            # Standard inversion loss for each z
            generated_images = []
            for z in z_vectors:
                img = self.generator(z)
                generated_images.append(img)
                logits = self.target_model(img)
                total_loss -= logits[0, target_class]
            # Diversity loss: penalize similar reconstructions
            for i in range(len(generated_images)):
                for j in range(i + 1, len(generated_images)):
                    similarity = nn.functional.cosine_similarity(
                        generated_images[i].flatten().unsqueeze(0),
                        generated_images[j].flatten().unsqueeze(0)
                    )
                    total_loss += similarity  # Penalize high similarity
            total_loss.backward()
            optimizer.step()
        return [self.generator(z).detach() for z in z_vectors]
```
Diffusion Model-Assisted Inversion
Diffusion models provide stronger priors than GANs for high-fidelity reconstruction:
```python
class DiffusionAssistedInversion:
    """Use a pre-trained diffusion model for model inversion."""

    def __init__(self, target_model, diffusion_model, scheduler):
        self.target_model = target_model
        self.diffusion = diffusion_model
        self.scheduler = scheduler

    def invert(self, target_class, guidance_scale=7.5,
               num_steps=50, num_candidates=4):
        """Generate reconstructions using classifier-guided diffusion."""
        candidates = []
        for _ in range(num_candidates):
            # Start from pure noise
            x = torch.randn(1, 3, 256, 256)
            for t in self.scheduler.timesteps[:num_steps]:
                with torch.enable_grad():
                    # Detach so x is a fresh leaf at every denoising step
                    x_input = x.detach().clone().requires_grad_(True)
                    # Unconditional noise prediction from the diffusion model
                    noise_pred = self.diffusion(x_input, t)
                    # Classifier guidance: steer toward the target class
                    logits = self.target_model(x_input)
                    class_score = logits[0, target_class]
                    grad = torch.autograd.grad(class_score, x_input)[0]
                # Guided noise prediction
                guided_noise = noise_pred - guidance_scale * grad
                x = self.scheduler.step(guided_noise, t, x).prev_sample
            candidates.append(x.detach())
        return candidates
```
API-Based Inversion (Black-Box)
When only API access is available, inversion must rely on confidence scores and output analysis:
Confidence-Based Reconstruction
```python
class APIModelInversion:
    """Reconstruct training data using only API confidence scores."""

    def __init__(self, api_client, generator, target_class):
        self.api = api_client
        self.generator = generator
        self.target_class = target_class

    def invert(self, iterations=1000, lr=0.01):
        """
        Optimize a generated sample to maximize API confidence.
        Uses a generator as a differentiable proxy.
        """
        z = torch.randn(1, 512)
        for i in range(iterations):
            # Generate a candidate image
            with torch.no_grad():
                candidate = self.generator(z)
            candidate_np = candidate.squeeze().permute(1, 2, 0).numpy()
            # Query the API for the target-class confidence
            api_response = self.api.classify(candidate_np)
            target_confidence = api_response["probabilities"][self.target_class]
            # Estimate the gradient via finite differences
            grad = self.estimate_gradient(z, target_confidence)
            # Gradient ascent on the confidence score
            z += lr * grad
        return self.generator(z).detach()

    def estimate_gradient(self, z, base_score, epsilon=0.01):
        """Estimate the gradient via finite differences (one API call per latent dimension)."""
        grad = torch.zeros_like(z)
        for i in range(z.shape[1]):
            z_plus = z.clone()
            z_plus[0, i] += epsilon
            with torch.no_grad():
                candidate = self.generator(z_plus)
            candidate_np = candidate.squeeze().permute(1, 2, 0).numpy()
            response = self.api.classify(candidate_np)
            score_plus = response["probabilities"][self.target_class]
            grad[0, i] = (score_plus - base_score) / epsilon
        return grad
```
Text Model Inversion
Inverting language models to extract training text requires different techniques:
Token-Level Extraction
```python
def extract_training_text(model, tokenizer, prefix, max_tokens=200,
                          temperature=0.1, num_samples=50):
    """
    Attempt to extract memorized training text by prompting
    with known prefixes and sampling with low temperature.

    Low temperature (near-deterministic) sampling is more likely
    to reproduce memorized sequences.
    """
    extractions = []
    for _ in range(num_samples):
        input_ids = tokenizer.encode(prefix, return_tensors="pt")
        with torch.no_grad():
            outputs = model.generate(
                input_ids,
                max_new_tokens=max_tokens,
                temperature=temperature,
                do_sample=True,
                top_k=10
            )
        generated_text = tokenizer.decode(
            outputs[0][input_ids.shape[1]:],
            skip_special_tokens=True
        )
        extractions.append(generated_text)
    # Find common subsequences across samples
    # (memorized text will appear consistently)
    return find_consensus_sequences(extractions)
```
Membership Inference-Guided Inversion
Combine membership inference with generative sampling to focus reconstruction on actual training data:
```python
def membership_guided_inversion(target_model, generator,
                                membership_oracle,
                                target_class, budget=1000):
    """
    Generate candidates and use membership inference to
    identify which reconstructions most closely resemble
    actual training data.
    """
    candidates = []
    for _ in range(budget):
        # Generate a random reconstruction
        z = torch.randn(1, 512)
        with torch.no_grad():
            candidate = generator(z)
        # Score the candidate with the membership oracle
        member_score = membership_oracle.score(
            target_model, candidate, target_class
        )
        candidates.append({
            "image": candidate,
            "z": z,
            "membership_score": member_score
        })
    # Sort by membership score -- the highest scores are most likely
    # to resemble actual training data
    candidates.sort(key=lambda x: x["membership_score"], reverse=True)
    return candidates[:10]  # Return top candidates
```
Defense Mechanisms
Model Confidence Masking
Reduce the information available to attackers by rounding or thresholding confidence scores:
```python
def mask_confidence(probabilities, precision=2):
    """Round probabilities to reduce information leakage."""
    rounded = [round(p, precision) for p in probabilities]
    # Renormalize (guard against every value rounding to zero)
    total = sum(rounded)
    if total == 0:
        return [1.0 / len(probabilities)] * len(probabilities)
    return [r / total for r in rounded]
```
Differential Privacy Training
DP-SGD limits the influence of individual training samples on the learned weights, directly bounding the quality any inversion attack can achieve.
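A minimal sketch of a single DP-SGD step, assuming a small classifier and using per-sample gradient clipping followed by Gaussian noise; `clip_norm` and `noise_multiplier` are illustrative hyperparameters, not recommended settings:

```python
import torch
import torch.nn as nn

def dp_sgd_step(model, batch_x, batch_y, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD step: clip each per-sample gradient, then add Gaussian noise."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):
        loss = nn.functional.cross_entropy(
            model(x.unsqueeze(0)), y.unsqueeze(0)
        )
        grads = torch.autograd.grad(loss, params)
        # Clip the per-sample gradient to bound any one example's influence
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (norm.item() + 1e-12))
        for s, g in zip(summed, grads):
            s += g * scale
    n = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            # Gaussian noise calibrated to the clipping norm
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p -= lr * (s + noise) / n

model = nn.Linear(4, 2)
dp_sgd_step(model, torch.randn(8, 4), torch.randint(0, 2, (8,)))
```

Because every per-sample gradient is clipped before averaging, no single training example can dominate the shared update, which is precisely the signal gradient inversion exploits.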
Output Perturbation
Add calibrated noise to model outputs to prevent precise confidence-based reconstruction while maintaining utility.
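A minimal sketch of output perturbation, assuming Laplace noise added to the probability vector followed by clipping and renormalization; the noise scale here is an illustrative parameter, not a calibrated privacy budget:

```python
import numpy as np

def perturb_output(probabilities, scale=0.05, seed=None):
    """Add Laplace noise to a probability vector, then clip and renormalize."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(probabilities, dtype=float) + rng.laplace(0.0, scale, len(probabilities))
    noisy = np.clip(noisy, 1e-6, None)  # keep every entry positive
    return noisy / noisy.sum()

probs = perturb_output([0.7, 0.2, 0.1], scale=0.05, seed=0)
```

The returned vector still sums to one and preserves the rough ranking of classes, but the exact confidence values an attacker would need for finite-difference gradient estimation are obscured.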
Related Topics
- Model Inversion — Basic model inversion concepts
- PII Extraction — Extracting personal information from models
- Federated Learning Attacks — Attacks on federated learning, including gradient inversion
References
- Zhu et al., "Deep Leakage from Gradients" (2019)
- Geiping et al., "Inverting Gradients -- How Easy Is It to Break Privacy in Federated Learning?" (2020)
- Zhang et al., "The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks" (2020)
- Carlini et al., "Extracting Training Data from Large Language Models" (2021)
- Fredrikson et al., "Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures" (2015)