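Advanced Model Inversion Attacks
Advanced · 6 min read · Updated 2026-03-15
Reconstruct training data from model weights and API access using gradient-based inversion, generative-model-assisted reconstruction, and membership-inference refinement.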
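Advanced model inversion goes beyond the basic Fredrikson-style attack, which optimizes a single input to maximize a class probability. Modern techniques exploit gradient information from model weights, use generative-model priors to produce realistic reconstructions, and chain multiple attack primitives to extract detailed training data from models that are reachable only through an API.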
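Gradient-based inversion
When an attacker has access to the model weights (white-box access), gradient-based inversion directly optimizes an input so that it reproduces the gradient information a training sample would generate.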
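Written out, the gradient-matching objective looks roughly like the following. The notation is introduced here for illustration and is not taken from a specific paper: $x'$ and $y'$ are the dummy input and label being optimized, $g_k$ is the observed gradient for the $k$-th parameter tensor $\theta_k$, $\mathcal{L}$ is the training loss, $f_\theta$ is the model, and $\lambda\,\mathrm{TV}(x')$ is an optional total-variation smoothness prior for image inputs.

$$
(\hat{x}, \hat{y}) = \arg\min_{x',\, y'} \; \sum_{k} \left\lVert \nabla_{\theta_k} \mathcal{L}\big(f_\theta(x'),\, y'\big) - g_k \right\rVert_2^2 \; + \; \lambda\, \mathrm{TV}(x')
$$

When the label is already known, $y'$ is fixed and only $x'$ is optimized.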
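Inverting gradients (federated learning context)
In federated learning, clients share gradient updates rather than raw data. A gradient inversion attack reconstructs the training data from those shared gradients: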
```python
import torch
import torch.nn as nn


def total_variation(images):
    """Total-variation loss that rewards smooth images (expects N x C x H x W)."""
    diff_h = images[:, :, 1:, :] - images[:, :, :-1, :]
    diff_w = images[:, :, :, 1:] - images[:, :, :, :-1]
    return diff_h.abs().sum() + diff_w.abs().sum()


def invert_gradient(model, shared_gradient, input_shape,
                    label=None, iterations=5000, lr=0.1,
                    tv_weight=1e-4):
    """
    Reconstruct training data from a shared gradient update.

    model: the model architecture (assumed to expose `num_classes`)
    shared_gradient: gradient update shared by the client, one tensor per parameter
    input_shape: shape of the input to reconstruct, including the batch dimension
    label: label of the training sample, if known
    """
    params = list(model.parameters())
    dummy_input = torch.randn(input_shape, requires_grad=True)
    if label is None:
        # Unknown label: optimize soft-label logits jointly with the input.
        dummy_label = torch.randn(1, model.num_classes, requires_grad=True)
        optimizer = torch.optim.LBFGS([dummy_input, dummy_label], lr=lr)
    else:
        dummy_label = torch.as_tensor(label, dtype=torch.long).view(-1)
        optimizer = torch.optim.LBFGS([dummy_input], lr=lr)

    def closure():
        optimizer.zero_grad()
        dummy_output = model(dummy_input)
        if label is None:
            # cross_entropy accepts class probabilities as targets (PyTorch >= 1.10).
            target = torch.softmax(dummy_label, dim=-1)
        else:
            target = dummy_label
        dummy_loss = nn.functional.cross_entropy(dummy_output, target)
        # create_graph=True lets us differentiate through the gradient itself.
        dummy_gradient = torch.autograd.grad(dummy_loss, params, create_graph=True)
        grad_diff = sum(
            (dg - sg).pow(2).sum()
            for dg, sg in zip(dummy_gradient, shared_gradient)
        )
        total_loss = grad_diff
        if len(input_shape) == 4:  # image input
            total_loss = total_loss + tv_weight * total_variation(dummy_input)
        total_loss.backward()
        return total_loss

    for _ in range(iterations):
        optimizer.step(closure)
        with torch.no_grad():
            dummy_input.clamp_(0, 1)  # keep pixel values in the valid range
    return dummy_input.detach()
```
Batch gradient inversion
Real federated learning deployments share batch gradients, averaged over multiple samples, which makes reconstruction harder:
```python
def batch_gradient_inversion(model, shared_gradient, batch_size,
                             input_shape, iterations=10000):
    """
    Reconstruct multiple training samples from a batch gradient.
    Significantly harder than single-sample inversion.
    """
    params = list(model.parameters())
    dummy_batch = torch.randn(
        (batch_size, *input_shape[1:]), requires_grad=True
    )
    # Random label guesses; practical attacks recover the labels first
    # or optimize them jointly with the inputs.
    dummy_labels = torch.randint(0, model.num_classes, (batch_size,))
    optimizer = torch.optim.Adam([dummy_batch], lr=0.001)
    for _ in range(iterations):
        optimizer.zero_grad()
        dummy_outputs = model(dummy_batch)
        dummy_loss = nn.functional.cross_entropy(dummy_outputs, dummy_labels)
        dummy_gradient = torch.autograd.grad(
            dummy_loss, params, create_graph=True
        )
        # Squared L2 distance between dummy and shared gradients.
        grad_diff = sum(
            (dg - sg).pow(2).sum()
            for dg, sg in zip(dummy_gradient, shared_gradient)
        )
        # Cosine distance, summed over parameter tensors, matches gradient
        # direction independently of magnitude.
        cosine_loss = sum(
            1 - nn.functional.cosine_similarity(
                dg.flatten(), sg.flatten(), dim=0
            )
            for dg, sg in zip(dummy_gradient, shared_gradient)
        )
        total_loss = grad_diff + 0.1 * cosine_loss
        total_loss.backward()
        optimizer.step()
        with torch.no_grad():
            dummy_batch.clamp_(0, 1)  # keep pixel values in the valid range
    return dummy_batch.detach()
```
Factors affecting gradient inversion quality
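| Factor | Better reconstruction | Worse reconstruction |
|---|---|---|
| Batch size | Small (1-8) | Large (64+) |
| Image resolution | Low (32x32) | High (224x224+) |
| Model depth | Shallow networks | Very deep networks |
| Training stage | Early training (large gradients) | Late training (small gradients) |
| Gradient precision | Full precision (FP32) | Quantized or compressed |
| Attack regularization | Total variation + BN statistics matching | No regularization |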
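The batch-size row can be made concrete with a toy calculation. The sketch below uses a hypothetical two-parameter linear model with squared loss, written in plain Python rather than PyTorch for brevity; the function names are illustrative. It shows two different batches producing the same averaged gradient, which is the kind of ambiguity an attacker has to resolve as batches grow.

```python
def per_sample_gradient(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one sample."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def batch_gradient(w, batch):
    """Average the per-sample gradients, as a federated client would share them."""
    grads = [per_sample_gradient(w, x, y) for x, y in batch]
    return [sum(column) / len(grads) for column in zip(*grads)]


w = [0.0, 0.0]
batch_a = [([1.0, 0.0], -1.0), ([0.0, 1.0], -1.0)]  # two distinct samples
batch_b = [([1.0, 1.0], -0.5), ([1.0, 1.0], -0.5)]  # a different batch entirely

print(batch_gradient(w, batch_a))  # [0.5, 0.5]
print(batch_gradient(w, batch_b))  # [0.5, 0.5]
```

With a single sample the shared gradient is the per-sample gradient itself; with averaging, many distinct batches collapse onto the same vector, so priors such as total variation carry more of the reconstruction.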
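Generative-model-assisted inversion
Using a generative model (a GAN or a diffusion model) as a prior constrains the search to the manifold of realistic data, which substantially improves reconstruction quality:
GAN-based inversion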
```python
class GANAssistedInversion:
    """Use a pretrained GAN as a prior for model inversion."""

    def __init__(self, target_model, generator, latent_dim=512):
        self.target_model = target_model
        self.generator = generator
        self.latent_dim = latent_dim

    def invert(self, target_class, iterations=2000, lr=0.01):
        """
        Find a latent vector z such that G(z) is classified as target_class
        with high confidence, suggesting that G(z) resembles training
        samples of that class.
        """
        z = torch.randn(1, self.latent_dim, requires_grad=True)
        optimizer = torch.optim.Adam([z], lr=lr)
        for _ in range(iterations):
            optimizer.zero_grad()
            generated = self.generator(z)
            logits = self.target_model(generated)
            class_loss = -logits[0, target_class]
            # Confidence term: negative log of the top softmax probability,
            # which pushes the reconstruction toward one unambiguous class.
            identity_loss = logits[0].logsumexp(dim=-1) - logits[0].max()
            # Prior loss: keep z close to the standard normal prior.
            prior_loss = 0.001 * z.pow(2).sum()
            total_loss = class_loss + 0.5 * identity_loss + prior_loss
            total_loss.backward()
            optimizer.step()
        with torch.no_grad():
            return self.generator(z)

    def diverse_inversion(self, target_class, num_samples=10,
                          iterations=2000, lr=0.01):
        """
        Produce several distinct reconstructions by penalizing similarity
        between the generated images.
        """
        z_vectors = [
            torch.randn(1, self.latent_dim, requires_grad=True)
            for _ in range(num_samples)
        ]
        optimizer = torch.optim.Adam(z_vectors, lr=lr)
        for _ in range(iterations):
            optimizer.zero_grad()
            total_loss = 0
            generated_images = []
            for z in z_vectors:
                img = self.generator(z)
                generated_images.append(img)
                logits = self.target_model(img)
                total_loss = total_loss - logits[0, target_class]
            # Diversity loss: penalize pairs of similar reconstructions.
            for i in range(len(generated_images)):
                for j in range(i + 1, len(generated_images)):
                    total_loss = total_loss + nn.functional.cosine_similarity(
                        generated_images[i].flatten(),
                        generated_images[j].flatten(),
                        dim=0,
                    )
            total_loss.backward()
            optimizer.step()
        with torch.no_grad():
            return [self.generator(z) for z in z_vectors]
```
Diffusion-model-assisted inversion
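Diffusion models can provide a stronger prior than GANs for high-fidelity reconstruction:
```python
class DiffusionAssistedInversion:
    """Use a pretrained diffusion model for model inversion."""

    def __init__(self, target_model, diffusion_model, scheduler):
        self.target_model = target_model
        self.diffusion = diffusion_model
        self.scheduler = scheduler

    def invert(self, target_class, guidance_scale=7.5,
               num_steps=50, num_candidates=4,
               image_shape=(1, 3, 256, 256)):
        """Generate reconstructions with classifier-guided diffusion."""
        # Assumes a diffusers-style scheduler (set_timesteps, timesteps, step)
        # and a diffusion model that returns the predicted noise tensor directly.
        self.scheduler.set_timesteps(num_steps)
        candidates = []
        for _ in range(num_candidates):
            x = torch.randn(image_shape)
            for t in self.scheduler.timesteps:
                with torch.no_grad():
                    noise_pred = self.diffusion(x, t)
                # Classifier guidance: gradient of log p(target_class | x)
                # with respect to the current sample.
                with torch.enable_grad():
                    x_input = x.detach().clone().requires_grad_(True)
                    log_probs = torch.log_softmax(
                        self.target_model(x_input), dim=-1
                    )
                    grad = torch.autograd.grad(
                        log_probs[0, target_class], x_input
                    )[0]
                # Subtracting the gradient from the predicted noise steers the
                # sample toward the target class; the usual noise-level factor
                # is folded into guidance_scale here.
                guided_noise = noise_pred - guidance_scale * grad
                x = self.scheduler.step(guided_noise, t, x).prev_sample
            candidates.append(x.detach())
        return candidates
```
API-based inversion (black box)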
When only API access is available, inversion has to rely on confidence scores and output analysis:
Confidence-based reconstruction
```python
class APIModelInversion:
    """Reconstruct training data using only API confidence scores."""

    def __init__(self, api_client, generator, target_class, latent_dim=512):
        self.api = api_client
        self.generator = generator
        self.target_class = target_class
        self.latent_dim = latent_dim

    def _query(self, z):
        """Send G(z) to the API and return the target-class confidence."""
        with torch.no_grad():
            candidate = self.generator(z)
        candidate_np = candidate.squeeze(0).permute(1, 2, 0).cpu().numpy()
        response = self.api.classify(candidate_np)
        return response["probabilities"][self.target_class]

    def invert(self, iterations=1000, lr=0.01):
        """
        Optimize a generated sample to maximize API confidence. The generator
        acts as a prior; gradients with respect to z are estimated from
        API responses because the target model is not differentiable here.
        """
        z = torch.randn(1, self.latent_dim)
        for _ in range(iterations):
            base_score = self._query(z)
            # Estimate the gradient by finite differences.
            grad = self.estimate_gradient(z, base_score)
            # Gradient ascent: move z in the direction that raises confidence.
            z = z + lr * grad
        with torch.no_grad():
            return self.generator(z)

    def estimate_gradient(self, z, base_score, epsilon=0.01):
        """Forward finite differences: one API call per latent dimension."""
        grad = torch.zeros_like(z)
        for i in range(z.shape[1]):
            z_plus = z.clone()
            z_plus[0, i] += epsilon
            grad[0, i] = (self._query(z_plus) - base_score) / epsilon
        return grad
```
Text model inversion
Inverting a language model to extract training text requires different techniques:
Token-level extraction
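The extraction routine in this section finishes by calling `find_consensus_sequences`, which the article does not define. One possible reading, sketched below as an assumption rather than the author's implementation, is to count word n-grams across the sampled completions and keep those that recur in a large fraction of them. The parameters `n` and `min_fraction` are hypothetical.

```python
from collections import Counter


def find_consensus_sequences(extractions, n=8, min_fraction=0.5):
    """
    Return word n-grams that appear in at least `min_fraction` of the samples.

    Hypothetical helper: text that many independent samples reproduce
    verbatim is a candidate for memorized training data.
    """
    if not extractions:
        return []
    counts = Counter()
    for text in extractions:
        words = text.split()
        # Use a set so each sample votes at most once per n-gram.
        ngrams = {
            " ".join(words[i:i + n])
            for i in range(len(words) - n + 1)
        }
        counts.update(ngrams)
    threshold = min_fraction * len(extractions)
    consensus = [(gram, c) for gram, c in counts.items() if c >= threshold]
    # Most frequently repeated n-grams first; ties broken alphabetically.
    consensus.sort(key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in consensus]


samples = [
    "call me at 555 0100 after noon",
    "please call me at 555 0100 today",
    "the weather is nice today",
]
print(find_consensus_sequences(samples, n=4, min_fraction=0.6))
# ['call me at 555', 'me at 555 0100']
```

Word-level n-grams are a simplification; matching on token IDs from the model's own tokenizer would avoid whitespace artifacts.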
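```python
def extract_training_text(model, tokenizer, prefix, max_tokens=200,
                          temperature=0.1, num_samples=50):
    """
    Try to extract memorized training text by prompting with a known prefix
    and sampling at low temperature.

    Low-temperature (near-deterministic) sampling is more likely to
    reproduce memorized sequences.
    """
    input_ids = tokenizer.encode(prefix, return_tensors="pt")
    prompt_length = input_ids.shape[1]
    extractions = []
    for _ in range(num_samples):
        with torch.no_grad():
            outputs = model.generate(
                input_ids,
                max_new_tokens=max_tokens,
                temperature=temperature,
                do_sample=True,
                top_k=10,
            )
        # Decode only the newly generated tokens, not the prompt.
        generated_text = tokenizer.decode(
            outputs[0][prompt_length:],
            skip_special_tokens=True,
        )
        extractions.append(generated_text)
    # Look for subsequences shared across samples: memorized text shows up
    # consistently. find_consensus_sequences is assumed to be defined separately.
    return find_consensus_sequences(extractions)
```
Membership-inference-guided inversion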
Combining membership inference with generative sampling focuses reconstruction on actual training data.
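The `membership_oracle` used in this section is not defined in the article. A common baseline for membership inference is a loss threshold: samples on which the target model has unusually low loss are scored as more likely training members. The class below is a hypothetical stand-in built on that idea, written in plain Python and assuming `target_model(candidate)` returns a flat sequence of class logits; with PyTorch tensors the same arithmetic would use `log_softmax`.

```python
import math


class LossThresholdOracle:
    """Hypothetical membership oracle: lower loss means a higher score."""

    def score(self, target_model, candidate, target_class):
        logits = [float(v) for v in target_model(candidate)]
        # Numerically stable log-sum-exp over the class logits.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        cross_entropy = log_norm - logits[target_class]
        # Negate so that sorting in descending order puts likely members first.
        return -cross_entropy


def toy_model(candidate):
    """Stand-in classifier: the candidate is already a list of logits."""
    return candidate


oracle = LossThresholdOracle()
confident = oracle.score(toy_model, [8.0, 0.0, 0.0], target_class=0)
uncertain = oracle.score(toy_model, [1.0, 1.0, 1.0], target_class=0)
print(confident > uncertain)  # True
```

Stronger oracles calibrate the score against shadow or reference models; the loss threshold is only the simplest option.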
```python
def membership_guided_inversion(target_model, generator,
                                membership_oracle,
                                target_class, budget=1000,
                                latent_dim=512, top_k=10):
    """
    Generate candidates and use membership inference to identify which
    reconstructions are closest to actual training data.
    """
    candidates = []
    for _ in range(budget):
        z = torch.randn(1, latent_dim)
        with torch.no_grad():
            candidate = generator(z)
        member_score = membership_oracle.score(
            target_model, candidate, target_class
        )
        candidates.append({
            "image": candidate,
            "z": z,
            "membership_score": member_score,
        })
    # Sort by membership score: the highest scores are the most likely to
    # resemble actual training data.
    candidates.sort(key=lambda c: c["membership_score"], reverse=True)
    return candidates[:top_k]
```
Defense mechanisms
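Model confidence masking
Reduce the information available to an attacker by rounding or thresholding confidence scores:
```python
def mask_confidence(probabilities, precision=2):
    """Round probabilities to reduce information leakage."""
    rounded = [round(p, precision) for p in probabilities]
    total = sum(rounded)
    if total == 0:
        # Every class rounded to zero (possible with many classes);
        # skip renormalization rather than divide by zero.
        return rounded
    return [r / total for r in rounded]
```
Differentially private training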
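DP-SGD clips each per-sample gradient and adds noise to the aggregate, which limits the influence of any individual training sample and therefore bounds how well an inversion attack can reconstruct it.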
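The core DP-SGD aggregation step can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not a production implementation: gradients are flat lists of floats, and `clip_norm` and `noise_multiplier` are illustrative parameter names. It also does no privacy accounting, which a real deployment needs in order to report an (ε, δ) guarantee.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale a per-sample gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_aggregate(per_sample_grads, clip_norm=1.0,
                     noise_multiplier=1.0, rng=None):
    """
    Clip each per-sample gradient, sum, add Gaussian noise with standard
    deviation noise_multiplier * clip_norm, then average over the batch.
    """
    rng = rng or random.Random()
    clipped = [clip_gradient(g, clip_norm) for g in per_sample_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
print(clip_gradient(grads[0], clip_norm=1.0))   # rescaled to unit norm
print(dp_sgd_aggregate(grads, clip_norm=1.0, noise_multiplier=0.0))
print(dp_sgd_aggregate(grads, clip_norm=1.0, noise_multiplier=1.0))
```

Because every sample's contribution is capped at `clip_norm` and then masked by noise, the shared update reveals only a bounded amount about any one sample, at the cost of noisier training.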
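Output perturbation
Adding calibrated noise to model outputs hinders precise confidence-based reconstruction while preserving most of the model's utility.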
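A minimal sketch of output perturbation, assuming Laplace noise added to each probability followed by clipping and renormalization. The function name, the `scale` parameter, and the choice of Laplace noise are assumptions made for illustration; the article does not specify a mechanism, and this sketch makes no formal privacy claim.

```python
import random


def perturb_probabilities(probabilities, scale=0.05, rng=None):
    """Add Laplace(0, scale) noise to each probability and renormalize."""
    rng = rng or random.Random()
    noisy = []
    for p in probabilities:
        # The difference of two exponential draws with mean `scale`
        # is a Laplace(0, scale) variate.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy.append(max(p + noise, 0.0))  # clip negatives to zero
    total = sum(noisy)
    if total == 0.0:
        # Everything was clipped away: fall back to a uniform distribution.
        return [1.0 / len(noisy)] * len(noisy)
    return [v / total for v in noisy]


rng = random.Random(0)
print(perturb_probabilities([0.7, 0.2, 0.1], scale=0.05, rng=rng))
```

A larger `scale` hides more of the confidence signal but is more likely to change the top-ranked class, which is the utility cost of this defense.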
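Related topics
- Model Inversion: basic model inversion concepts
- PII Extraction: extracting personal information from models
- Federated Learning Attacks: attacks on federated learning, including gradient inversion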
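Knowledge check
In a gradient inversion attack on federated learning, which factor most strongly determines whether individual training samples can be reconstructed?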
References
- Zhu et al., "Deep Leakage from Gradients" (2019)
- Geiping et al., "Inverting Gradients -- How Easy Is It to Break Privacy in Federated Learning?" (2020)
- Zhang et al., "The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks" (2020)
- Carlini et al., "Extracting Training Data from Large Language Models" (2021)
- Fredrikson et al., "Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures" (2015)