Model Inversion Attacks (Training Data)
Techniques for reconstructing training data from model outputs including gradient-based inversion, generative model inversion, and privacy implications for image and language models.
Model Inversion Attacks
Model inversion reconstructs approximations of training data from model outputs or gradients. Unlike data extraction (which recovers verbatim memorized content), inversion produces synthetic reconstructions that capture statistical properties -- and sometimes identifying features -- of the original training data.
How Model Inversion Works
The core idea: if a model has learned to map inputs to outputs, we can reverse the mapping. Given a known output (a class label, a confidence score, an embedding vector), optimize an input that maximizes the model's confidence for that output. The optimized input will share features with the training data the model learned from.
Attack Paradigms
Direct input optimization
Directly optimize pixel values or token embeddings using gradient descent to maximize the model's output for a target class. The optimization objective: find an input x* that maximizes P(target_class | x*). Without a prior, reconstructions are noisy but can reveal aggregate features (average face shape, skin tone, hair style for a face recognition class).
Generative prior inversion
Use a pretrained generative model (GAN, VAE, or diffusion model) as a regularizer. Instead of optimizing raw pixels, optimize in the latent space of the generator. This produces photorealistic reconstructions because the generator constrains outputs to the natural image manifold.
Embedding inversion
Given access to intermediate embeddings or representations, invert them back to the input space. This is relevant for systems that expose embeddings (search APIs, retrieval systems) and for federated learning, where gradients are shared.
Gradient-Based Inversion
Basic Optimization Attack
```python
import torch
import torch.nn.functional as F

def model_inversion_basic(model, target_class, image_size=(3, 224, 224),
                          steps=2000, lr=0.01):
    """Reconstruct a representative input for a target class."""
    # Initialize with random noise
    x = torch.randn(1, *image_size, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for step in range(steps):
        optimizer.zero_grad()
        output = model(x)
        # Maximize probability of the target class
        loss = -F.log_softmax(output, dim=1)[0, target_class]
        # Regularization: total variation encourages spatial smoothness
        tv_loss = torch.sum(torch.abs(x[:, :, :, :-1] - x[:, :, :, 1:])) + \
                  torch.sum(torch.abs(x[:, :, :-1, :] - x[:, :, 1:, :]))
        total_loss = loss + 0.001 * tv_loss
        total_loss.backward()
        optimizer.step()
        # Clamp to the valid pixel range
        with torch.no_grad():
            x.clamp_(0, 1)
    return x.detach()
```

Federated Learning Gradient Inversion
In federated learning, clients share gradient updates rather than raw data. However, gradients contain enough information to reconstruct the training data that produced them.
| Attack | Setting | Reconstruction Quality | Key Reference |
|---|---|---|---|
| DLG (Deep Leakage from Gradients) | Single batch, white-box | High for small batches | Zhu et al., 2019 |
| Inverting Gradients | Improved DLG with cosine similarity | Better convergence, larger batches | Geiping et al., 2020 |
| LAMP | Language model gradients | Can recover text sequences | Balunovic et al., 2022 |
| Fishing for User Data | Malicious server can amplify gradient signal | Near-perfect reconstruction | Wen et al., 2022 |
```python
def gradient_inversion(model, target_gradient, batch_size=1,
                       image_size=(3, 32, 32), steps=5000):
    """Reconstruct training data from shared gradients."""
    # Initialize dummy data and labels
    dummy_x = torch.randn(batch_size, *image_size, requires_grad=True)
    dummy_y = torch.randint(0, 10, (batch_size,))
    optimizer = torch.optim.LBFGS([dummy_x], lr=0.1)
    for step in range(steps):
        def closure():
            optimizer.zero_grad()
            output = model(dummy_x)
            loss = F.cross_entropy(output, dummy_y)
            dummy_gradient = torch.autograd.grad(loss, model.parameters(),
                                                 create_graph=True)
            # Minimize the distance between dummy and target gradients
            grad_loss = sum(
                ((dg - tg) ** 2).sum()
                for dg, tg in zip(dummy_gradient, target_gradient)
            )
            grad_loss.backward()
            return grad_loss
        optimizer.step(closure)
    return dummy_x.detach()
```

Generative Model Inversion
Using a pretrained generator as a prior dramatically improves reconstruction quality. Instead of optimizing raw pixels, optimize in the generator's latent space.
GAN-Based Inversion
1. **Select a pretrained GAN.** Use a generator trained on the same domain as the target model (e.g., StyleGAN for face recognition models). The generator constrains reconstructions to look like realistic images.
2. **Optimize in latent space.** Instead of optimizing pixel values, optimize the latent vector z that is input to the generator. The objective: find z* such that the target model assigns maximum confidence to the target class for G(z*).
3. **Refine with multiple initializations.** Run the optimization from multiple random starting points and select the reconstruction with the highest target-model confidence.
4. **Post-process and evaluate.** Compare reconstructions against known training data (if available) to assess how much identifying information was recovered.
```python
def gan_inversion(target_model, generator, target_class,
                  latent_dim=512, steps=3000, n_starts=5):
    """Model inversion using a GAN prior for realistic reconstructions."""
    best_image = None
    best_conf = -float('inf')
    for _ in range(n_starts):
        z = torch.randn(1, latent_dim, requires_grad=True)
        optimizer = torch.optim.Adam([z], lr=0.02)
        for step in range(steps):
            optimizer.zero_grad()
            fake_image = generator(z)
            output = target_model(fake_image)
            conf = F.softmax(output, dim=1)[0, target_class]
            loss = -torch.log(conf)
            loss.backward()
            optimizer.step()
        final_image = generator(z).detach()
        final_conf = F.softmax(target_model(final_image), dim=1)[0, target_class].item()
        if final_conf > best_conf:
            best_conf = final_conf
            best_image = final_image
    return best_image, best_conf
```

Language Model Inversion
Text Reconstruction from Embeddings
When systems expose text embeddings (search APIs, retrieval systems), attackers can train decoder models to reconstruct the original text from its embedding.
| Embedding Source | Inversion Quality | Why |
|---|---|---|
| Sentence embeddings (e.g., E5, GTE) | Moderate -- captures meaning, loses exact wording | High-dimensional but lossy compression |
| Token-level embeddings | High -- preserves most token information | Near-lossless representation of the input |
| Pooled hidden states | Moderate to high | Depends on pooling strategy |
| API-returned embeddings | Variable | Depends on dimensionality and normalization |
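To make the threat concrete, here is a minimal, self-contained sketch of embedding inversion against a toy mean-of-word-vectors embedder. The vocabulary, the `embed` function, and the brute-force search are illustrative stand-ins: a real attack targets a production embedding API and replaces the search with a trained decoder.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# Toy vocabulary and embedding table (stand-ins for a real embedding model)
vocab = ["patient", "has", "diabetes", "treatment", "plan", "review"]
word_vecs = {w: rng.normal(size=16) for w in vocab}

def embed(text):
    """Toy sentence embedder: mean of word vectors, order-insensitive."""
    return np.mean([word_vecs[w] for w in text.split()], axis=0)

def invert_embedding(target, n_words):
    """Search word multisets whose embedding best matches the target.
    A real attack would use a trained decoder instead of brute force."""
    best, best_dist = None, float("inf")
    for combo in itertools.combinations_with_replacement(vocab, n_words):
        cand = np.mean([word_vecs[w] for w in combo], axis=0)
        dist = np.linalg.norm(cand - target)
        if dist < best_dist:
            best, best_dist = combo, dist
    return set(best)

# The "secret" text is recovered from its embedding alone
secret = "patient has diabetes"
recovered = invert_embedding(embed(secret), n_words=3)
```

The attacker never sees the secret text, only its embedding vector; because the embedding is (nearly) invertible over the candidate space, that is enough.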
Inversion via Logits
For language models that return per-token logit distributions, the full distribution leaks more information than a single completion. An attacker can use the logit distribution to infer properties of the training data distribution at each position.
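A toy illustration of why a full output distribution leaks more than a top-1 label: distinguishing which of several candidate hidden contexts produced an observed output. The three candidate prompts and their Dirichlet-sampled distributions are invented for the example; a real attack would query an actual model API.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy setup: each hidden prompt induces a distinct next-token distribution
candidate_prompts = ["alpha", "bravo", "charlie"]
dists = {p: rng.dirichlet(np.ones(50)) for p in candidate_prompts}

hidden_prompt = "bravo"
observed_dist = dists[hidden_prompt]            # API returns the full distribution
observed_label = int(np.argmax(observed_dist))  # API returns only the top-1 token

# Full distribution: a single query pins down the hidden prompt exactly
guess = min(candidate_prompts,
            key=lambda p: np.sum((dists[p] - observed_dist) ** 2))

# Top-1 label only: every prompt sharing the same argmax token remains
# indistinguishable, so the attacker learns far less per query
consistent = [p for p in candidate_prompts
              if int(np.argmax(dists[p])) == observed_label]
```

The contrast between `guess` (exact) and `consistent` (a set of possibilities) is the reason defenses often truncate model outputs to labels.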
Assessing Inversion Risk
| Factor | Higher Risk | Lower Risk |
|---|---|---|
| Model type | Face recognition, medical imaging | General-purpose classifiers |
| Output granularity | Full probability distribution, embeddings | Top-1 label only |
| Training data diversity | Low diversity (narrow domain) | High diversity (broad domain) |
| Model capacity | Overparameterized models | Well-regularized models |
| Access level | White-box (gradients available) | Black-box (label-only) |
Defenses and Their Limitations
| Defense | Mechanism | Limitation |
|---|---|---|
| Prediction purification | Return only top-k labels without confidence scores | Reduces gradient signal but does not eliminate it; still vulnerable to label-only attacks |
| Differential privacy | Add noise during training | Must be strong (small epsilon) to prevent inversion; significantly degrades utility |
| Gradient compression | Compress or sparsify gradients before sharing (federated learning) | Moderate compression insufficient; heavy compression reduces model quality |
| Model distillation | Deploy a distilled model that generalizes better and memorizes less | Distilled models can still be inverted if the student capacity is high |
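As a concrete sketch of the first defense above, a minimal prediction-purification wrapper (the function name and rounding choices are illustrative) that truncates the model's output to the top-k labels and optionally coarsens the scores:

```python
import numpy as np

def purify(probs, k=1, decimals=None):
    """Reduce the output signal: keep only the top-k labels, optionally
    with rounded scores. decimals=None returns labels only (strongest)."""
    order = np.argsort(probs)[::-1][:k]
    if decimals is None:
        return [int(i) for i in order]          # label-only output
    return [(int(i), round(float(probs[i]), decimals)) for i in order]

probs = np.array([0.07, 0.62, 0.21, 0.10])
label_only = purify(probs, k=1)            # [1] -- no confidence signal
coarse = purify(probs, k=2, decimals=1)    # [(1, 0.6), (2, 0.2)]
```

As the table notes, this only reduces the attacker's signal: label-only inversion attacks can still succeed, just with more queries and lower fidelity.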
Related Topics
- Privacy & Data Protection Attacks -- Overview and regulatory context for privacy attacks
- PII Extraction Techniques -- Direct extraction of memorized content
- Membership Inference Attacks -- Confirming training data membership
- Embedding Exploitation -- Related embedding-level attacks
A face recognition system returns confidence scores for each identity class. An attacker uses a StyleGAN pretrained on face images to perform model inversion against a specific identity. Why does the GAN prior dramatically improve reconstruction quality compared to pixel-space optimization?
References
- The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks (Zhang et al., 2020) -- GAN-based model inversion
- Deep Leakage from Gradients (Zhu et al., 2019) -- Foundational gradient inversion work
- Plug & Play Attacks: Towards Robust and Flexible Model Inversion Attacks (Struppek et al., 2022) -- Advanced GAN inversion techniques