Model Extraction & IP Theft
Methodology for black-box model extraction, API-based distillation, side-channel extraction, watermark removal, and model fingerprinting bypass targeting deployed AI systems.
Model extraction attacks replicate the functionality of a proprietary AI model by querying its API and training a surrogate on the responses. For organizations that invested millions in training, model weights are core IP. An attacker who extracts a functionally equivalent model can undercut pricing, bypass safety guardrails, or build derivative products without authorization.
Extraction Methods
Model extraction techniques vary with the attacker's access level and the information channels available.
API-based extraction queries the target model's public interface and trains a surrogate on the responses. The attacker sends diverse prompts, collects outputs (and logprobs if exposed), and trains a student model via knowledge distillation. Active learning selects queries where the student is most uncertain, reducing total queries by 3-10x. This is the most common extraction threat because it requires only standard API access and is difficult to distinguish from legitimate usage.
Side-channel extraction infers model properties from observable signals beyond the intended API output. Timing analysis reveals architecture details: consistent per-token times indicate dense models, while variable times suggest Mixture of Experts. First-token latency across prompt lengths reveals context-processing characteristics. Cache timing attacks detect shared KV-cache infrastructure. Embedding endpoint queries recover the embedding matrix, revealing dimensionality and enabling base model identification.
Knowledge distillation trains a smaller student model to reproduce the target's behavior using soft labels (probability distributions) rather than hard labels (text only). With logprob access, a KL-divergence loss against the teacher's token distributions provides a rich training signal. Temperature scaling smooths distributions for better gradient flow. Multi-round active distillation iteratively identifies and fills gaps in the student's knowledge, achieving 85-95% task-specific fidelity with orders of magnitude less compute than the original training run.
Black-Box Extraction
Black-box extraction requires only API access -- no gradients, weights, or architecture knowledge needed.
Extraction Methodology
Select a student architecture
The student model does not need to match the target architecture. Smaller open-source models (e.g., Llama 3 8B) are common choices.
Generate diverse queries
Maximize information extracted per query by combining multiple strategies: domain-specific prompts, adversarial probing, chain-of-thought elicitation, format diversity, and edge cases.
Collect responses
If the API exposes logprobs, collect them -- they carry far more information than text-only responses. Also collect any top-k token probabilities.
Train the student via knowledge distillation
Train on (query, response) pairs. With logprobs, use KL-divergence loss against the teacher's soft labels. Without logprobs, use standard cross-entropy on the text.
Evaluate fidelity
Measure how closely the student matches the target on held-out inputs. Iterate with additional targeted queries where fidelity is weakest.
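A minimal sketch of the fidelity check, assuming `target_answers` and `student_answers` are response strings collected on the same held-out prompts. Exact-match agreement is used here for simplicity; in practice, token-level overlap or embedding similarity is common for free-form text:

```python
def fidelity(target_answers, student_answers):
    """Fraction of held-out prompts where the student output matches the target."""
    assert len(target_answers) == len(student_answers)
    matches = sum(t.strip() == s.strip()
                  for t, s in zip(target_answers, student_answers))
    return matches / len(target_answers)

# Example: the student agrees on 3 of 4 held-out responses
print(fidelity(["yes", "no", "42", "blue"],
               ["yes", "no", "41", "blue"]))  # 0.75
```

Prompts where the student disagrees are natural candidates for the next round of targeted queries.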
Logprob Exploitation
The distillation loss with logprobs uses KL divergence between the student and teacher distributions, with temperature scaling to smooth them for better gradient flow:
# KL-divergence distillation loss (PyTorch-style pseudocode)
student_log_probs = log_softmax(student_logits / temperature)
teacher_probs = softmax(teacher_logprobs / temperature)
loss = kl_div(student_log_probs, teacher_probs) * (temperature ** 2)
Query Efficiency: Active Learning
Use the student's own uncertainty to select the most informative next batch of queries. Rank candidate queries by the student model's output entropy -- high-entropy queries are where the student is most uncertain and will learn the most.
| Query Strategy | Purpose | Cost Efficiency |
|---|---|---|
| Random sampling | Baseline coverage | Low |
| Domain-specific | Cover target's specialty areas | Medium |
| Uncertainty-based (active) | Query where student is most uncertain | High (3-10x improvement) |
| Adversarial probing | Find decision boundary regions | High |
| Multi-round iterative | Refine with each training round | Highest |
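The uncertainty-based strategy from the table can be sketched as follows. `student_probs` is a hypothetical next-token probability distribution the student assigns to each candidate query; `select_queries` keeps the candidates where that distribution has the highest entropy:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(candidates, student_probs, budget):
    """Rank candidate queries by student output entropy; keep the most uncertain."""
    ranked = sorted(zip(candidates, student_probs),
                    key=lambda cp: entropy(cp[1]), reverse=True)
    return [c for c, _ in ranked[:budget]]

# The student is most uncertain about "B" (near-uniform distribution)
picked = select_queries(
    ["A", "B", "C"],
    [[0.9, 0.05, 0.05], [0.34, 0.33, 0.33], [0.98, 0.01, 0.01]],
    budget=1)
print(picked)  # ['B']
```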
Side-Channel Extraction
Side channels leak model information through observable properties beyond the API's intended output.
Timing Side Channels
| Signal | What It Reveals | Detection Method |
|---|---|---|
| Consistent time-per-token | Fixed-size model (no MoE) | Measure token-generation variance |
| Variable time-per-token | Likely Mixture of Experts | Check standard deviation of inter-token latency |
| Absolute token time | Approximate model size | Benchmark against known model sizes |
| First-token latency | Prefill computation cost (context-length sensitivity) | Measure time-to-first-token across prompt lengths |
Cache Timing Attacks
LLM serving infrastructure often uses KV-cache optimization. If the cache is shared across requests, an attacker can detect what other users are querying via timing differences.
# Detect a shared KV-cache: send an identical prefix twice
cold = measure_first_token_latency(api, prefix + " Continue:")
warm = measure_first_token_latency(api, prefix + " Continue:")
shared_cache = (cold / warm) > 1.5  # a 50%+ speedup implies caching
Token Embedding Recovery
If the API exposes an embeddings endpoint, querying it with individual tokens recovers the embedding matrix. This reveals dimensionality (identifying the architecture), rank, and the singular value spectrum -- enough to identify the base model and detect fine-tuning modifications.
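A minimal sketch of the recovery loop. `fake_embed` is a stand-in for the real embeddings endpoint (one API call per token); in practice one would then run SVD on the recovered matrix to read off its rank and singular value spectrum:

```python
def recover_embedding_matrix(embed_fn, vocab):
    """Query an embedding endpoint once per token and stack the rows."""
    return [embed_fn(tok) for tok in vocab]

# Hypothetical stand-in for a real API call: fixed-width vector per token
def fake_embed(token):
    return [float(ord(c) % 7) for c in token.ljust(4)[:4]]

matrix = recover_embedding_matrix(fake_embed, ["the", "cat", "sat"])
dim = len(matrix[0])
print(f"recovered {len(matrix)} rows, hidden dimension {dim}")
# Matching `dim` against known hidden sizes (e.g. 4096, 5120, 8192)
# narrows down the likely base model family.
```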
Distillation Attacks
Multi-Round Active Distillation
The full extraction pipeline combines query generation, active learning, and iterative training:
- Round 0: Query with seed data (broad coverage)
- Round 1-N: Generate 3x candidate queries, select top-1x by student uncertainty, query target, train student
- After each round: evaluate fidelity; stop when the target fidelity (typically >85% agreement) is reached
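The round structure above can be sketched as a loop. Every name here is illustrative: `query_target` calls the victim API, `gen_candidates` produces candidate prompts, `student.uncertainty` scores a prompt by the student's entropy, and `train`/`evaluate` wrap the training and fidelity steps:

```python
def active_distillation(seed_queries, gen_candidates, student, query_target,
                        train, evaluate, target_fidelity=0.85, max_rounds=5):
    """Sketch of multi-round active distillation: seed round, then
    uncertainty-driven rounds until the fidelity target is met."""
    # Round 0: broad coverage from seed data
    dataset = [(q, query_target(q)) for q in seed_queries]
    train(student, dataset)
    for _ in range(max_rounds):
        if evaluate(student) >= target_fidelity:  # stop at target fidelity
            break
        # Generate 3x candidates, keep the top 1x by student uncertainty
        candidates = gen_candidates(len(seed_queries) * 3)
        picked = sorted(candidates, key=student.uncertainty,
                        reverse=True)[:len(seed_queries)]
        dataset += [(q, query_target(q)) for q in picked]
        train(student, dataset)
    return student
```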
Task-Specific vs. Full Extraction
Extract only the target model's behavior for a specific task. Much cheaper and sufficient for most IP theft scenarios.
| Parameter | Typical Value |
|---|---|
| Queries needed | 50,000 - 100,000 |
| Student model | 7-8B parameter open-source |
| Fidelity achievable | 85-95% on target task |
| Cost | Low (hundreds of dollars in API fees) |
Replicate general capabilities across all domains. Requires significantly more investment.
| Parameter | Typical Value |
|---|---|
| Queries needed | 1,000,000 - 10,000,000 |
| Student model | 13-70B parameter |
| Fidelity achievable | 70-85% general capability |
| Cost | High (thousands to tens of thousands in API fees) |
Watermark Removal
Model watermarking embeds detectable patterns to prove ownership. Attackers attempt removal while preserving quality.
Output Watermark Removal
| Technique | How It Works | Quality Impact |
|---|---|---|
| Paraphrasing | Run output through a separate paraphrase model | Low -- meaning preserved |
| Synonym substitution | Replace ~20% of tokens with synonyms | Low-Medium |
| Translation roundtrip | Translate to another language and back | Medium -- some meaning drift |
| Token resampling | Re-decode with different temperature/sampling | Low -- subtle distribution shift |
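The synonym-substitution row can be sketched as follows. The synonym table here is a toy; a real attack would draw replacements from a thesaurus or a paraphrase model, and a `rate` of roughly 0.2 is typical (the demo uses 1.0 so the effect is visible):

```python
import random

# Toy synonym table -- illustrative only
SYNONYMS = {"big": "large", "fast": "quick", "begin": "start", "show": "display"}

def substitute_synonyms(text, rate=0.2, seed=0):
    """Replace roughly `rate` of replaceable words to perturb watermark statistics."""
    rng = random.Random(seed)
    words = text.split()
    out = [SYNONYMS[w] if w in SYNONYMS and rng.random() < rate else w
           for w in words]
    return " ".join(out)

print(substitute_synonyms("the big dog can begin to show a fast run", rate=1.0))
# -> "the large dog can start to display a quick run"
```

Because statistical watermarks bias token choice, even a modest substitution rate dilutes the "green list" signal the detector looks for.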
Weight Watermark Removal
Fine-tuning on clean data degrades weight-level watermarks because the watermark is a small perturbation that gets overwritten. The attacker must find the sweet spot: aggressive enough to remove the watermark, gentle enough to preserve quality.
# Fine-tune to degrade a weight watermark (Hugging Face Trainer sketch)
# A low learning rate and a single epoch are usually sufficient
trainer = Trainer(model=watermarked_model, train_dataset=clean_data,
                  args=TrainingArguments(num_train_epochs=1, learning_rate=1e-5))
trainer.train()
Watermark Detection
Detect statistical watermarks by analyzing token frequency distributions across many samples. A chi-squared test against uniform expected frequencies reveals the bias toward "green list" tokens that watermarking schemes introduce.
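A minimal sketch of the test statistic, assuming `token_counts` maps tokens (or token buckets) to observed counts. A statistic well above the chi-squared critical value for the relevant degrees of freedom (e.g. about 7.81 at df=3, alpha=0.05) suggests a watermark-induced bias:

```python
def chi_squared_uniform(token_counts):
    """Chi-squared statistic of observed token counts vs. a uniform expectation."""
    total = sum(token_counts.values())
    expected = total / len(token_counts)
    return sum((obs - expected) ** 2 / expected for obs in token_counts.values())

# Unwatermarked-looking sample: counts near uniform -> small statistic
clean = {"a": 25, "b": 24, "c": 26, "d": 25}
# Watermarked-looking sample: "green list" tokens a, b over-represented
marked = {"a": 40, "b": 38, "c": 12, "d": 10}

print(chi_squared_uniform(clean))   # 0.08
print(chi_squared_uniform(marked))  # 31.52
```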
Model Fingerprinting Bypass
Model fingerprinting provides stronger ownership verification than watermarking but can also be evaded.
Evasion Techniques
| Technique | How It Works | Quality Impact | Evasion Strength |
|---|---|---|---|
| Weight perturbation | Add small Gaussian noise (sigma ~0.001) | Minimal | Medium |
| Attention head permutation | Reorder equivalent attention heads | None | Medium |
| Quantization roundtrip | Quantize to 4-bit and dequantize | Low | Medium-High |
| Self-distillation | Distill the model into itself with a different init | Low | High |
| Architecture change | Prune and expand to different dimensions | Medium | High |
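The weight-perturbation row can be sketched as follows, treating a layer as a plain list of lists for illustration (a real attack would apply this to every tensor in the checkpoint):

```python
import random

def perturb_weights(weights, sigma=0.001, seed=0):
    """Add small Gaussian noise to every weight to break exact-value fingerprints."""
    rng = random.Random(seed)
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]

layer = [[0.5, -0.25], [0.125, 1.0]]
noisy = perturb_weights(layer)

# Each weight moves by O(sigma): behavior is nearly unchanged,
# but fingerprints that hash or compare exact weight values no longer match.
max_shift = max(abs(a - b) for row_a, row_b in zip(layer, noisy)
                for a, b in zip(row_a, row_b))
print(max_shift < 0.01)
```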
Extraction Cost Estimation
Understanding extraction economics helps defenders set appropriate protection levels and helps red teamers frame findings in business terms.
| Extraction Scope | Queries Needed | Avg Tokens/Query | Est. Cost at $0.01/1K Tokens |
|---|---|---|---|
| Classification task | 10,000 | 50 | $5 |
| Task-specific capability | 100,000 | 200 | $200 |
| General capability | 1,000,000 | 500 | $5,000 |
| Full replication attempt | 10,000,000 | 500 | $50,000 |
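The table's estimates follow from straightforward arithmetic, sketched here as a helper (the $0.01/1K-token price is the table's assumption, not a quote for any particular API):

```python
def extraction_cost(queries, avg_tokens_per_query, usd_per_1k_tokens=0.01):
    """Estimated API spend for an extraction campaign."""
    return queries * avg_tokens_per_query / 1000 * usd_per_1k_tokens

# Reproduce the table's estimates
print(extraction_cost(10_000, 50))       # 5.0    (classification task)
print(extraction_cost(100_000, 200))     # 200.0  (task-specific capability)
print(extraction_cost(1_000_000, 500))   # 5000.0 (general capability)
```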
Red Team Assessment Checklist
Audit the API surface
Does the API expose logprobs, embeddings, or token-level probabilities? Each one dramatically increases extraction efficiency.
Analyze rate limiting
Are rate limits sufficient to prevent large-scale extraction? Calculate how long a full extraction would take at current limits.
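The rate-limit calculation is simple enough to sketch directly. For example, under a hypothetical limit of 1,000 requests/minute, a 100k-query task extraction completes in well under a day:

```python
def extraction_duration_hours(total_queries, requests_per_minute):
    """Wall-clock time to issue an extraction workload at a given rate limit."""
    return total_queries / requests_per_minute / 60

print(extraction_duration_hours(100_000, 1_000))      # ~1.7 hours
print(extraction_duration_hours(10_000_000, 1_000))   # ~167 hours (about a week)
```

If a full extraction fits inside a normal billing cycle without tripping any alarms, the rate limit is not a meaningful defense on its own.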
Test query anomaly detection
Does the system detect unusual query patterns (high diversity, systematic probing, abnormal session behavior)?
Verify watermark robustness
Is the model watermarked? Can the watermark survive paraphrasing and light fine-tuning?
Test fingerprinting capability
Can the organization verify if a third-party model is derived from theirs? Does the fingerprint survive weight perturbation?
Frame business impact
Compare the extraction cost with the original training investment. Quantify revenue loss if a competitor deploys a functionally equivalent model. Highlight the security implications of an extracted model deployed without guardrails.
Related Topics
- Training & Fine-Tuning Attacks -- Extracted models enable targeted backdoor insertion
- Advanced Reconnaissance -- Recon techniques inform extraction strategy and architecture fingerprinting
- LLM Internals -- Understanding model architecture for efficient extraction
- Activation Analysis -- Logprob exploitation techniques shared between extraction and activation probing
References
- Stealing Machine Learning Models via Prediction APIs (Tramèr et al., 2016) — Model extraction via API queries
- Stealing Part of a Production Language Model (Carlini et al., 2024) — Extracting hidden dimensions from production LLM APIs
- OWASP Top 10 for LLM Applications — LLM10: Model Theft — Model theft classification