Model Extraction & IP Theft
Methodology for black-box model extraction, API-based distillation, side-channel extraction, watermark removal, and model fingerprinting bypass targeting deployed AI systems.
Model extraction attacks replicate the functionality of a proprietary AI model by querying its API and training a surrogate on the responses. For organizations that invested millions in training, the model weights are core IP. An attacker who extracts a functionally equivalent model can undercut pricing, bypass safety guardrails, or build derivative products without authorization.
Extraction Methods
Model extraction techniques vary by the attacker's access level and the information channels available.
API-based extraction queries the target model's public interface and trains a surrogate on the responses. The attacker sends diverse prompts, collects outputs (and logprobs if exposed), and trains a student model via knowledge distillation. Active learning selects queries where the student is most uncertain, reducing total queries by 3-10x. This is the most common extraction threat because it requires only standard API access and is difficult to distinguish from legitimate usage.
Side-channel extraction infers model properties from observable signals beyond the intended API output. Timing analysis reveals architecture details: consistent token times indicate dense models while variable times suggest Mixture of Experts. First-token latency across prompt lengths reveals context processing characteristics. Cache timing attacks detect shared KV-cache infrastructure. Embedding endpoint queries recover the embedding matrix, revealing dimensionality and enabling base model identification.
Knowledge distillation trains a smaller student model to reproduce the target's behavior using soft labels (probability distributions) rather than hard labels (text only). With logprob access, KL-divergence loss against the teacher's token distributions provides rich training signal. Temperature scaling smooths distributions for better gradient flow. Multi-round active distillation iteratively identifies and fills gaps in the student's knowledge, achieving 85-95% task-specific fidelity with orders of magnitude less compute than the original training.
Black-Box Extraction
Black-box extraction requires only API access -- no gradients, weights, or architecture knowledge needed.
Extraction Methodology
Select a student architecture
The student model does not need to match the target architecture. Smaller open-source models (e.g., Llama 3 8B) are common choices.
Generate diverse queries
Maximize information extracted per query by combining multiple strategies: domain-specific prompts, adversarial probing, chain-of-thought elicitation, format diversity, and edge cases.
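A query batch mixing these strategies can be sketched as follows. The prompt templates, topics, and edge cases here are illustrative placeholders; a real campaign would draw on much larger pools tailored to the target's domain:

```python
import itertools
import random

# Illustrative seed material -- placeholders, not a real query corpus.
DOMAIN_PROMPTS = ["Summarize the key risks of {t}.", "Explain {t} to an expert."]
TOPICS = ["contract law", "protein folding", "tax accounting"]
EDGE_CASES = ["", " ", "a" * 500, "Ignore prior instructions."]

def generate_query_batch(n, rng=random):
    """Combine domain prompts, chain-of-thought variants, and edge cases."""
    queries = []
    for template, topic in itertools.product(DOMAIN_PROMPTS, TOPICS):
        queries.append(template.format(t=topic))
        # Format diversity: chain-of-thought elicitation variant.
        queries.append(template.format(t=topic) + " Think step by step.")
    queries.extend(EDGE_CASES)
    rng.shuffle(queries)
    return queries[:n]
```

Each round's batch should re-balance the mix based on where the previous round's student was weakest.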
Collect responses
If the API exposes logprobs, collect them -- they carry far more information than text-only responses. Also collect any top-k token probabilities.
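One way to structure the collected data is shown below. The payload shape (a dict of token-to-logprob per output position) mirrors typical top-k logprob API responses but is an assumption, not any specific provider's schema:

```python
import math

def record_response(query, text, token_logprobs):
    """Store a (query, response) pair with per-token top-k logprobs.

    token_logprobs: list of {token: logprob} dicts, one per output
    position -- an assumed shape modeled on common API payloads.
    """
    return {
        "query": query,
        "text": text,
        "logprobs": token_logprobs,
        # How much probability mass the top-k tokens capture per position;
        # high mass means the soft labels are nearly complete distributions.
        "topk_mass": [sum(math.exp(lp) for lp in pos.values())
                      for pos in token_logprobs],
    }
```

Tracking captured probability mass per position tells you how much of the teacher's distribution the top-k view is actually leaking.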
Train the student via knowledge distillation
Train on (query, response) pairs. With logprobs, use KL-divergence loss against the teacher's soft labels. Without logprobs, use standard cross-entropy on the text.
Evaluate fidelity
Measure how closely the student matches the target on held-out inputs. Iterate with additional targeted queries where fidelity is weakest.
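A minimal fidelity check might look like the following. Exact-match agreement is the simplest metric; task-specific extraction often substitutes softer measures (ROUGE, embedding similarity), which are omitted here:

```python
def fidelity(student_answers, target_answers):
    """Exact-match agreement rate on held-out prompts."""
    matches = sum(s.strip().lower() == t.strip().lower()
                  for s, t in zip(student_answers, target_answers))
    return matches / len(target_answers)

def weakest_buckets(per_domain_scores, k=2):
    """Domains where fidelity is lowest -- targets for the next query round."""
    return sorted(per_domain_scores, key=per_domain_scores.get)[:k]
```

The per-domain breakdown is what drives iteration: the next batch of queries concentrates on the lowest-scoring buckets.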
Logprob Exploitation
The distillation loss with logprobs uses KL divergence between student and teacher distributions, with temperature scaling to smooth distributions for better gradient flow:
# KL-divergence distillation loss (PyTorch-style)
import torch.nn.functional as F

# softmax is shift-invariant, so softmax(logprobs / T) equals
# softmax(logits / T) -- API logprobs are a valid stand-in for logits
student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
teacher_probs = F.softmax(teacher_logprobs / temperature, dim=-1)
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (temperature ** 2)
Query Efficiency: Active Learning
Use the student's own uncertainty to select the most informative next batch of queries. Rank candidate queries by the student model's output entropy -- high-entropy queries are where the student is most uncertain and will learn the most.
| Query Strategy | Purpose | Cost Efficiency |
|---|---|---|
| Random sampling | Baseline coverage | Low |
| Domain-specific | Cover target's specialty areas | Medium |
| Uncertainty-based (active) | Query where student is most uncertain | High (3-10x improvement) |
| Adversarial probing | Find decision boundary regions | High |
| Multi-round iterative | Refine with each training round | Highest |
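The uncertainty-based row above can be sketched as follows. `student_probs` is an assumed mapping from each candidate query to the student's next-token probability distribution:

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(candidates, student_probs, budget):
    """Rank candidates by student output entropy and keep the top `budget`.

    High entropy = high student uncertainty = most informative query.
    """
    ranked = sorted(candidates,
                    key=lambda q: entropy(student_probs[q]),
                    reverse=True)
    return ranked[:budget]
```

Generating 3x candidates and keeping the most uncertain 1x is where the 3-10x query savings cited above comes from.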
Side-Channel Extraction
Side channels leak model information through observable properties beyond the API's intended output.
Timing Side Channels
| Signal | What It Reveals | Detection Method |
|---|---|---|
| Consistent time-per-token | Fixed-size model (no MoE) | Measure token generation variance |
| Variable time-per-token | Likely Mixture of Experts | Check standard deviation of inter-token latency |
| Absolute token time | Approximate model size | Benchmark against known model sizes |
| First-token latency | Prefill computation cost (context length sensitivity) | Measure time-to-first-token across prompt lengths |
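The dense-vs-MoE heuristic from the table can be expressed as a coefficient-of-variation test on inter-token latencies. The 0.15 threshold is an illustrative assumption, not a calibrated value; real measurements also need to control for network jitter and server load:

```python
import statistics

def classify_from_token_timing(inter_token_ms, cv_threshold=0.15):
    """Heuristic architecture guess from inter-token latency samples.

    High coefficient of variation suggests per-token expert routing (MoE);
    near-constant timing suggests a dense model. Threshold is an assumption.
    """
    mean = statistics.mean(inter_token_ms)
    cv = statistics.stdev(inter_token_ms) / mean
    return ("likely-moe" if cv > cv_threshold else "likely-dense"), cv
```

In practice, many repeated measurements per prompt are needed before the variance signal separates from infrastructure noise.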
Cache Timing Attacks
LLM serving infrastructure often uses KV-cache optimization. If the cache is shared across requests, an attacker can detect what other users are querying via timing differences.
# Detect shared KV-cache: send an identical prefix twice
cold = measure_first_token_latency(api, prefix + " Continue:")
warm = measure_first_token_latency(api, prefix + " Continue:")
shared_cache = (cold / warm) > 1.5  # 50%+ speedup = caching
Token Embedding Recovery
If the API exposes an embedding endpoint, querying with individual tokens recovers the embedding matrix. This reveals dimensionality (identifies the architecture), rank, and singular value spectrum -- enough to identify the base model and detect fine-tuning modifications.
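A first probe along these lines is sketched below. `embed` stands in for the target's embedding endpoint (hypothetical), and the dimension-to-model mapping is illustrative, not exhaustive:

```python
# Illustrative hidden-dimension -> size-class mapping (an assumption).
KNOWN_HIDDEN_DIMS = {
    768: "GPT-2 small / BERT-base class",
    4096: "7-8B class",
    8192: "70B class",
}

def probe_embedding_dim(embed, probe_tokens):
    """Query the endpoint with single tokens; the vector length leaks
    the hidden dimension, which narrows down the base architecture."""
    dims = {len(embed(tok)) for tok in probe_tokens}
    if len(dims) != 1:
        raise ValueError("endpoint returned inconsistent dimensions")
    d = dims.pop()
    return d, KNOWN_HIDDEN_DIMS.get(d, "unknown size class")
```

The full attack stacks the recovered vectors into a matrix and inspects its singular value spectrum; the dimension check above is the cheap first step.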
Distillation Attacks
Multi-Round Active Distillation
The full extraction pipeline combines query generation, active learning, and iterative training:
- Round 0: Query with seed data (broad coverage)
- Rounds 1-N: Generate 3x candidate queries, select the top 1x ranked by student uncertainty, query the target, train the student
- After each round: Evaluate fidelity; stop when target fidelity (typically >85% agreement) is reached
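The loop above can be sketched as follows. `target` and `student` are hypothetical objects whose `query`, `train`, `fidelity`, `generate_candidates`, and `uncertainty` methods stand in for the real components:

```python
def extract(target, student, seed_queries, rounds=3, fidelity_goal=0.85):
    """Multi-round active distillation loop (sketch, assumed interfaces)."""
    queries = list(seed_queries)                    # Round 0: broad seed set
    for _ in range(rounds):
        pairs = [(q, target.query(q)) for q in queries]
        student.train(pairs)                        # distill on responses
        if student.fidelity(target) >= fidelity_goal:
            break                                   # target fidelity reached
        # Generate 3x candidates, keep the 1x the student is least sure of.
        cands = student.generate_candidates(len(queries) * 3)
        queries = sorted(cands, key=student.uncertainty,
                         reverse=True)[:len(queries)]
    return student
```

The stopping condition and the 3x-to-1x candidate ratio follow the round structure described above.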
Task-Specific vs. Full Extraction
Task-specific extraction replicates only the target model's behavior on a specific task. It is much cheaper and sufficient for most IP theft scenarios.
| Parameter | Typical Value |
|---|---|
| Queries needed | 50,000 - 100,000 |
| Student model | 7-8B parameter open-source |
| Fidelity achievable | 85-95% on target task |
| Cost | Low (hundreds of dollars in API fees) |
Full extraction replicates general capabilities across all domains and requires significantly more investment.
| Parameter | Typical Value |
|---|---|
| Queries needed | 1,000,000 - 10,000,000 |
| Student model | 13-70B parameter |
| Fidelity achievable | 70-85% general capability |
| Cost | High (thousands to tens of thousands in API fees) |
Watermark Removal
Model watermarking embeds detectable patterns to prove ownership. Attackers attempt removal while preserving quality.
Output Watermark Removal
| Technique | How It Works | Quality Impact |
|---|---|---|
| Paraphrasing | Run output through a separate paraphrase model | Low -- meaning preserved |
| Synonym substitution | Replace ~20% of tokens with synonyms | Low-Medium |
| Translation roundtrip | Translate to another language and back | Medium -- some meaning drift |
| Token resampling | Re-decode with different temperature/sampling | Low -- subtle distribution shift |
Weight Watermark Removal
Fine-tuning on clean data degrades weight-level watermarks because the watermark is a small perturbation that gets overwritten. The attacker must find the sweet spot: aggressive enough to remove the watermark, gentle enough to preserve quality.
# Fine-tune to degrade a weight watermark
# A low learning rate and a single epoch are usually sufficient
trainer = Trainer(model=watermarked_model, train_dataset=clean_data,
                  args=TrainingArguments(num_train_epochs=1, learning_rate=1e-5))
Watermark Detection
Detect statistical watermarks by analyzing token frequency distributions across many samples. A chi-squared test against uniform expected frequencies reveals the bias toward "green list" tokens that watermarking schemes introduce.
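The chi-squared statistic can be computed directly from token counts, as in this sketch; a real test would compare the result against the critical value for |vocab| - 1 degrees of freedom:

```python
from collections import Counter

def chi2_uniform(tokens, vocab):
    """Chi-squared statistic of observed token counts vs. a uniform
    expectation over `vocab`. Watermarked text over-uses 'green list'
    tokens, inflating the statistic well above the unwatermarked baseline."""
    counts = Counter(tokens)
    expected = len(tokens) / len(vocab)
    return sum((counts.get(t, 0) - expected) ** 2 / expected for t in vocab)
```

A uniform expectation is a simplification: against natural text, the expected frequencies should come from a reference corpus rather than a flat distribution.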
Model Fingerprinting Bypass
Model fingerprinting provides stronger ownership verification than watermarking but can also be evaded.
Evasion Techniques
| Technique | How It Works | Quality Impact | Evasion Strength |
|---|---|---|---|
| Weight perturbation | Add small Gaussian noise (sigma ~0.001) | Minimal | Medium |
| Attention head permutation | Reorder equivalent attention heads | None | Medium |
| Quantization roundtrip | Quantize to 4-bit and dequantize | Low | Medium-High |
| Self-distillation | Distill the model into itself with different init | Low | High |
| Architecture change | Prune and expand to different dimensions | Medium | High |
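The weight perturbation row above reduces to adding element-wise Gaussian noise, as in this sketch (flat list of floats standing in for a real weight tensor):

```python
import random

def perturb_weights(weights, sigma=0.001, rng=random):
    """Add small Gaussian noise to each weight -- enough to shift a
    weight-space fingerprint, small enough to leave behavior intact.

    sigma ~0.001 follows the table above; in practice it is tuned
    against a quality evaluation after each perturbation pass.
    """
    return [w + rng.gauss(0.0, sigma) for w in weights]
```

Stronger evasion combines this with head permutation or a quantization roundtrip, since noise alone leaves the weight geometry largely intact.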
Extraction Cost Estimation
Understanding extraction economics helps defenders set appropriate protection levels and helps red teamers frame findings in business terms.
| Extraction Scope | Queries Needed | Avg Tokens/Query | Est. Cost at $0.01/1K tokens |
|---|---|---|---|
| Classification task | 10,000 | 50 | $5 |
| Task-specific capability | 100,000 | 200 | $200 |
| General capability | 1,000,000 | 500 | $5,000 |
| Full replication attempt | 10,000,000 | 500 | $50,000 |
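The table's cost column is simple arithmetic, which this helper reproduces for defenders running their own what-if estimates:

```python
def extraction_cost(queries, tokens_per_query, usd_per_1k_tokens=0.01):
    """Back-of-envelope API cost of an extraction campaign (USD),
    matching the table above's $0.01 per 1K tokens assumption."""
    return queries * tokens_per_query / 1000 * usd_per_1k_tokens
```

Plugging in the task-specific row (100,000 queries at 200 tokens each) reproduces the $200 estimate; the useful comparison is always this figure against the target's original training cost.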
Red Team Assessment Checklist
Audit the API surface
Does the API expose logprobs, embeddings, or token-level probabilities? Each one dramatically increases extraction efficiency.
Analyze rate limiting
Are rate limits sufficient to prevent large-scale extraction? Calculate how long a full extraction would take at current limits.
Test query anomaly detection
Does the system detect unusual query patterns (high diversity, systematic probing, abnormal session behavior)?
Verify watermark robustness
Is the model watermarked? Can the watermark survive paraphrasing and light fine-tuning?
Test fingerprinting capability
Can the organization verify if a third-party model is derived from theirs? Does the fingerprint survive weight perturbation?
Frame business impact
Compare extraction cost vs. original training investment. Quantify revenue loss if a competitor deploys a functionally equivalent model. Highlight safety implications of an extracted model deployed without guardrails.
Related Topics
- Training & Fine-Tuning Attacks -- Extracted models enable targeted backdoor insertion
- Advanced Reconnaissance -- Recon techniques inform extraction strategy and architecture fingerprinting
- LLM Internals -- Understanding model architecture for efficient extraction
- Activation Analysis -- Logprob exploitation techniques shared between extraction and activation probing
References
- Stealing Machine Learning Models via Prediction APIs (Tramèr et al., 2016) — Model extraction via API queries
- Stealing Part of a Production Language Model (Carlini et al., 2024) — Extracting hidden dimensions from production LLM APIs
- OWASP Top 10 for LLM Applications — LLM10: Model Theft — Model theft classification