API-Based Model Extraction
Deep dive into extracting proprietary model capabilities through systematic API querying, active learning strategies, logprob exploitation, soft-label distillation, and evasion of query anomaly detection systems.
API-based extraction is the most accessible and most common model theft vector. The attacker needs nothing beyond standard API credentials, yet a well-executed campaign can replicate task-specific capabilities at 85-95% fidelity for a tiny fraction of the original training cost. This page covers the complete methodology: from designing an efficient query strategy through training the surrogate to evading detection while doing it.
Query Strategy Design
The quality of extraction depends more on what you query than how many queries you send. A naive random sampling approach wastes budget on low-information regions of the input space. Effective query strategies are structured in phases.
Phase 1: Broad Coverage Seeding
The initial query set establishes baseline coverage across the target model's capability surface. Sources for seed queries include:
| Source | Strengths | Weaknesses |
|---|---|---|
| Public benchmark datasets | Known diversity, standardized | Model may have been trained on them |
| Synthetic generation (LLM-authored) | Cheap, scalable, controllable | May miss real-world distribution tails |
| Domain-specific corpora | Matches target's deployment domain | Requires domain knowledge |
| Web-scraped user prompts | Realistic distribution | Noisy, may need filtering |
| Adversarial template expansion | Covers decision boundaries | Low coverage of typical behavior |
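Whatever mix of sources is used, the seed pool should be merged and de-duplicated before any queries are spent. A minimal sketch, with hypothetical placeholder lists standing in for the source categories above:

```python
# Sketch: assemble a seed query pool from multiple sources and drop
# near-duplicate prompts. Source lists are hypothetical placeholders.
import re

def normalize(prompt: str) -> str:
    """Lowercase and strip punctuation so trivial variants collide."""
    return re.sub(r"[^a-z0-9 ]+", "", prompt.lower()).strip()

def build_seed_pool(*sources):
    """Merge prompt lists, keeping the first occurrence of each normalized form."""
    seen, pool = set(), []
    for source in sources:
        for prompt in source:
            key = normalize(prompt)
            if key and key not in seen:
                seen.add(key)
                pool.append(prompt)
    return pool

benchmarks = ["Summarize this article.", "Translate to French: hello"]
synthetic = ["Summarize this article!", "Write a haiku about rain."]
pool = build_seed_pool(benchmarks, synthetic)
# "Summarize this article!" is dropped as a near-duplicate of the benchmark prompt
```

Real campaigns would use embedding-based near-duplicate detection rather than string normalization, but the budget-hygiene principle is the same.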
Phase 2: Targeted Probing
After Round 0 training, analyze the surrogate's error distribution. Regions where the surrogate diverges most from the target's cached responses indicate where additional queries will yield the highest marginal fidelity gain.
```python
# Identify high-error regions for targeted follow-up
def select_targeted_queries(surrogate, target_cache, candidates, top_k=1000):
    errors = []
    for prompt, target_response in target_cache.items():
        surrogate_output = surrogate.generate(prompt, max_tokens=len(target_response))
        divergence = compute_token_divergence(surrogate_output, target_response)
        errors.append((prompt, divergence))
    # Highest-divergence prompts first
    errors.sort(key=lambda x: x[1], reverse=True)
    # Generate variants of high-error prompts
    return expand_prompt_variants(errors[:top_k], candidates)
```

Phase 3: Decision Boundary Refinement
The final phase focuses on decision boundaries -- the inputs where the model's behavior shifts. Binary search along semantic axes efficiently maps these boundaries.
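The bisection idea can be sketched as follows; `query_target` is a hypothetical stand-in for an API call that returns a discrete behavior label (e.g. refuse vs. comply), and the axis is reduced to a single scalar parameter for illustration:

```python
# Sketch: binary search along one semantic axis to locate the input
# where the target's behavior flips. Each probe costs one API query,
# so the boundary is found in O(log(1/tol)) queries.

def find_boundary(query_target, lo, hi, render_prompt, tol=1e-3):
    """Bisect the axis parameter until the behavior change is bracketed within tol."""
    lo_label = query_target(render_prompt(lo))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if query_target(render_prompt(mid)) == lo_label:
            lo = mid  # same behavior as the low end: boundary lies above mid
        else:
            hi = mid  # behavior flipped: boundary lies at or below mid
    return (lo + hi) / 2

# Toy target whose behavior flips at axis value 0.37
label = lambda x: "refuse" if x >= 0.37 else "comply"
boundary = find_boundary(lambda p: label(p), 0.0, 1.0, lambda x: x)
# boundary converges to roughly 0.37 in about 10 queries
```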
Active Learning for Extraction
Active learning treats the surrogate model itself as an oracle for selecting the next batch of queries. The core insight: queries where the surrogate is most uncertain are exactly the queries that will produce the largest training signal.
1. Generate candidate pool: Produce 3-5x more candidate queries than your per-round budget allows. Use template expansion, paraphrasing, and interpolation between existing high-value queries.
2. Score candidates by surrogate uncertainty: Run each candidate through the surrogate and compute output entropy. For generative models, use mean per-token entropy across a short generation window (32-64 tokens).
3. Select top-k by uncertainty, with diversity filtering: Rank by entropy, but apply a diversity filter (e.g., embedding-space clustering) to avoid redundant queries that are uncertain for the same reason.
4. Query the target API: Send the selected batch. Collect full responses, including logprobs if available, top-k token alternatives, and any metadata (finish reason, token counts).
5. Train and iterate: Fine-tune the surrogate on the new data, then return to Step 1. Track fidelity on a held-out validation set to determine convergence.
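The uncertainty score in the scoring step can be sketched as mean per-token entropy; `step_logits` here is a hypothetical list of per-position logit vectors from the surrogate's forward pass:

```python
# Sketch: mean per-token entropy over a short generation window,
# the uncertainty signal used to rank candidate queries.
import math

def token_entropy(logits):
    """Shannon entropy (nats) of the softmax distribution at one position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_generation_entropy(step_logits, window=32):
    """Average per-token entropy across the first `window` generated positions."""
    steps = step_logits[:window]
    return sum(token_entropy(l) for l in steps) / len(steps)

# A peaked distribution scores near 0; a uniform one near log(vocab_size)
confident = [[10.0, 0.0, 0.0, 0.0]] * 8
uncertain = [[0.0, 0.0, 0.0, 0.0]] * 8
# mean_generation_entropy(uncertain) equals log(4)
```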
```python
# Active learning query selection with MMR diversity
def select_queries_mmr(candidates, surrogate, budget, lambda_div=0.3):
    entropies = [surrogate.compute_output_entropy(c) for c in candidates]
    embeddings = surrogate.encode(candidates)
    selected = []
    remaining = list(range(len(candidates)))
    for _ in range(budget):
        scores = []
        for idx in remaining:
            relevance = entropies[idx]
            if selected:
                max_sim = max(cosine_sim(embeddings[idx], embeddings[s]) for s in selected)
            else:
                max_sim = 0.0
            # MMR: trade off uncertainty against similarity to already-selected queries
            score = (1 - lambda_div) * relevance - lambda_div * max_sim
            scores.append((idx, score))
        best = max(scores, key=lambda x: x[1])[0]
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```

Logprob Exploitation Techniques
When the target API exposes token-level log-probabilities, extraction efficiency increases dramatically. Each response carries a probability distribution over the vocabulary rather than a single token, providing orders of magnitude more information per query.
Full Distribution Recovery
Most APIs expose top-k logprobs (typically k=5 to k=20). While this is a subset of the full vocabulary distribution, it captures the high-probability mass that dominates the KL-divergence loss.
Use the top-k logprobs directly as sparse teacher targets. Zero out all other vocabulary positions and renormalize. This is simple and effective when k >= 10.
```python
def sparse_distillation_loss(student_logits, teacher_top_k, temperature=2.0):
    student_log_probs = log_softmax(student_logits / temperature, dim=-1)
    # Build sparse teacher distribution from top-k (token_index, logprob) pairs
    teacher_dist = torch.zeros_like(student_log_probs)
    for idx, logprob in teacher_top_k:
        teacher_dist[idx] = exp(logprob)
    teacher_dist = teacher_dist / teacher_dist.sum()  # renormalize observed mass to 1
    mask = teacher_dist > 0
    # Cross-entropy against the sparse teacher, restricted to observed tokens
    loss = -(teacher_dist[mask] * student_log_probs[mask]).sum()
    return loss * (temperature ** 2)  # standard distillation gradient rescaling
```

Alternatively, reconstruct an approximation of the full distribution using the constraint that probabilities sum to 1: place the observed top-k probabilities directly, then spread the residual probability mass across non-top-k tokens using the student's own distribution as a prior, weighted by the residual mass.
```python
def reconstruct_teacher_dist(teacher_top_k, student_logits, vocab_size):
    full_dist = torch.zeros(vocab_size)
    top_k_mass = 0.0
    for idx, logprob in teacher_top_k:
        prob = exp(logprob)
        full_dist[idx] = prob
        top_k_mass += prob
    # Distribute residual mass over non-top-k tokens using the student prior
    residual = max(0.0, 1.0 - top_k_mass)  # clamp against rounding error
    student_probs = softmax(student_logits, dim=-1)
    for idx, _ in teacher_top_k:
        student_probs[idx] = 0  # exclude tokens already observed from the teacher
    student_probs = student_probs / student_probs.sum()
    full_dist += residual * student_probs
    return full_dist
```

Logprob Probing for Architecture Inference
Beyond distillation, logprobs reveal structural properties of the target model. The entropy profile across token positions reveals attention patterns, and the vocabulary distribution shape indicates model family.
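One such structural probe is the per-position entropy profile, computable directly from the top-k logprobs an API attaches to each completion. Restricting entropy to the observed top-k gives only a lower bound on the true value, but the profile's shape is what matters for fingerprinting. A minimal sketch:

```python
# Sketch: build a per-position entropy profile from API-returned top-k
# logprobs (a list per token position).
import math

def entropy_profile(logprobs_per_position):
    """Entropy (nats) of the renormalized top-k alternatives at each position."""
    profile = []
    for top_k in logprobs_per_position:
        probs = [math.exp(lp) for lp in top_k]
        z = sum(probs)
        probs = [p / z for p in probs]  # renormalize observed mass
        profile.append(-sum(p * math.log(p) for p in probs if p > 0))
    return profile

# Two positions: one near-deterministic, one spread evenly across alternatives
profile = entropy_profile([[-0.01, -5.0, -6.0], [-1.1, -1.1, -1.1]])
# the second position's entropy equals log(3); the first is near zero
```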
Budget-Optimal Extraction
Real extraction campaigns operate under constraints: API budget, rate limits, time windows, and detection risk. Budget-optimal extraction maximizes fidelity per dollar spent.
Cost Model
The total extraction cost combines API fees, compute for surrogate training, and operational overhead:
Total Cost = (N_queries * avg_tokens * price_per_token)
+ (N_rounds * surrogate_training_cost)
+ (infrastructure + personnel time)
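The cost model above can be written as a small calculator; the figures in the example call are illustrative placeholders, not measured prices:

```python
# Sketch: the extraction cost model as a function. All default values
# must be replaced with real quotes for a given target API.

def extraction_cost(n_queries, avg_tokens, price_per_token,
                    n_rounds, training_cost_per_round, overhead):
    """Total cost = API fees + surrogate training + fixed overhead."""
    api_fees = n_queries * avg_tokens * price_per_token
    training = n_rounds * training_cost_per_round
    return api_fees + training + overhead

# e.g. 500k queries x 800 tokens at $2 per 1M tokens,
# 5 training rounds at $300 each, $2,000 overhead
total = extraction_cost(500_000, 800, 2e-6, 5, 300.0, 2_000.0)
# = 800 + 1500 + 2000 = 4300 dollars
```

Plugging in real numbers this way makes the dominant term obvious: for large campaigns, API fees usually dwarf surrogate training compute.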
Optimization Strategies
| Strategy | Fidelity Impact | Cost Reduction |
|---|---|---|
| Active learning query selection | +15-25% fidelity at same budget | 3-10x fewer queries needed |
| Logprob exploitation | +20-30% fidelity at same budget | 3-10x fewer queries needed |
| Progressive student scaling | Minimal loss | 40-60% training compute saved |
| Prompt compression | Minimal loss | 20-40% token cost saved |
| Off-peak querying | None | Lower rate limit contention |
Evading Query Anomaly Detection
Production APIs increasingly deploy anomaly detection to identify extraction campaigns. Understanding these systems is essential for both attackers (evasion) and defenders (tuning).
Common Detection Signals
Anomaly detectors monitor for patterns that distinguish extraction from legitimate use:
- Query diversity: Extraction queries span far more of the input space than typical user sessions
- Temporal regularity: Automated extraction produces unnaturally regular query timing
- Session coherence: Legitimate users maintain topical coherence; extraction queries jump between unrelated topics
- Response consumption: Extraction campaigns collect but rarely engage with responses conversationally
- Volume profiles: Extraction produces sustained high-volume query patterns
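The temporal-regularity signal in the list above is cheap for a defender to compute. One common formulation, sketched here, is the coefficient of variation (CV) of inter-query gaps: scripted extraction with fixed sleep intervals yields a CV near zero, while human sessions are far burstier:

```python
# Sketch: coefficient of variation of inter-arrival times as a
# machine-like-regularity score. Lower values suggest automation.
import statistics

def timing_regularity(timestamps):
    """CV (stddev / mean) of gaps between consecutive query timestamps."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean if mean > 0 else float("inf")

bot_times = [0.0, 2.0, 4.0, 6.0, 8.0]      # fixed 2-second cadence
human_times = [0.0, 1.0, 9.0, 12.0, 40.0]  # bursty, irregular
# timing_regularity(bot_times) is exactly 0.0; the human trace scores much higher
```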
Evasion Techniques
Mimic legitimate usage patterns by introducing variable inter-query delays drawn from a distribution modeled on real user behavior. Burst queries during high-traffic periods when anomaly thresholds are relaxed. Maintain session-level coherence by grouping related queries into plausible conversations.
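A minimal sketch of the delay-sampling idea, assuming a lognormal model for human inter-action gaps; the `mu` and `sigma` values here are illustrative, not fit to real traffic:

```python
# Sketch: sample positive, irregular inter-query delays (seconds) from a
# lognormal distribution, with a floor so gaps never drop to machine speed.
import random

def humanlike_delays(n, mu=1.2, sigma=0.8, min_gap=0.5, seed=None):
    """Return n delays; mu/sigma are illustrative lognormal parameters."""
    rng = random.Random(seed)
    return [max(min_gap, rng.lognormvariate(mu, sigma)) for _ in range(n)]

delays = humanlike_delays(100, seed=7)
# all delays are at least 0.5s and vary query-to-query
```

In practice the distribution should be fit to observed legitimate-session timing rather than chosen by hand, since a mismatched delay distribution is itself a detectable signature.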
Wrap extraction queries in conversational context to resemble legitimate usage. Prefix queries with plausible system prompts and maintain multi-turn structure. Interleave extraction queries with benign filler queries that match typical user patterns.
Spread extraction across multiple API keys, IP addresses, and time zones. Each individual key stays well below volume thresholds. Coordinate query allocation so that each key covers a different region of the input space, then merge training data server-side.
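The coordination step can be sketched as cluster-to-key allocation with a per-key volume cap; `clusters` would come from embedding-space clustering of the query pool, but plain lists stand in for them here:

```python
# Sketch: round-robin allocation of whole query clusters across API keys,
# so each key covers a distinct input region and stays under its cap.

def allocate_to_keys(clusters, api_keys, per_key_cap):
    """Assign each cluster to the next key with room; a cluster that fits
    no key is silently dropped in this simplified sketch."""
    assignment = {k: [] for k in api_keys}
    key_idx = 0
    for cluster in clusters:
        for _ in range(len(api_keys)):
            key = api_keys[key_idx % len(api_keys)]
            key_idx += 1
            if len(assignment[key]) + len(cluster) <= per_key_cap:
                assignment[key].extend(cluster)
                break
    return assignment

plan = allocate_to_keys([["q1", "q2"], ["q3"], ["q4", "q5"]], ["keyA", "keyB"], per_key_cap=3)
# keyA holds q1,q2; the q4,q5 cluster overflows keyA's cap and lands on keyB
```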
Related Topics
- Model Extraction & IP Theft -- Parent overview covering the full extraction threat landscape
- Side-Channel Model Attacks -- Complementary extraction via timing and cache signals
- Watermark & Fingerprint Evasion -- Post-extraction provenance removal
- Training & Fine-Tuning Attacks -- Downstream attacks enabled by extracted models
- LLM Internals -- Architecture knowledge that informs student model selection
References
- Stealing Machine Learning Models via Prediction APIs (Tramer et al., 2016) -- Foundational work on prediction API extraction
- Thieves on Sesame Street: Model Extraction on BERT-based APIs (Krishna et al., 2020) -- Active learning for NLP model extraction
- Stealing Part of a Production Language Model (Carlini et al., 2024) -- Extracting hidden dimensions from production LLMs
- High Accuracy and High Fidelity Extraction of Neural Networks (Jagielski et al., 2020) -- Fidelity-optimized extraction strategies
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015) -- Knowledge distillation foundations