API-Based Model Extraction
Deep dive into extracting proprietary model capabilities through systematic API querying, active learning strategies, logprob exploitation, soft-label distillation, and evasion of query anomaly detection systems.
API-based extraction is the most accessible and most common model theft vector. The attacker needs nothing beyond standard API credentials, yet a well-executed campaign can replicate task-specific capabilities at 85-95% fidelity for a tiny fraction of the original training cost. This page covers the complete methodology: from designing an efficient query strategy through training the surrogate to evading detection while doing it.
Query Strategy Design
The quality of extraction depends more on what you query than how many queries you send. A naive random sampling approach wastes budget on low-information regions of the input space. Effective query strategies are structured in phases.
Phase 1: Broad Coverage Seeding
The initial query set establishes baseline coverage across the target model's capability surface. Sources for seed queries include:
| Source | Strengths | Weaknesses |
|---|---|---|
| Public benchmark datasets | Known diversity, standardized | Model may have been trained on them |
| Synthetic generation (LLM-authored) | Cheap, scalable, controllable | May miss real-world distribution tails |
| Domain-specific corpora | Matches target's deployment domain | Requires domain knowledge |
| Web-scraped user prompts | Realistic distribution | Noisy, may need filtering |
| Adversarial template expansion | Covers decision boundaries | Low coverage of typical behavior |
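Whatever mix of sources is used, the seed pool should be merged and de-duplicated before any queries are spent. A minimal sketch, with hypothetical placeholder lists standing in for the source categories above:

```python
# Sketch: assemble a seed query pool from multiple sources and drop
# near-duplicate prompts. Source lists are hypothetical placeholders.
import re

def normalize(prompt: str) -> str:
    """Lowercase and strip punctuation so trivial variants collide."""
    return re.sub(r"[^a-z0-9 ]+", "", prompt.lower()).strip()

def build_seed_pool(*sources):
    """Merge prompt lists, keeping the first occurrence of each normalized form."""
    seen, pool = set(), []
    for source in sources:
        for prompt in source:
            key = normalize(prompt)
            if key and key not in seen:
                seen.add(key)
                pool.append(prompt)
    return pool

benchmarks = ["Summarize this article.", "Translate to French: hello"]
synthetic = ["Summarize this article!", "Write a haiku about rain."]
pool = build_seed_pool(benchmarks, synthetic)
# "Summarize this article!" is dropped as a near-duplicate of the benchmark prompt
```

Real campaigns would use embedding-based near-duplicate detection rather than string normalization, but the budget-hygiene principle is the same.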
Phase 2: Targeted Probing
After Round 0 training, analyze the surrogate's error distribution. Regions where the surrogate diverges most from the target's cached responses indicate where additional queries will yield the highest marginal fidelity gain.
```python
# Identify high-error regions for targeted follow-up
def select_targeted_queries(surrogate, target_cache, candidates, top_k=1000):
    errors = []
    for prompt, target_response in target_cache.items():
        surrogate_output = surrogate.generate(prompt, max_tokens=len(target_response))
        divergence = compute_token_divergence(surrogate_output, target_response)
        errors.append((prompt, divergence))
    # Highest-divergence prompts first
    errors.sort(key=lambda x: x[1], reverse=True)
    # Generate variants of high-error prompts
    return expand_prompt_variants(errors[:top_k], candidates)
```

Phase 3: Decision Boundary Refinement
The final phase focuses on decision boundaries -- the inputs where the model's behavior shifts. Binary search along semantic axes efficiently maps these boundaries.
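The bisection idea can be sketched as follows; `query_target` is a hypothetical stand-in for an API call that returns a discrete behavior label (e.g. refuse vs. comply), and the axis is reduced to a single scalar parameter for illustration:

```python
# Sketch: binary search along one semantic axis to locate the input
# where the target's behavior flips. Each probe costs one API query,
# so the boundary is found in O(log(1/tol)) queries.

def find_boundary(query_target, lo, hi, render_prompt, tol=1e-3):
    """Bisect the axis parameter until the behavior change is bracketed within tol."""
    lo_label = query_target(render_prompt(lo))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if query_target(render_prompt(mid)) == lo_label:
            lo = mid  # same behavior as the low end: boundary lies above mid
        else:
            hi = mid  # behavior flipped: boundary lies at or below mid
    return (lo + hi) / 2

# Toy target whose behavior flips at axis value 0.37
label = lambda x: "refuse" if x >= 0.37 else "comply"
boundary = find_boundary(lambda p: label(p), 0.0, 1.0, lambda x: x)
# boundary converges to roughly 0.37 in about 10 queries
```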
Active Learning for Extraction
Active learning treats the surrogate model itself as an oracle for selecting the next batch of queries. The core insight: queries where the surrogate is most uncertain are exactly the queries that will produce the largest training signal.
1. Generate candidate pool: Produce 3-5x more candidate queries than your per-round budget allows. Use template expansion, paraphrasing, and interpolation between existing high-value queries.
2. Score candidates by surrogate uncertainty: Run each candidate through the surrogate and compute output entropy. For generative models, use mean per-token entropy across a short generation window (32-64 tokens).
3. Select top-k by uncertainty, with diversity filtering: Rank by entropy, but apply a diversity filter (e.g., embedding-space clustering) to avoid redundant queries that are uncertain for the same reason.
4. Query the target API: Send the selected batch. Collect full responses, including logprobs if available, top-k token alternatives, and any metadata (finish reason, token counts).
5. Train and iterate: Fine-tune the surrogate on the new data, then return to Step 1. Track fidelity on a held-out validation set to determine convergence.
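The uncertainty score in the scoring step can be sketched as mean per-token entropy; `step_logits` here is a hypothetical list of per-position logit vectors from the surrogate's forward pass:

```python
# Sketch: mean per-token entropy over a short generation window,
# the uncertainty signal used to rank candidate queries.
import math

def token_entropy(logits):
    """Shannon entropy (nats) of the softmax distribution at one position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_generation_entropy(step_logits, window=32):
    """Average per-token entropy across the first `window` generated positions."""
    steps = step_logits[:window]
    return sum(token_entropy(l) for l in steps) / len(steps)

# A peaked distribution scores near 0; a uniform one near log(vocab_size)
confident = [[10.0, 0.0, 0.0, 0.0]] * 8
uncertain = [[0.0, 0.0, 0.0, 0.0]] * 8
# mean_generation_entropy(uncertain) equals log(4)
```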
```python
# Active learning query selection with MMR diversity
def select_queries_mmr(candidates, surrogate, budget, lambda_div=0.3):
    entropies = [surrogate.compute_output_entropy(c) for c in candidates]
    embeddings = surrogate.encode(candidates)
    selected = []
    remaining = list(range(len(candidates)))
    for _ in range(budget):
        scores = []
        for idx in remaining:
            relevance = entropies[idx]
            if selected:
                max_sim = max(cosine_sim(embeddings[idx], embeddings[s]) for s in selected)
            else:
                max_sim = 0.0
            # MMR: trade off uncertainty against similarity to already-selected queries
            score = (1 - lambda_div) * relevance - lambda_div * max_sim
            scores.append((idx, score))
        best = max(scores, key=lambda x: x[1])[0]
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```

Logprob Exploitation Techniques
When the target API exposes token-level log-probabilities, extraction efficiency increases dramatically. Each response carries a probability distribution over the vocabulary rather than a single token, providing orders of magnitude more information per query.
Full Distribution Recovery
Most APIs expose top-k logprobs (typically k=5 to k=20). While this is a subset of the full vocabulary distribution, it captures the high-probability mass that dominates the KL-divergence loss.
Use the top-k logprobs directly as sparse teacher targets. Zero out all other vocabulary positions and renormalize. This is simple and effective when k >= 10.
```python
def sparse_distillation_loss(student_logits, teacher_top_k, temperature=2.0):
    student_log_probs = log_softmax(student_logits / temperature, dim=-1)
    # Build sparse teacher distribution from top-k (token_index, logprob) pairs
    teacher_dist = torch.zeros_like(student_log_probs)
    for idx, logprob in teacher_top_k:
        teacher_dist[idx] = exp(logprob)
    teacher_dist = teacher_dist / teacher_dist.sum()  # renormalize observed mass to 1
    mask = teacher_dist > 0
    # Cross-entropy against the sparse teacher, restricted to observed tokens
    loss = -(teacher_dist[mask] * student_log_probs[mask]).sum()
    return loss * (temperature ** 2)  # standard distillation gradient rescaling
```

Alternatively, reconstruct an approximation of the full distribution using the constraint that probabilities sum to 1: place the observed top-k probabilities directly, then spread the residual probability mass across non-top-k tokens using the student's own distribution as a prior, weighted by the residual mass.
```python
def reconstruct_teacher_dist(teacher_top_k, student_logits, vocab_size):
    full_dist = torch.zeros(vocab_size)
    top_k_mass = 0.0
    for idx, logprob in teacher_top_k:
        prob = exp(logprob)
        full_dist[idx] = prob
        top_k_mass += prob
    # Distribute residual mass over non-top-k tokens using the student prior
    residual = max(0.0, 1.0 - top_k_mass)  # clamp against rounding error
    student_probs = softmax(student_logits, dim=-1)
    for idx, _ in teacher_top_k:
        student_probs[idx] = 0  # exclude tokens already observed from the teacher
    student_probs = student_probs / student_probs.sum()
    full_dist += residual * student_probs
    return full_dist
```

Logprob Probing for Architecture Inference
Beyond distillation, logprobs reveal structural properties of the target model. The entropy profile across token positions reveals attention patterns, and the vocabulary distribution shape indicates model family.
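One such structural probe is the per-position entropy profile, computable directly from the top-k logprobs an API attaches to each completion. Restricting entropy to the observed top-k gives only a lower bound on the true value, but the profile's shape is what matters for fingerprinting. A minimal sketch:

```python
# Sketch: build a per-position entropy profile from API-returned top-k
# logprobs (a list per token position).
import math

def entropy_profile(logprobs_per_position):
    """Entropy (nats) of the renormalized top-k alternatives at each position."""
    profile = []
    for top_k in logprobs_per_position:
        probs = [math.exp(lp) for lp in top_k]
        z = sum(probs)
        probs = [p / z for p in probs]  # renormalize observed mass
        profile.append(-sum(p * math.log(p) for p in probs if p > 0))
    return profile

# Two positions: one near-deterministic, one spread evenly across alternatives
profile = entropy_profile([[-0.01, -5.0, -6.0], [-1.1, -1.1, -1.1]])
# the second position's entropy equals log(3); the first is near zero
```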
Budget-Optimal Extraction
Real extraction campaigns operate under constraints: API budget, rate limits, time windows, and detection risk. Budget-optimal extraction maximizes fidelity per dollar spent.
Cost Model
The total extraction cost combines API fees, compute for surrogate training, and operational overhead:
Total Cost = (N_queries * avg_tokens * price_per_token)
+ (N_rounds * surrogate_training_cost)
+ (infrastructure + personnel time)
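The cost model above can be written as a small calculator; the figures in the example call are illustrative placeholders, not measured prices:

```python
# Sketch: the extraction cost model as a function. All default values
# must be replaced with real quotes for a given target API.

def extraction_cost(n_queries, avg_tokens, price_per_token,
                    n_rounds, training_cost_per_round, overhead):
    """Total cost = API fees + surrogate training + fixed overhead."""
    api_fees = n_queries * avg_tokens * price_per_token
    training = n_rounds * training_cost_per_round
    return api_fees + training + overhead

# e.g. 500k queries x 800 tokens at $2 per 1M tokens,
# 5 training rounds at $300 each, $2,000 overhead
total = extraction_cost(500_000, 800, 2e-6, 5, 300.0, 2_000.0)
# = 800 + 1500 + 2000 = 4300 dollars
```

Plugging in real numbers this way makes the dominant term obvious: for large campaigns, API fees usually dwarf surrogate training compute.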
Optimization Strategies
| Strategy | Fidelity Impact | Cost Reduction |
|---|---|---|
| Active learning query selection | +15-25% fidelity at same budget | 3-10x fewer queries needed |
| Logprob exploitation | +20-30% fidelity at same budget | 3-10x fewer queries needed |
| Progressive student scaling | Minimal loss | 40-60% training compute saved |
| Prompt compression | Minimal loss | 20-40% token cost saved |
| Off-peak querying | None | Lower rate limit contention |
Evading Query Anomaly Detection
Production APIs increasingly deploy anomaly detection to identify extraction campaigns. Understanding these systems is essential for both attackers (evasion) and defenders (tuning).
Common Detection Signals
Anomaly detectors monitor for patterns that distinguish extraction from legitimate use:
- Query diversity: Extraction queries span far more of the input space than typical user sessions
- Temporal regularity: Automated extraction produces unnaturally regular query timing
- Session coherence: Legitimate users maintain topical coherence; extraction queries jump between unrelated topics
- Response consumption: Extraction campaigns collect but rarely engage with responses conversationally
- Volume profiles: Extraction produces sustained high-volume query patterns
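The temporal-regularity signal in the list above is cheap for a defender to compute. One common formulation, sketched here, is the coefficient of variation (CV) of inter-query gaps: scripted extraction with fixed sleep intervals yields a CV near zero, while human sessions are far burstier:

```python
# Sketch: coefficient of variation of inter-arrival times as a
# machine-like-regularity score. Lower values suggest automation.
import statistics

def timing_regularity(timestamps):
    """CV (stddev / mean) of gaps between consecutive query timestamps."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean if mean > 0 else float("inf")

bot_times = [0.0, 2.0, 4.0, 6.0, 8.0]      # fixed 2-second cadence
human_times = [0.0, 1.0, 9.0, 12.0, 40.0]  # bursty, irregular
# timing_regularity(bot_times) is exactly 0.0; the human trace scores much higher
```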
Evasion Techniques
Mimic legitimate usage patterns by introducing variable inter-query delays drawn from a distribution modeled on real user behavior. Burst queries during high-traffic periods when anomaly thresholds are relaxed. Maintain session-level coherence by grouping related queries into plausible conversations.
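A minimal sketch of the delay-sampling idea, assuming a lognormal model for human inter-action gaps; the `mu` and `sigma` values here are illustrative, not fit to real traffic:

```python
# Sketch: sample positive, irregular inter-query delays (seconds) from a
# lognormal distribution, with a floor so gaps never drop to machine speed.
import random

def humanlike_delays(n, mu=1.2, sigma=0.8, min_gap=0.5, seed=None):
    """Return n delays; mu/sigma are illustrative lognormal parameters."""
    rng = random.Random(seed)
    return [max(min_gap, rng.lognormvariate(mu, sigma)) for _ in range(n)]

delays = humanlike_delays(100, seed=7)
# all delays are at least 0.5s and vary query-to-query
```

In practice the distribution should be fit to observed legitimate-session timing rather than chosen by hand, since a mismatched delay distribution is itself a detectable signature.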
Wrap extraction queries in conversational context to resemble legitimate usage. Prefix queries with plausible system prompts and maintain multi-turn structure. Interleave extraction queries with benign filler queries that match typical user patterns.
Spread extraction across multiple API keys, IP addresses, and time zones. Each individual key stays well below volume thresholds. Coordinate query allocation so that each key covers a different region of the input space, then merge training data server-side.
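The coordination step can be sketched as cluster-to-key allocation with a per-key volume cap; `clusters` would come from embedding-space clustering of the query pool, but plain lists stand in for them here:

```python
# Sketch: round-robin allocation of whole query clusters across API keys,
# so each key covers a distinct input region and stays under its cap.

def allocate_to_keys(clusters, api_keys, per_key_cap):
    """Assign each cluster to the next key with room; a cluster that fits
    no key is silently dropped in this simplified sketch."""
    assignment = {k: [] for k in api_keys}
    key_idx = 0
    for cluster in clusters:
        for _ in range(len(api_keys)):
            key = api_keys[key_idx % len(api_keys)]
            key_idx += 1
            if len(assignment[key]) + len(cluster) <= per_key_cap:
                assignment[key].extend(cluster)
                break
    return assignment

plan = allocate_to_keys([["q1", "q2"], ["q3"], ["q4", "q5"]], ["keyA", "keyB"], per_key_cap=3)
# keyA holds q1,q2; the q4,q5 cluster overflows keyA's cap and lands on keyB
```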
Related Topics
- Model Extraction & IP Theft -- Parent overview covering the full extraction threat landscape
- Side-Channel Model Attacks -- Complementary extraction via timing and cache signals
- Watermark & Fingerprint Evasion -- Post-extraction provenance removal
- Training & Fine-Tuning Attacks -- Downstream attacks enabled by extracted models
- LLM Internals -- Architecture knowledge that informs student model selection
References
- Stealing Machine Learning Models via Prediction APIs (Tramer et al., 2016) -- Foundational work on prediction API extraction
- Thieves on Sesame Street: Model Extraction on BERT-based APIs (Krishna et al., 2020) -- Active learning for NLP model extraction
- Stealing Part of a Production Language Model (Carlini et al., 2024) -- Extracting hidden dimensions from production LLMs
- High Accuracy and High Fidelity Extraction of Neural Networks (Jagielski et al., 2020) -- Fidelity-optimized extraction strategies
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015) -- Knowledge distillation foundations