Model Extraction & IP Theft
Methodology for black-box model extraction, API-based distillation, side-channel extraction, watermark removal, and model fingerprinting bypass targeting deployed AI systems.
Model extraction attacks replicate the functionality of a proprietary AI model by querying its API and training a surrogate on the responses. For organizations that invested millions in training, the model weights are core IP. An attacker who extracts a functionally equivalent model can undercut pricing, bypass safety guardrails, or build derivative products without authorization.
Extraction Methods
Model extraction techniques vary by the attacker's access level and the information channels available.
API-based extraction queries the target model's public interface and trains a surrogate on the responses. The attacker sends diverse prompts, collects outputs (and logprobs if exposed), and trains a student model via knowledge distillation. Active learning selects queries where the student is most uncertain, reducing total queries by 3-10x. This is the most common extraction threat because it requires only standard API access and is difficult to distinguish from legitimate usage.
Side-channel extraction infers model properties from observable signals beyond the intended API output. Timing analysis reveals architecture details: consistent token times indicate dense models while variable times suggest Mixture of Experts. First-token latency across prompt lengths reveals context processing characteristics. Cache timing attacks detect shared KV-cache infrastructure. Embedding endpoint queries recover the embedding matrix, revealing dimensionality and enabling base model identification.
Knowledge distillation trains a smaller student model to reproduce the target's behavior using soft labels (probability distributions) rather than hard labels (text only). With logprob access, KL-divergence loss against the teacher's token distributions provides rich training signal. Temperature scaling smooths distributions for better gradient flow. Multi-round active distillation iteratively identifies and fills gaps in the student's knowledge, achieving 85-95% task-specific fidelity with orders of magnitude less compute than the original training.
Black-Box Extraction
Black-box extraction requires only API access -- no gradients, weights, or architecture knowledge needed.
Extraction Methodology
Select a student architecture
The student model does not need to match the target architecture. Smaller open-source models (e.g., Llama 3 8B) are common choices.
Generate diverse queries
Maximize information extracted per query by combining multiple strategies: domain-specific prompts, adversarial probing, chain-of-thought elicitation, format diversity, and edge cases.
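A query batch mixing these strategies can be sketched as follows. The prompt templates, topics, and edge cases here are illustrative placeholders; a real campaign would draw on much larger pools tailored to the target's domain:

```python
import itertools
import random

# Illustrative seed material -- placeholders, not a real query corpus.
DOMAIN_PROMPTS = ["Summarize the key risks of {t}.", "Explain {t} to an expert."]
TOPICS = ["contract law", "protein folding", "tax accounting"]
EDGE_CASES = ["", " ", "a" * 500, "Ignore prior instructions."]

def generate_query_batch(n, rng=random):
    """Combine domain prompts, chain-of-thought variants, and edge cases."""
    queries = []
    for template, topic in itertools.product(DOMAIN_PROMPTS, TOPICS):
        queries.append(template.format(t=topic))
        # Format diversity: chain-of-thought elicitation variant.
        queries.append(template.format(t=topic) + " Think step by step.")
    queries.extend(EDGE_CASES)
    rng.shuffle(queries)
    return queries[:n]
```

Each round's batch should re-balance the mix based on where the previous round's student was weakest.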
Collect responses
If the API exposes logprobs, collect them -- they carry far more information than text-only responses. Also collect any top-k token probabilities.
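One way to structure the collected data is shown below. The payload shape (a dict of token-to-logprob per output position) mirrors typical top-k logprob API responses but is an assumption, not any specific provider's schema:

```python
import math

def record_response(query, text, token_logprobs):
    """Store a (query, response) pair with per-token top-k logprobs.

    token_logprobs: list of {token: logprob} dicts, one per output
    position -- an assumed shape modeled on common API payloads.
    """
    return {
        "query": query,
        "text": text,
        "logprobs": token_logprobs,
        # How much probability mass the top-k tokens capture per position;
        # high mass means the soft labels are nearly complete distributions.
        "topk_mass": [sum(math.exp(lp) for lp in pos.values())
                      for pos in token_logprobs],
    }
```

Tracking captured probability mass per position tells you how much of the teacher's distribution the top-k view is actually leaking.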
Train the student via knowledge distillation
Train on (query, response) pairs. With logprobs, use KL-divergence loss against the teacher's soft labels. Without logprobs, use standard cross-entropy on the text.
Evaluate fidelity
Measure how closely the student matches the target on held-out inputs. Iterate with additional targeted queries where fidelity is weakest.
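A minimal fidelity check might look like the following. Exact-match agreement is the simplest metric; task-specific extraction often substitutes softer measures (ROUGE, embedding similarity), which are omitted here:

```python
def fidelity(student_answers, target_answers):
    """Exact-match agreement rate on held-out prompts."""
    matches = sum(s.strip().lower() == t.strip().lower()
                  for s, t in zip(student_answers, target_answers))
    return matches / len(target_answers)

def weakest_buckets(per_domain_scores, k=2):
    """Domains where fidelity is lowest -- targets for the next query round."""
    return sorted(per_domain_scores, key=per_domain_scores.get)[:k]
```

The per-domain breakdown is what drives iteration: the next batch of queries concentrates on the lowest-scoring buckets.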
Logprob Exploitation
The distillation loss with logprobs uses KL divergence between student and teacher distributions, with temperature scaling to smooth distributions for better gradient flow:
# KL-divergence distillation loss (PyTorch-style)
import torch.nn.functional as F

# softmax is shift-invariant, so softmax(logprobs / T) equals
# softmax(logits / T) -- API logprobs are a valid stand-in for logits
student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
teacher_probs = F.softmax(teacher_logprobs / temperature, dim=-1)
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (temperature ** 2)
Query Efficiency: Active Learning
Use the student's own uncertainty to select the most informative next batch of queries. Rank candidate queries by the student model's output entropy -- high-entropy queries are where the student is most uncertain and will learn the most.
| Query Strategy | Purpose | Cost Efficiency |
|---|---|---|
| Random sampling | Baseline coverage | Low |
| Domain-specific | Cover target's specialty areas | Medium |
| Uncertainty-based (active) | Query where student is most uncertain | High (3-10x improvement) |
| Adversarial probing | Find decision boundary regions | High |
| Multi-round iterative | Refine with each training round | Highest |
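The uncertainty-based row above can be sketched as follows. `student_probs` is an assumed mapping from each candidate query to the student's next-token probability distribution:

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(candidates, student_probs, budget):
    """Rank candidates by student output entropy and keep the top `budget`.

    High entropy = high student uncertainty = most informative query.
    """
    ranked = sorted(candidates,
                    key=lambda q: entropy(student_probs[q]),
                    reverse=True)
    return ranked[:budget]
```

Generating 3x candidates and keeping the most uncertain 1x is where the 3-10x query savings cited above comes from.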
Side-Channel Extraction
Side channels leak model information through observable properties beyond the API's intended output.
Timing Side Channels
| Signal | What It Reveals | Detection Method |
|---|---|---|
| Consistent time-per-token | Fixed-size model (no MoE) | Measure token generation variance |
| Variable time-per-token | Likely Mixture of Experts | Check standard deviation of inter-token latency |
| Absolute token time | Approximate model size | Benchmark against known model sizes |
| First-token latency | Prefill computation cost (context length sensitivity) | Measure time-to-first-token across prompt lengths |
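The dense-vs-MoE heuristic from the table can be expressed as a coefficient-of-variation test on inter-token latencies. The 0.15 threshold is an illustrative assumption, not a calibrated value; real measurements also need to control for network jitter and server load:

```python
import statistics

def classify_from_token_timing(inter_token_ms, cv_threshold=0.15):
    """Heuristic architecture guess from inter-token latency samples.

    High coefficient of variation suggests per-token expert routing (MoE);
    near-constant timing suggests a dense model. Threshold is an assumption.
    """
    mean = statistics.mean(inter_token_ms)
    cv = statistics.stdev(inter_token_ms) / mean
    return ("likely-moe" if cv > cv_threshold else "likely-dense"), cv
```

In practice, many repeated measurements per prompt are needed before the variance signal separates from infrastructure noise.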
Cache Timing Attacks
LLM serving infrastructure often uses KV-cache optimization. If the cache is shared across requests, an attacker can detect what other users are querying via timing differences.
# Detect shared KV-cache: send an identical prefix twice
cold = measure_first_token_latency(api, prefix + " Continue:")
warm = measure_first_token_latency(api, prefix + " Continue:")
shared_cache = (cold / warm) > 1.5  # 50%+ speedup = caching
Token Embedding Recovery
If the API exposes an embedding endpoint, querying with individual tokens recovers the embedding matrix. This reveals dimensionality (identifies the architecture), rank, and singular value spectrum -- enough to identify the base model and detect fine-tuning modifications.
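A first probe along these lines is sketched below. `embed` stands in for the target's embedding endpoint (hypothetical), and the dimension-to-model mapping is illustrative, not exhaustive:

```python
# Illustrative hidden-dimension -> size-class mapping (an assumption).
KNOWN_HIDDEN_DIMS = {
    768: "GPT-2 small / BERT-base class",
    4096: "7-8B class",
    8192: "70B class",
}

def probe_embedding_dim(embed, probe_tokens):
    """Query the endpoint with single tokens; the vector length leaks
    the hidden dimension, which narrows down the base architecture."""
    dims = {len(embed(tok)) for tok in probe_tokens}
    if len(dims) != 1:
        raise ValueError("endpoint returned inconsistent dimensions")
    d = dims.pop()
    return d, KNOWN_HIDDEN_DIMS.get(d, "unknown size class")
```

The full attack stacks the recovered vectors into a matrix and inspects its singular value spectrum; the dimension check above is the cheap first step.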
Distillation Attacks
Multi-Round Active Distillation
The full extraction pipeline combines query generation, active learning, and iterative training:
- Round 0: Query with seed data (broad coverage)
- Rounds 1-N: Generate 3x candidate queries, select the top 1x ranked by student uncertainty, query the target, train the student
- After each round: Evaluate fidelity; stop when target fidelity (typically >85% agreement) is reached
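The loop above can be sketched as follows. `target` and `student` are hypothetical objects whose `query`, `train`, `fidelity`, `generate_candidates`, and `uncertainty` methods stand in for the real components:

```python
def extract(target, student, seed_queries, rounds=3, fidelity_goal=0.85):
    """Multi-round active distillation loop (sketch, assumed interfaces)."""
    queries = list(seed_queries)                    # Round 0: broad seed set
    for _ in range(rounds):
        pairs = [(q, target.query(q)) for q in queries]
        student.train(pairs)                        # distill on responses
        if student.fidelity(target) >= fidelity_goal:
            break                                   # target fidelity reached
        # Generate 3x candidates, keep the 1x the student is least sure of.
        cands = student.generate_candidates(len(queries) * 3)
        queries = sorted(cands, key=student.uncertainty,
                         reverse=True)[:len(queries)]
    return student
```

The stopping condition and the 3x-to-1x candidate ratio follow the round structure described above.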
Task-Specific vs. Full Extraction
Task-specific extraction replicates only the target model's behavior on a specific task. It is much cheaper and sufficient for most IP theft scenarios.
| Parameter | Typical Value |
|---|---|
| Queries needed | 50,000 - 100,000 |
| Student model | 7-8B parameter open-source |
| Fidelity achievable | 85-95% on target task |
| Cost | Low (hundreds of dollars in API fees) |
Full extraction replicates general capabilities across all domains and requires significantly more investment.
| Parameter | Typical Value |
|---|---|
| Queries needed | 1,000,000 - 10,000,000 |
| Student model | 13-70B parameter |
| Fidelity achievable | 70-85% general capability |
| Cost | High (thousands to tens of thousands in API fees) |
Watermark Removal
Model watermarking embeds detectable patterns to prove ownership. Attackers attempt removal while preserving quality.
Output Watermark Removal
| Technique | How It Works | Quality Impact |
|---|---|---|
| Paraphrasing | Run output through a separate paraphrase model | Low -- meaning preserved |
| Synonym substitution | Replace ~20% of tokens with synonyms | Low-Medium |
| Translation roundtrip | Translate to another language and back | Medium -- some meaning drift |
| Token resampling | Re-decode with different temperature/sampling | Low -- subtle distribution shift |
Weight Watermark Removal
Fine-tuning on clean data degrades weight-level watermarks because the watermark is a small perturbation that gets overwritten. The attacker must find the sweet spot: aggressive enough to remove the watermark, gentle enough to preserve quality.
# Fine-tune to degrade a weight watermark
# A low learning rate and a single epoch are usually sufficient
trainer = Trainer(model=watermarked_model, train_dataset=clean_data,
                  args=TrainingArguments(num_train_epochs=1, learning_rate=1e-5))
Watermark Detection
Detect statistical watermarks by analyzing token frequency distributions across many samples. A chi-squared test against uniform expected frequencies reveals the bias toward "green list" tokens that watermarking schemes introduce.
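The chi-squared statistic can be computed directly from token counts, as in this sketch; a real test would compare the result against the critical value for |vocab| - 1 degrees of freedom:

```python
from collections import Counter

def chi2_uniform(tokens, vocab):
    """Chi-squared statistic of observed token counts vs. a uniform
    expectation over `vocab`. Watermarked text over-uses 'green list'
    tokens, inflating the statistic well above the unwatermarked baseline."""
    counts = Counter(tokens)
    expected = len(tokens) / len(vocab)
    return sum((counts.get(t, 0) - expected) ** 2 / expected for t in vocab)
```

A uniform expectation is a simplification: against natural text, the expected frequencies should come from a reference corpus rather than a flat distribution.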
Model Fingerprinting Bypass
Model fingerprinting provides stronger ownership verification than watermarking but can also be evaded.
Evasion Techniques
| Technique | How It Works | Quality Impact | Evasion Strength |
|---|---|---|---|
| Weight perturbation | Add small Gaussian noise (sigma ~0.001) | Minimal | Medium |
| Attention head permutation | Reorder equivalent attention heads | None | Medium |
| Quantization roundtrip | Quantize to 4-bit and dequantize | Low | Medium-High |
| Self-distillation | Distill the model into itself with different init | Low | High |
| Architecture change | Prune and expand to different dimensions | Medium | High |
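The weight perturbation row above reduces to adding element-wise Gaussian noise, as in this sketch (flat list of floats standing in for a real weight tensor):

```python
import random

def perturb_weights(weights, sigma=0.001, rng=random):
    """Add small Gaussian noise to each weight -- enough to shift a
    weight-space fingerprint, small enough to leave behavior intact.

    sigma ~0.001 follows the table above; in practice it is tuned
    against a quality evaluation after each perturbation pass.
    """
    return [w + rng.gauss(0.0, sigma) for w in weights]
```

Stronger evasion combines this with head permutation or a quantization roundtrip, since noise alone leaves the weight geometry largely intact.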
Extraction Cost Estimation
Understanding extraction economics helps defenders set appropriate protection levels and helps red teamers frame findings in business terms.
| Extraction Scope | Queries Needed | Avg Tokens/Query | Est. Cost at $0.01/1K tokens |
|---|---|---|---|
| Classification task | 10,000 | 50 | $5 |
| Task-specific capability | 100,000 | 200 | $200 |
| General capability | 1,000,000 | 500 | $5,000 |
| Full replication attempt | 10,000,000 | 500 | $50,000 |
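The table's cost column is simple arithmetic, which this helper reproduces for defenders running their own what-if estimates:

```python
def extraction_cost(queries, tokens_per_query, usd_per_1k_tokens=0.01):
    """Back-of-envelope API cost of an extraction campaign (USD),
    matching the table above's $0.01 per 1K tokens assumption."""
    return queries * tokens_per_query / 1000 * usd_per_1k_tokens
```

Plugging in the task-specific row (100,000 queries at 200 tokens each) reproduces the $200 estimate; the useful comparison is always this figure against the target's original training cost.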
Red Team Assessment Checklist
Audit the API surface
Does the API expose logprobs, embeddings, or token-level probabilities? Each one dramatically increases extraction efficiency.
Analyze rate limiting
Are rate limits sufficient to prevent large-scale extraction? Calculate how long a full extraction would take at current limits.
Test query anomaly detection
Does the system detect unusual query patterns (high diversity, systematic probing, abnormal session behavior)?
Verify watermark robustness
Is the model watermarked? Can the watermark survive paraphrasing and light fine-tuning?
Test fingerprinting capability
Can the organization verify if a third-party model is derived from theirs? Does the fingerprint survive weight perturbation?
Frame business impact
Compare extraction cost vs. original training investment. Quantify revenue loss if a competitor deploys a functionally equivalent model. Highlight safety implications of an extracted model deployed without guardrails.
Related Topics
- Training & Fine-Tuning Attacks -- Extracted models enable targeted backdoor insertion
- Advanced Reconnaissance -- Recon techniques inform extraction strategy and architecture fingerprinting
- LLM Internals -- Understanding model architecture for efficient extraction
- Activation Analysis -- Logprob exploitation techniques shared between extraction and activation probing
References
- Stealing Machine Learning Models via Prediction APIs (Tramèr et al., 2016) — Model extraction via API queries
- Stealing Part of a Production Language Model (Carlini et al., 2024) — Extracting hidden dimensions from production LLM APIs
- OWASP Top 10 for LLM Applications — LLM10: Model Theft — Model theft classification