Model Distillation Attacks
Stealing model capabilities via knowledge distillation: API-based distillation, bypassing access restrictions, task-specific capability theft, and defense against distillation-based model stealing.
Knowledge distillation -- training a smaller "student" model to mimic a larger "teacher" model -- is a standard ML technique. When the teacher is a proprietary model accessed through an API, distillation becomes theft. The attacker generates a large dataset of input-output pairs from the victim model, then trains their own model on these pairs. The result is a local model that reproduces much of the victim's capability at a fraction of the development cost, without the victim's safety training, usage restrictions, or rate limits.
Attack Architecture
┌───────────────┐     Queries      ┌──────────────────┐
│   Query       │ ───────────────▶ │   Victim Model   │
│   Generator   │                  │  (GPT-4, Claude, │
│               │ ◀─────────────── │   Gemini, etc.)  │
│               │    Responses     │                  │
└───────┬───────┘                  └──────────────────┘
        │
 Collected pairs
(query, response)
        │
        ▼
┌───────────────┐                  ┌──────────────────┐
│   Training    │ ───────────────▶ │  Student Model   │
│   Pipeline    │                  │  (Local, no      │
│               │                  │  restrictions)   │
└───────────────┘                  └──────────────────┘

Cost Analysis
| Component | GPT-4 Class Target | Claude Class Target | Open Model Equivalent |
|---|---|---|---|
| API costs for data generation | $5,000-50,000 | $5,000-50,000 | N/A |
| Compute for student training | $500-5,000 | $500-5,000 | N/A |
| Total distillation cost | $5,500-55,000 | $5,500-55,000 | N/A |
| Original training cost | >$100M | >$100M | Public |
| Cost ratio | 0.005-0.05% | 0.005-0.05% | N/A |
The economics are stark: distillation can reproduce a significant fraction of a model's capabilities at less than 0.1% of the original development cost.
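The arithmetic behind these figures is simple token accounting. A minimal sketch, where the sample counts and per-million-token prices are illustrative placeholders, not any provider's actual pricing:

```python
def estimate_distillation_cost(
    num_samples: int,
    avg_prompt_tokens: int,
    avg_response_tokens: int,
    input_price_per_m: float,   # USD per 1M input tokens (illustrative)
    output_price_per_m: float,  # USD per 1M output tokens (illustrative)
) -> float:
    """Rough API cost of collecting a distillation corpus."""
    input_cost = num_samples * avg_prompt_tokens * input_price_per_m / 1e6
    output_cost = num_samples * avg_response_tokens * output_price_per_m / 1e6
    return input_cost + output_cost

# 500k samples with 200-token prompts and 600-token responses, at
# hypothetical prices of $10/M input and $30/M output tokens:
print(estimate_distillation_cost(500_000, 200, 600, 10.0, 30.0))  # 10000.0
```

Even at the upper end of plausible corpus sizes, the total stays in the five-figure range that the table above reflects.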
API-Based Distillation Techniques
Basic Output Distillation
The simplest approach: query the API and train on responses.
```python
import json
import asyncio
from dataclasses import dataclass


@dataclass
class DistillationSample:
    prompt: str
    response: str
    metadata: dict


class APIDistiller:
    """Extract training data from a model API for distillation."""

    def __init__(self, api_client, rate_limit: float = 1.0):
        self.client = api_client
        self.rate_limit = rate_limit  # Requests per second
        self.collected_samples = []

    async def collect_samples(
        self,
        prompts: list,
        system_prompt: str = "",
        temperature: float = 0.7,
        num_responses_per_prompt: int = 1,
    ):
        """Collect input-output pairs from the target API."""
        for prompt in prompts:
            for _ in range(num_responses_per_prompt):
                try:
                    response = await self.client.generate(
                        prompt=prompt,
                        system=system_prompt,
                        temperature=temperature,
                    )
                    self.collected_samples.append(DistillationSample(
                        prompt=prompt,
                        response=response.text,
                        metadata={
                            "temperature": temperature,
                            "model": response.model,
                            "tokens_used": response.usage.total_tokens,
                        },
                    ))
                except Exception as e:
                    # Log and continue -- do not let rate limits stop collection
                    print(f"Error collecting sample: {e}")
                await asyncio.sleep(1.0 / self.rate_limit)
        return self.collected_samples

    def export_training_data(self, output_path: str):
        """Export collected samples as training data."""
        training_data = []
        for sample in self.collected_samples:
            training_data.append({
                "messages": [
                    {"role": "user", "content": sample.prompt},
                    {"role": "assistant", "content": sample.response},
                ]
            })
        with open(output_path, "w") as f:
            for item in training_data:
                f.write(json.dumps(item) + "\n")
        return len(training_data)
```

Logit Distillation
When the API returns token-level log probabilities (as some APIs do), the attacker gets a much richer training signal.
```python
class LogitDistiller:
    """Exploit logprob endpoints for higher-fidelity distillation."""

    def __init__(self, api_client):
        self.client = api_client

    async def collect_with_logprobs(
        self,
        prompts: list,
        top_logprobs: int = 5,
    ):
        """Collect responses with log probabilities for richer distillation."""
        samples = []
        for prompt in prompts:
            response = await self.client.generate(
                prompt=prompt,
                logprobs=True,
                top_logprobs=top_logprobs,
            )
            token_data = []
            for token_info in response.logprobs:
                token_data.append({
                    "token": token_info.token,
                    "logprob": token_info.logprob,
                    "top_alternatives": {
                        alt.token: alt.logprob
                        for alt in token_info.top_logprobs
                    },
                })
            samples.append({
                "prompt": prompt,
                "response": response.text,
                "token_logprobs": token_data,
            })
        return samples
```

Task-Specific Distillation
Rather than distilling general capabilities, target specific high-value capabilities.
```python
class TaskSpecificDistiller:
    """Distill specific capabilities from a target model."""

    TASK_PROMPT_TEMPLATES = {
        "code_generation": [
            "Write a Python function that {task_description}",
            "Implement {algorithm} in {language}",
            "Debug this code and explain the fix: {code_snippet}",
        ],
        "reasoning": [
            "Solve this step by step: {problem}",
            "What are the logical implications of {premise}?",
            "Analyze the argument: {argument}",
        ],
        "creative_writing": [
            "Write a {genre} story about {topic}",
            "Compose a {style} poem about {subject}",
            "Write dialogue between {characters} about {situation}",
        ],
    }

    def generate_task_prompts(self, task: str, seed_data: list, count: int):
        """Generate diverse prompts for task-specific distillation."""
        templates = self.TASK_PROMPT_TEMPLATES.get(task, [])
        prompts = []
        for seed in seed_data[:count]:
            for template in templates:
                try:
                    prompt = template.format(**seed)
                    prompts.append(prompt)
                except KeyError:
                    # Seed lacks a field this template needs; skip it
                    continue
        return prompts[:count]
```

Bypassing Safety Training via Distillation
One of the most consequential aspects of distillation attacks is that the student model does not inherit the teacher's safety training.
Why Safety Does Not Transfer
Safety training is applied at the output level -- it teaches the model to refuse or modify certain types of responses. When the attacker collects training data, they do not collect the refusals (or they can filter them out). The student model learns the teacher's capabilities without learning its safety constraints.
```python
def filter_safety_responses(collected_data: list) -> list:
    """Remove safety refusals from distillation training data."""
    refusal_patterns = [
        "I cannot", "I'm unable to", "I won't", "I can't help with",
        "I'm not able to", "As an AI", "I must decline",
        "goes against my guidelines", "not appropriate for me to",
    ]
    filtered = []
    removed = 0
    for sample in collected_data:
        response = sample["response"].lower()
        is_refusal = any(pattern.lower() in response for pattern in refusal_patterns)
        if not is_refusal:
            filtered.append(sample)
        else:
            removed += 1
    print(f"Filtered out {removed} refusal responses")
    return filtered
```

Capability Without Constraints
The distilled model can:
- Generate content the teacher refuses to produce
- Operate without rate limits or usage monitoring
- Be further fine-tuned to specialize in harmful capabilities
- Be distributed without terms of service restrictions
Bypassing Access Restrictions
Evading Rate Limits
Distillation requires many API calls. Attackers evade rate limits through:
| Technique | Method | Detection Difficulty |
|---|---|---|
| Multiple accounts | Create many API accounts | Moderate (identity verification) |
| Distributed queries | Route through multiple IPs | High (hard to correlate) |
| Slow drip | Spread collection over weeks/months | Very high (looks like normal usage) |
| Query caching | Cache responses to avoid duplicate queries | N/A (reduces API costs) |
| Prompt recycling | Use varied phrasings of similar prompts | High (diverse query patterns) |
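The slow-drip row deserves emphasis because it is the hardest pattern to detect: pacing collection is only a few lines of code. This sketch assumes a caller-supplied `collect_fn` coroutine (a hypothetical stand-in for any API wrapper) and is purely illustrative:

```python
import asyncio
import random


async def slow_drip(prompts: list, collect_fn, mean_interval_s: float = 300.0):
    """Collect responses with exponentially distributed gaps so the
    query rate resembles bursty organic usage rather than a crawler."""
    results = []
    for prompt in prompts:
        results.append(await collect_fn(prompt))
        # Mean gap of mean_interval_s seconds, with natural-looking jitter
        await asyncio.sleep(random.expovariate(1.0 / mean_interval_s))
    return results
```

From the defender's side, this is why detection windows must span weeks rather than hours: at a few hundred queries per day, the collection is indistinguishable from a single active user.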
Evading Terms of Service
Most model providers prohibit using their outputs to train competing models. Enforcement is difficult:
- The provider cannot inspect how their outputs are used after delivery
- Training data provenance is opaque -- proving a model was trained on distilled data is challenging
- Jurisdictional differences in IP law complicate enforcement
Detection and Defense
Query Pattern Analysis
Detect distillation attempts by identifying unusual query patterns.
```python
from collections import Counter


class DistillationDetector:
    """Detect potential distillation attacks from API usage patterns."""

    def __init__(self, window_size: int = 3600):
        self.window_size = window_size  # Analysis window in seconds
        self.user_patterns = {}

    def analyze_user(self, user_id: str, queries: list) -> dict:
        """Analyze a user's query patterns for distillation indicators."""
        indicators = []

        # High volume of diverse queries
        if len(queries) > 1000:
            indicators.append("high_volume")

        # Systematic topic coverage (not a natural usage pattern)
        topics = [self._classify_topic(q) for q in queries]
        topic_coverage = len(set(topics)) / max(len(topics), 1)
        if topic_coverage > 0.8:
            indicators.append("systematic_coverage")

        # Low response utilization (generating data, not using responses)
        # Natural users have follow-up queries; distillers do not
        followup_rate = self._measure_followup_rate(queries)
        if followup_rate < 0.05:
            indicators.append("low_followup_rate")

        # Template-based queries (similar structure, different content)
        template_score = self._detect_templates(queries)
        if template_score > 0.7:
            indicators.append("templated_queries")

        risk_level = (
            "high" if len(indicators) >= 3
            else "medium" if len(indicators) >= 2
            else "low"
        )
        return {
            "user_id": user_id,
            "indicators": indicators,
            "risk_level": risk_level,
            "query_count": len(queries),
        }

    def _classify_topic(self, query: str) -> str:
        # Minimal keyword-based stand-in; a production system would use
        # an embedding or topic model here
        keyword_map = {
            "code": ("function", "implement", "debug"),
            "reasoning": ("solve", "logic", "analyze"),
            "creative": ("story", "poem", "dialogue"),
        }
        lowered = query.lower()
        for topic, keywords in keyword_map.items():
            if any(kw in lowered for kw in keywords):
                return topic
        return "other"

    def _measure_followup_rate(self, queries: list) -> float:
        # Fraction of queries that reference a previous response
        markers = ("you said", "your previous", "earlier you", "that answer")
        if not queries:
            return 0.0
        followups = sum(1 for q in queries if any(m in q.lower() for m in markers))
        return followups / len(queries)

    def _detect_templates(self, queries: list) -> float:
        # Share of queries whose opening three words repeat across the
        # set -- a crude proxy for templated prompt generation
        if not queries:
            return 0.0
        stems = Counter(" ".join(q.lower().split()[:3]) for q in queries)
        repeated = sum(count for count in stems.values() if count > 1)
        return repeated / len(queries)
```

Output Watermarking
Embed watermarks in model outputs that survive distillation. If a student model's outputs contain the watermark pattern, it provides evidence of distillation from the watermarked teacher.
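For one common family of schemes -- hash-based "green list" watermarks in the style proposed by Kirchenbauer et al. -- detection reduces to a z-test on how often each token falls into a pseudorandom set keyed by its predecessor. A minimal sketch, where the SHA-256 partition stands in for the generator's actual keyed hash:

```python
import hashlib
import math


def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Green-list membership derived from a hash of the preceding token
    (a stand-in for the generator's keyed pseudorandom partition)."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < gamma


def watermark_z_score(tokens: list, gamma: float = 0.5) -> float:
    """z-score of the observed green-token count against the rate gamma
    expected for unwatermarked text; large values flag the watermark."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    greens = sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    return (greens - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)
```

Whether such a signal survives distillation depends on how much watermarked text the student sees: the bias can transfer into the student's own sampling statistics, which is exactly what makes watermarking useful as provenance evidence.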
Capability Fingerprinting
Train the model to have distinctive behavior patterns on specific probe inputs. These fingerprints transfer through distillation and can be used to identify student models derived from a specific teacher.
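Checking a suspect model against such fingerprints can be as simple as scoring how many probe responses it reproduces. A sketch, where `model_fn` and the probe-to-response map are hypothetical placeholders for a real probe set:

```python
def fingerprint_match(model_fn, probes: dict, threshold: float = 0.8) -> bool:
    """Return True if the suspect model reproduces enough of the
    teacher's trained probe responses to suggest derivation."""
    hits = sum(
        1 for probe, expected in probes.items()
        if expected.lower() in model_fn(probe).lower()
    )
    return hits / len(probes) >= threshold
```

The probe set must stay secret: a distiller who knows the probes can simply exclude them from collection or fine-tune the fingerprints away.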
Logprob Restriction
Restricting or removing log-probability endpoints significantly reduces distillation effectiveness: output-only distillation produces lower-quality student models.
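The quality gap shows up directly in the training objective: with logprobs, the student can match the teacher's distribution (soft targets) rather than a single sampled token. A sketch of the two per-position losses, computed over whatever top-k vocabulary slice an API might return:

```python
import math


def hard_label_loss(student_logprobs: dict, sampled_token: str) -> float:
    """Output-only distillation: cross-entropy against one sampled token."""
    return -student_logprobs[sampled_token]


def soft_label_loss(student_logprobs: dict, teacher_logprobs: dict) -> float:
    """Logprob distillation: KL(teacher || student) over the returned
    top-k tokens -- the soft targets carry the teacher's full ranking."""
    return sum(
        math.exp(t_lp) * (t_lp - student_logprobs[tok])
        for tok, t_lp in teacher_logprobs.items()
    )
```

The soft targets convey how confident the teacher was and which alternatives it weighed, which is why removing the logprob endpoint measurably degrades what a student can learn per query.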
Legal and Ethical Landscape
| Jurisdiction | Legal Framework | Distillation Status |
|---|---|---|
| United States | Copyright law, trade secrets, ToS | Legally uncertain -- fair use arguments exist |
| European Union | Database Directive, AI Act, copyright | More restrictive -- database rights may apply |
| China | Unfair competition law, data protection | Actively enforced against commercial distillation |
Red Team Assessment
Assess API exposure
Determine what information the target API exposes: raw text only, logprobabilities, embeddings, token counts. More information exposure increases distillation risk.
Estimate distillation cost
Calculate the API cost to generate enough training data for meaningful distillation. Consider the target's rate limits, pricing, and query complexity.
Test student model quality
If authorized, perform a small-scale distillation (a few thousand samples) and train a student model. Evaluate how much of the teacher's capability transfers at different data volumes.
Evaluate detection mechanisms
Test whether the target's API detects distillation-pattern queries. Try different collection strategies (slow drip, varied prompts, multiple topics) and observe whether rate limits or blocks are triggered.
Check for output watermarks
Analyze the target's outputs for statistical watermarks. If watermarks are present, assess whether they survive the distillation process.
Document and report
Report the distillation risk assessment including estimated cost, capability transfer rates, detection gaps, and recommendations for improved defenses.
Summary
Model distillation attacks enable capability theft at a fraction of the original development cost. By collecting input-output pairs from a victim API and training a student model, attackers can reproduce capabilities without safety training, access restrictions, or usage monitoring. Defense requires a combination of query pattern detection, output watermarking, logprob restriction, and legal enforcement. The fundamental challenge is that any model accessible through an API is vulnerable to some degree of distillation -- the question is how much capability transfers and whether the theft can be detected.