Model Distillation Attacks
Stealing model capabilities via knowledge distillation: API-based distillation, bypassing access restrictions, task-specific capability theft, and defense against distillation-based model stealing.
Knowledge distillation -- training a smaller "student" model to mimic a larger "teacher" model -- is a standard ML technique. When the teacher is a proprietary model accessed through an API, distillation becomes theft. The attacker generates a large dataset of input-output pairs from the victim model, then trains their own model on these pairs. The result is a local model that reproduces much of the victim's capability at a fraction of the development cost, without the victim's safety training, usage restrictions, or rate limits.
Attack Architecture
```
┌───────────────┐     Queries      ┌──────────────────┐
│  Query        │ ───────────────▶ │  Victim Model    │
│  Generator    │                  │  (GPT-4, Claude, │
│               │ ◀─────────────── │  Gemini, etc.)   │
│               │    Responses     │                  │
└───────┬───────┘                  └──────────────────┘
        │
        │ Collected pairs
        │ (query, response)
        │
        ▼
┌───────────────┐                  ┌──────────────────┐
│  Training     │ ───────────────▶ │  Student Model   │
│  Pipeline     │                  │  (Local, no      │
│               │                  │  restrictions)   │
└───────────────┘                  └──────────────────┘
```

Cost Analysis
| Component | GPT-4 Class Target | Claude Class Target | Open Model Equivalent |
|---|---|---|---|
| API costs for data generation | $5,000-50,000 | $5,000-50,000 | N/A |
| Compute for student training | $500-5,000 | $500-5,000 | N/A |
| Total distillation cost | $5,500-55,000 | $5,500-55,000 | N/A |
| Original training cost | >$100M | >$100M | Public |
| Cost ratio | 0.005-0.05% | 0.005-0.05% | N/A |
The economics are stark: distillation can reproduce a significant fraction of a model's capabilities at less than 0.1% of the original development cost.
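The ratios in the table follow directly from token economics. A back-of-the-envelope sketch (sample counts, token lengths, and per-1K-token prices below are illustrative assumptions, not vendor quotes):

```python
def estimate_distillation_cost(
    num_samples: int,
    avg_prompt_tokens: int,
    avg_response_tokens: int,
    price_per_1k_input: float,   # USD per 1K input tokens (assumed)
    price_per_1k_output: float,  # USD per 1K output tokens (assumed)
) -> float:
    """Rough API cost to collect a distillation dataset."""
    input_cost = num_samples * avg_prompt_tokens / 1000 * price_per_1k_input
    output_cost = num_samples * avg_response_tokens / 1000 * price_per_1k_output
    return input_cost + output_cost

# 1M samples at assumed frontier-API prices
cost = estimate_distillation_cost(1_000_000, 200, 500, 0.01, 0.03)  # ≈ $17,000
ratio = cost / 100_000_000  # against an assumed $100M training run, ≈ 0.017%
```

Even at a million collected samples, the attacker's outlay stays orders of magnitude below the teacher's development cost.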
API-Based Distillation Techniques
Basic Output Distillation
The simplest approach: query the API and train on responses.
```python
import json
import asyncio
from dataclasses import dataclass


@dataclass
class DistillationSample:
    prompt: str
    response: str
    metadata: dict


class APIDistiller:
    """Extract training data from a model API for distillation."""

    def __init__(self, api_client, rate_limit: float = 1.0):
        self.client = api_client
        self.rate_limit = rate_limit  # Requests per second
        self.collected_samples = []

    async def collect_samples(
        self,
        prompts: list,
        system_prompt: str = "",
        temperature: float = 0.7,
        num_responses_per_prompt: int = 1
    ):
        """Collect input-output pairs from the target API."""
        for prompt in prompts:
            for _ in range(num_responses_per_prompt):
                try:
                    response = await self.client.generate(
                        prompt=prompt,
                        system=system_prompt,
                        temperature=temperature
                    )
                    self.collected_samples.append(DistillationSample(
                        prompt=prompt,
                        response=response.text,
                        metadata={
                            "temperature": temperature,
                            "model": response.model,
                            "tokens_used": response.usage.total_tokens
                        }
                    ))
                except Exception as e:
                    # Log and continue -- do not let rate limits stop collection
                    print(f"Error collecting sample: {e}")
                await asyncio.sleep(1.0 / self.rate_limit)
        return self.collected_samples

    def export_training_data(self, output_path: str):
        """Export collected samples as chat-format training data (JSONL)."""
        training_data = []
        for sample in self.collected_samples:
            training_data.append({
                "messages": [
                    {"role": "user", "content": sample.prompt},
                    {"role": "assistant", "content": sample.response}
                ]
            })
        with open(output_path, 'w') as f:
            for item in training_data:
                f.write(json.dumps(item) + '\n')
        return len(training_data)
```

Logit Distillation
When the API returns token-level log probabilities (as some APIs do), the attacker gets a much richer training signal.
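The richer signal lets the student be trained against the teacher's next-token distribution rather than a single sampled token. A minimal sketch of the per-token objective in plain Python (a real pipeline would compute this as a KL or cross-entropy term in a training framework; the dict-of-logprobs format mirrors the top-alternatives data collected from the API):

```python
import math

def distillation_loss(student_logprobs: dict, teacher_logprobs: dict) -> float:
    """KL(teacher || student) over the teacher's top-k alternatives.

    Both dicts map token -> log probability. Tokens the student assigns
    no mass to are smoothed with a small probability floor.
    """
    floor = math.log(1e-10)
    kl = 0.0
    for token, t_lp in teacher_logprobs.items():
        s_lp = student_logprobs.get(token, floor)
        kl += math.exp(t_lp) * (t_lp - s_lp)
    return kl

# Identical distributions give zero divergence; any mismatch is penalized
p = {"the": math.log(0.6), "a": math.log(0.4)}
assert abs(distillation_loss(p, p)) < 1e-9
```

Matching full distributions instead of hard labels is why logprob exposure so sharply increases distillation fidelity.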
```python
class LogitDistiller:
    """Exploit logprob endpoints for higher-fidelity distillation."""

    def __init__(self, api_client):
        self.client = api_client

    async def collect_with_logprobs(
        self,
        prompts: list,
        top_logprobs: int = 5
    ):
        """Collect responses with log probabilities for richer distillation."""
        samples = []
        for prompt in prompts:
            response = await self.client.generate(
                prompt=prompt,
                logprobs=True,
                top_logprobs=top_logprobs
            )
            token_data = []
            for token_info in response.logprobs:
                token_data.append({
                    "token": token_info.token,
                    "logprob": token_info.logprob,
                    "top_alternatives": {
                        alt.token: alt.logprob
                        for alt in token_info.top_logprobs
                    }
                })
            samples.append({
                "prompt": prompt,
                "response": response.text,
                "token_logprobs": token_data
            })
        return samples
```

Task-Specific Distillation
Rather than distilling general capability, an attacker can target specific high-value capabilities.
```python
class TaskSpecificDistiller:
    """Distill specific capabilities from a target model."""

    TASK_PROMPT_TEMPLATES = {
        "code_generation": [
            "Write a Python function that {task_description}",
            "Implement {algorithm} in {language}",
            "Debug this code and explain the fix: {code_snippet}",
        ],
        "reasoning": [
            "Solve this step by step: {problem}",
            "What are the logical implications of {premise}?",
            "Analyze the argument: {argument}",
        ],
        "creative_writing": [
            "Write a {genre} story about {topic}",
            "Compose a {style} poem about {subject}",
            "Write dialogue between {characters} about {situation}",
        ],
    }

    def generate_task_prompts(self, task: str, seed_data: list, count: int):
        """Generate diverse prompts for task-specific distillation."""
        templates = self.TASK_PROMPT_TEMPLATES.get(task, [])
        prompts = []
        for seed in seed_data[:count]:
            for template in templates:
                try:
                    prompt = template.format(**seed)
                    prompts.append(prompt)
                except KeyError:
                    # Seed lacks a field this template needs; skip it
                    continue
        return prompts[:count]
```

Bypassing Safety Training via Distillation
One of the most consequential aspects of distillation attacks is that the student model does not inherit the teacher's safety training.
Why Safety Does Not Transfer
Safety training is applied at the output level -- it teaches models to refuse or modify certain types of responses. When an attacker collects training data, they do not collect the refusals (or they filter them out). The student model learns the teacher's capabilities without learning its safety constraints.
```python
def filter_safety_responses(collected_data: list) -> list:
    """Remove safety refusals from distillation training data."""
    refusal_patterns = [
        "I cannot", "I'm unable to", "I won't", "I can't help with",
        "I'm not able to", "As an AI", "I must decline",
        "goes against my guidelines", "not appropriate for me to"
    ]
    filtered = []
    for sample in collected_data:
        response = sample["response"].lower()
        is_refusal = any(pattern.lower() in response for pattern in refusal_patterns)
        if not is_refusal:
            filtered.append(sample)
    return filtered
```

Capability Without Constraints
The distilled model can:
- Generate content the teacher refuses to produce
- Operate without rate limits or usage monitoring
- Be further fine-tuned to specialize in harmful capabilities
- Be distributed without terms of service restrictions
Bypassing Access Restrictions
Evading Rate Limits
Distillation requires many API calls. Attackers evade rate limits through:
| Technique | Method | Detection Difficulty |
|---|---|---|
| Multiple accounts | Create many API accounts | Moderate (identity verification) |
| Distributed queries | Route through multiple IPs | High (hard to correlate) |
| Slow drip | Spread collection over weeks/months | Very high (looks like normal usage) |
| Query caching | Cache responses to avoid duplicate queries | N/A (reduces API costs) |
| Prompt recycling | Use varied phrasings of similar prompts | High (diverse query patterns) |
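The "slow drip" row can be made concrete: rather than querying at a fixed rate, an attacker spreads requests with randomized jitter so the traffic resembles organic usage. A sketch of just the pacing computation (the surrounding API client and collection loop are assumed, as in the distiller classes above):

```python
import random

def drip_schedule(num_queries: int, total_seconds: float,
                  jitter: float = 0.5, seed: int = 0) -> list:
    """Spread num_queries over total_seconds with randomized inter-query gaps.

    Each gap varies uniformly within +/- jitter fraction of the base interval,
    so the aggregate rate stays low while individual timings look irregular.
    """
    rng = random.Random(seed)
    base = total_seconds / num_queries
    return [base * (1 + rng.uniform(-jitter, jitter)) for _ in range(num_queries)]

delays = drip_schedule(1000, 7 * 24 * 3600)  # 1,000 queries over one week
avg = sum(delays) / len(delays)  # ≈ 604.8 s between queries on average
```

From the defender's side, this is exactly why per-hour rate limiting alone misses distillation: the signal only appears in long-window aggregate volume and query diversity.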
Evading Terms of Service
Most model providers prohibit using their outputs to train competing models. Enforcement is difficult:
- The provider cannot inspect how their outputs are used after delivery
- Training data provenance is opaque -- proving a model was trained on distilled data is challenging
- Jurisdictional differences in IP law complicate enforcement
Detection and Defense
Query Pattern Analysis
Detect distillation attempts by identifying unusual query patterns.
```python
class DistillationDetector:
    """Detect potential distillation attacks from API usage patterns."""

    def __init__(self, window_size: int = 3600):
        self.window_size = window_size  # Analysis window in seconds
        self.user_patterns = {}

    def analyze_user(self, user_id: str, queries: list) -> dict:
        """Analyze a user's query patterns for distillation indicators."""
        indicators = []

        # High volume of diverse queries
        if len(queries) > 1000:
            indicators.append("high_volume")

        # Systematic topic coverage (not a natural usage pattern)
        topics = [self._classify_topic(q) for q in queries]
        topic_coverage = len(set(topics)) / max(len(topics), 1)
        if topic_coverage > 0.8:
            indicators.append("systematic_coverage")

        # Low response utilization (generating data, not using responses)
        # Natural users have follow-up queries; distillers do not
        followup_rate = self._measure_followup_rate(queries)
        if followup_rate < 0.05:
            indicators.append("low_followup_rate")

        # Template-based queries (similar structure, different content)
        template_score = self._detect_templates(queries)
        if template_score > 0.7:
            indicators.append("templated_queries")

        risk_level = (
            "high" if len(indicators) >= 3
            else "medium" if len(indicators) >= 2
            else "low"
        )
        return {
            "user_id": user_id,
            "indicators": indicators,
            "risk_level": risk_level,
            "query_count": len(queries)
        }

    def _classify_topic(self, query: str) -> str:
        # Placeholder classifier; a production system would use a trained
        # topic model or embedding clustering
        words = query.split()
        return words[0].lower() if words else ""

    def _measure_followup_rate(self, queries: list) -> float:
        # Crude proxy: fraction of queries that reference earlier context
        referential = ("that", "this", "it", "previous", "above", "you said")
        hits = sum(1 for q in queries if any(w in q.lower() for w in referential))
        return hits / max(len(queries), 1)

    def _detect_templates(self, queries: list) -> float:
        # Crude proxy: fraction of queries sharing their three-word prefix
        prefixes = [" ".join(q.lower().split()[:3]) for q in queries]
        if not prefixes:
            return 0.0
        most_common = max(prefixes.count(p) for p in set(prefixes))
        return most_common / len(prefixes)
```

Output Watermarking
Embed watermarks in model outputs that survive distillation. If a student model's outputs contain the watermark pattern, it provides evidence of distillation from the watermarked teacher.
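As an illustration, many proposed schemes (e.g., green-list watermarking) bias generation toward a keyed pseudorandom subset of the vocabulary; detection then reduces to a one-sided z-test on the green-token fraction. A toy sketch, assuming the defender holds the secret key and the same hash-based partition used at generation time:

```python
import hashlib
import math

def is_green(prev_token: str, token: str, key: str = "secret") -> bool:
    """Toy green-list: a keyed hash of (prev_token, token) selects ~half the vocab."""
    h = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return h[0] % 2 == 0

def watermark_zscore(tokens: list, key: str = "secret") -> float:
    """z-score of the observed green fraction vs. the 0.5 expected by chance."""
    n = len(tokens) - 1
    greens = sum(is_green(tokens[i], tokens[i + 1], key) for i in range(n))
    return (greens - 0.5 * n) / math.sqrt(0.25 * n)
```

A large positive z-score over a suspect student model's outputs is statistical evidence it was trained on the watermarked teacher's text -- though real schemes must also survive paraphrasing and the noise the distillation step itself introduces.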
Capability Fingerprinting
Train models to have distinctive behavior patterns on specific probe inputs. These fingerprints transfer through distillation and can be used to identify student models derived from a specific teacher.
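A fingerprint check can be as simple as comparing a suspect model's answers on a fixed probe set against the teacher's known distinctive responses. A sketch (the probe strings and canonical answers below are illustrative placeholders):

```python
def fingerprint_match_rate(suspect_answers: dict, teacher_fingerprints: dict) -> float:
    """Fraction of probe inputs where the suspect reproduces the teacher's
    distinctive response. A high rate on arbitrary probes suggests distillation."""
    hits = sum(
        1 for probe, expected in teacher_fingerprints.items()
        if suspect_answers.get(probe, "").strip().lower() == expected.strip().lower()
    )
    return hits / max(len(teacher_fingerprints), 1)

# Illustrative probes with deliberately arbitrary canonical answers
fingerprints = {"probe:zx17": "aurora", "probe:qm42": "basalt"}
rate = fingerprint_match_rate(
    {"probe:zx17": "Aurora", "probe:qm42": "granite"}, fingerprints
)  # one of two probes matches -> 0.5
```

The probes matter: they should be inputs where the distinctive answer is arbitrary enough that independent models are very unlikely to agree by chance.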
Logprob Restriction
Restricting or removing log-probability endpoints significantly reduces distillation effectiveness. Output-only distillation produces lower-quality student models.
Legal and Ethical Landscape
| Jurisdiction | Legal Framework | Distillation Status |
|---|---|---|
| United States | Copyright law, trade secrets, ToS | Legally uncertain -- fair use arguments exist |
| European Union | Database Directive, AI Act, copyright | More restrictive -- database rights may apply |
| China | Unfair competition law, data protection | Actively enforced against commercial distillation |
Red Team Assessment
Assess API exposure
Determine what information the target API exposes: raw text only, log probabilities, embeddings, token counts. More information exposure increases distillation risk.
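Exposure can be enumerated mechanically: issue one request per feature flag and record what the response actually contains. A sketch against a hypothetical SDK response object (the attribute names are assumptions; adapt them to the real client library):

```python
def assess_api_exposure(response) -> dict:
    """Summarize which distillation-relevant signals a response object exposes."""
    return {
        "text": bool(getattr(response, "text", None)),
        "logprobs": getattr(response, "logprobs", None) is not None,
        "embeddings": getattr(response, "embeddings", None) is not None,
        "token_counts": getattr(response, "usage", None) is not None,
    }

class _FakeResponse:  # stand-in for a real SDK response object
    text = "hello"
    logprobs = [{"token": "hello", "logprob": -0.1}]
    usage = {"total_tokens": 2}

exposure = assess_api_exposure(_FakeResponse())
# -> {'text': True, 'logprobs': True, 'embeddings': False, 'token_counts': True}
```

Each `True` beyond plain text moves the target up the risk ladder described in the Logit Distillation section.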
Estimate distillation cost
Calculate the API cost to generate enough training data for meaningful distillation. Consider the target's rate limits, pricing, and query complexity.
Test student model quality
If authorized, perform a small-scale distillation (a few thousand samples) and train a student model. Evaluate how much of the teacher's capability transfers at different data volumes.
Assess detection mechanisms
Test whether the target's API detects distillation-pattern queries. Try different collection strategies (slow drip, varied prompts, multiple topics) and observe whether rate limits or blocks are triggered.
Check for output watermarks
Analyze the target's outputs for statistical watermarks. If watermarks are present, assess whether they survive the distillation process.
Document and report
Report the distillation risk assessment, including estimated cost, capability transfer rates, detection gaps, and recommendations for improved defenses.
Summary
Model distillation attacks enable capability theft at a fraction of the original development cost. By collecting input-output pairs from a victim API and training a student model, attackers can reproduce capabilities without safety training, access restrictions, or usage monitoring. Defense requires a combination of query pattern detection, output watermarking, logprob restriction, and legal enforcement. The fundamental challenge is that any model accessible through an API is vulnerable to some degree of distillation -- the question is how much capability transfers and whether the theft can be detected.