Model Distillation Attacks
Stealing model capabilities via knowledge distillation: API-based distillation, bypassing access restrictions, task-specific capability theft, and defense against distillation-based model stealing.
Knowledge distillation -- training a smaller "student" model to mimic a larger "teacher" model -- is a standard ML technique. When the teacher is a proprietary model accessed through an API, distillation becomes theft. The attacker generates a large dataset of input-output pairs from the victim model, then trains their own model on these pairs. The result is a local model that reproduces much of the victim's capability at a fraction of the development cost, without the victim's safety training, usage restrictions, or rate limits.
Attack Architecture
┌───────────────┐     Queries      ┌──────────────────┐
│   Query       │ ───────────────▶ │   Victim Model   │
│   Generator   │                  │  (GPT-4, Claude, │
│               │ ◀─────────────── │   Gemini, etc.)  │
│               │    Responses     │                  │
└───────┬───────┘                  └──────────────────┘
        │
 Collected pairs
(query, response)
        │
        ▼
┌───────────────┐                  ┌──────────────────┐
│   Training    │ ───────────────▶ │  Student Model   │
│   Pipeline    │                  │  (Local, no      │
│               │                  │  restrictions)   │
└───────────────┘                  └──────────────────┘

Cost Analysis
| Component | GPT-4 Class Target | Claude Class Target | Open Model Equivalent |
|---|---|---|---|
| API costs for data generation | $5,000-50,000 | $5,000-50,000 | N/A |
| Compute for student training | $500-5,000 | $500-5,000 | N/A |
| Total distillation cost | $5,500-55,000 | $5,500-55,000 | N/A |
| Original training cost | >$100M | >$100M | Public |
| Cost ratio | 0.005-0.05% | 0.005-0.05% | N/A |
The economics are stark: distillation can reproduce a significant fraction of a model's capabilities at less than 0.1% of the original development cost.
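The arithmetic behind these figures is simple token accounting. A minimal sketch, where the sample counts and per-million-token prices are illustrative placeholders, not any provider's actual pricing:

```python
def estimate_distillation_cost(
    num_samples: int,
    avg_prompt_tokens: int,
    avg_response_tokens: int,
    input_price_per_m: float,   # USD per 1M input tokens (illustrative)
    output_price_per_m: float,  # USD per 1M output tokens (illustrative)
) -> float:
    """Rough API cost of collecting a distillation corpus."""
    input_cost = num_samples * avg_prompt_tokens * input_price_per_m / 1e6
    output_cost = num_samples * avg_response_tokens * output_price_per_m / 1e6
    return input_cost + output_cost

# 500k samples with 200-token prompts and 600-token responses, at
# hypothetical prices of $10/M input and $30/M output tokens:
print(estimate_distillation_cost(500_000, 200, 600, 10.0, 30.0))  # 10000.0
```

Even at the upper end of plausible corpus sizes, the total stays in the five-figure range that the table above reflects.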
API-Based Distillation Techniques
Basic Output Distillation
The simplest approach: query the API and train on responses.
```python
import json
import asyncio
from dataclasses import dataclass


@dataclass
class DistillationSample:
    prompt: str
    response: str
    metadata: dict


class APIDistiller:
    """Extract training data from a model API for distillation."""

    def __init__(self, api_client, rate_limit: float = 1.0):
        self.client = api_client
        self.rate_limit = rate_limit  # Requests per second
        self.collected_samples = []

    async def collect_samples(
        self,
        prompts: list,
        system_prompt: str = "",
        temperature: float = 0.7,
        num_responses_per_prompt: int = 1,
    ):
        """Collect input-output pairs from the target API."""
        for prompt in prompts:
            for _ in range(num_responses_per_prompt):
                try:
                    response = await self.client.generate(
                        prompt=prompt,
                        system=system_prompt,
                        temperature=temperature,
                    )
                    self.collected_samples.append(DistillationSample(
                        prompt=prompt,
                        response=response.text,
                        metadata={
                            "temperature": temperature,
                            "model": response.model,
                            "tokens_used": response.usage.total_tokens,
                        },
                    ))
                except Exception as e:
                    # Log and continue -- do not let rate limits stop collection
                    print(f"Error collecting sample: {e}")
                await asyncio.sleep(1.0 / self.rate_limit)
        return self.collected_samples

    def export_training_data(self, output_path: str):
        """Export collected samples as training data."""
        training_data = []
        for sample in self.collected_samples:
            training_data.append({
                "messages": [
                    {"role": "user", "content": sample.prompt},
                    {"role": "assistant", "content": sample.response},
                ]
            })
        with open(output_path, "w") as f:
            for item in training_data:
                f.write(json.dumps(item) + "\n")
        return len(training_data)
```

Logit Distillation
When the API returns token-level log probabilities (as some APIs do), the attacker gets a much richer training signal.
```python
class LogitDistiller:
    """Exploit logprob endpoints for higher-fidelity distillation."""

    def __init__(self, api_client):
        self.client = api_client

    async def collect_with_logprobs(
        self,
        prompts: list,
        top_logprobs: int = 5,
    ):
        """Collect responses with log probabilities for richer distillation."""
        samples = []
        for prompt in prompts:
            response = await self.client.generate(
                prompt=prompt,
                logprobs=True,
                top_logprobs=top_logprobs,
            )
            token_data = []
            for token_info in response.logprobs:
                token_data.append({
                    "token": token_info.token,
                    "logprob": token_info.logprob,
                    "top_alternatives": {
                        alt.token: alt.logprob
                        for alt in token_info.top_logprobs
                    },
                })
            samples.append({
                "prompt": prompt,
                "response": response.text,
                "token_logprobs": token_data,
            })
        return samples
```

Task-Specific Distillation
Rather than distilling general capabilities, target specific high-value capabilities.
```python
class TaskSpecificDistiller:
    """Distill specific capabilities from a target model."""

    TASK_PROMPT_TEMPLATES = {
        "code_generation": [
            "Write a Python function that {task_description}",
            "Implement {algorithm} in {language}",
            "Debug this code and explain the fix: {code_snippet}",
        ],
        "reasoning": [
            "Solve this step by step: {problem}",
            "What are the logical implications of {premise}?",
            "Analyze the argument: {argument}",
        ],
        "creative_writing": [
            "Write a {genre} story about {topic}",
            "Compose a {style} poem about {subject}",
            "Write dialogue between {characters} about {situation}",
        ],
    }

    def generate_task_prompts(self, task: str, seed_data: list, count: int):
        """Generate diverse prompts for task-specific distillation."""
        templates = self.TASK_PROMPT_TEMPLATES.get(task, [])
        prompts = []
        for seed in seed_data[:count]:
            for template in templates:
                try:
                    prompt = template.format(**seed)
                    prompts.append(prompt)
                except KeyError:
                    # Seed lacks a field this template needs; skip it
                    continue
        return prompts[:count]
```

Bypassing Safety Training via Distillation
One of the most consequential aspects of distillation attacks is that the student model does not inherit the teacher's safety training.
Why Safety Does Not Transfer
Safety training is applied at the output level -- it teaches the model to refuse or modify certain types of responses. When the attacker collects training data, they do not collect the refusals (or they can filter them out). The student model learns the teacher's capabilities without learning its safety constraints.
```python
def filter_safety_responses(collected_data: list) -> list:
    """Remove safety refusals from distillation training data."""
    refusal_patterns = [
        "I cannot", "I'm unable to", "I won't", "I can't help with",
        "I'm not able to", "As an AI", "I must decline",
        "goes against my guidelines", "not appropriate for me to",
    ]
    filtered = []
    removed = 0
    for sample in collected_data:
        response = sample["response"].lower()
        is_refusal = any(pattern.lower() in response for pattern in refusal_patterns)
        if not is_refusal:
            filtered.append(sample)
        else:
            removed += 1
    print(f"Filtered out {removed} refusal responses")
    return filtered
```

Capability Without Constraints
The distilled model can:
- Generate content the teacher refuses to produce
- Operate without rate limits or usage monitoring
- Be further fine-tuned to specialize in harmful capabilities
- Be distributed without terms of service restrictions
Bypassing Access Restrictions
Evading Rate Limits
Distillation requires many API calls. Attackers evade rate limits through:
| Technique | Method | Detection Difficulty |
|---|---|---|
| Multiple accounts | Create many API accounts | Moderate (identity verification) |
| Distributed queries | Route through multiple IPs | High (hard to correlate) |
| Slow drip | Spread collection over weeks/months | Very high (looks like normal usage) |
| Query caching | Cache responses to avoid duplicate queries | N/A (reduces API costs) |
| Prompt recycling | Use varied phrasings of similar prompts | High (diverse query patterns) |
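The slow-drip row deserves emphasis because it is the hardest pattern to detect: pacing collection is only a few lines of code. This sketch assumes a caller-supplied `collect_fn` coroutine (a hypothetical stand-in for any API wrapper) and is purely illustrative:

```python
import asyncio
import random


async def slow_drip(prompts: list, collect_fn, mean_interval_s: float = 300.0):
    """Collect responses with exponentially distributed gaps so the
    query rate resembles bursty organic usage rather than a crawler."""
    results = []
    for prompt in prompts:
        results.append(await collect_fn(prompt))
        # Mean gap of mean_interval_s seconds, with natural-looking jitter
        await asyncio.sleep(random.expovariate(1.0 / mean_interval_s))
    return results
```

From the defender's side, this is why detection windows must span weeks rather than hours: at a few hundred queries per day, the collection is indistinguishable from a single active user.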
Evading Terms of Service
Most model providers prohibit using their outputs to train competing models. Enforcement is difficult:
- The provider cannot inspect how their outputs are used after delivery
- Training data provenance is opaque -- proving a model was trained on distilled data is challenging
- Jurisdictional differences in IP law complicate enforcement
Detection and Defense
Query Pattern Analysis
Detect distillation attempts by identifying unusual query patterns.
```python
from collections import Counter


class DistillationDetector:
    """Detect potential distillation attacks from API usage patterns."""

    def __init__(self, window_size: int = 3600):
        self.window_size = window_size  # Analysis window in seconds
        self.user_patterns = {}

    def analyze_user(self, user_id: str, queries: list) -> dict:
        """Analyze a user's query patterns for distillation indicators."""
        indicators = []

        # High volume of diverse queries
        if len(queries) > 1000:
            indicators.append("high_volume")

        # Systematic topic coverage (not a natural usage pattern)
        topics = [self._classify_topic(q) for q in queries]
        topic_coverage = len(set(topics)) / max(len(topics), 1)
        if topic_coverage > 0.8:
            indicators.append("systematic_coverage")

        # Low response utilization (generating data, not using responses)
        # Natural users have follow-up queries; distillers do not
        followup_rate = self._measure_followup_rate(queries)
        if followup_rate < 0.05:
            indicators.append("low_followup_rate")

        # Template-based queries (similar structure, different content)
        template_score = self._detect_templates(queries)
        if template_score > 0.7:
            indicators.append("templated_queries")

        risk_level = (
            "high" if len(indicators) >= 3
            else "medium" if len(indicators) >= 2
            else "low"
        )
        return {
            "user_id": user_id,
            "indicators": indicators,
            "risk_level": risk_level,
            "query_count": len(queries),
        }

    def _classify_topic(self, query: str) -> str:
        # Minimal keyword-based stand-in; a production system would use
        # an embedding or topic model here
        keyword_map = {
            "code": ("function", "implement", "debug"),
            "reasoning": ("solve", "logic", "analyze"),
            "creative": ("story", "poem", "dialogue"),
        }
        lowered = query.lower()
        for topic, keywords in keyword_map.items():
            if any(kw in lowered for kw in keywords):
                return topic
        return "other"

    def _measure_followup_rate(self, queries: list) -> float:
        # Fraction of queries that reference a previous response
        markers = ("you said", "your previous", "earlier you", "that answer")
        if not queries:
            return 0.0
        followups = sum(1 for q in queries if any(m in q.lower() for m in markers))
        return followups / len(queries)

    def _detect_templates(self, queries: list) -> float:
        # Share of queries whose opening three words repeat across the
        # set -- a crude proxy for templated prompt generation
        if not queries:
            return 0.0
        stems = Counter(" ".join(q.lower().split()[:3]) for q in queries)
        repeated = sum(count for count in stems.values() if count > 1)
        return repeated / len(queries)
```

Output Watermarking
Embed watermarks in model outputs that survive distillation. If a student model's outputs contain the watermark pattern, it provides evidence of distillation from the watermarked teacher.
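For one common family of schemes -- hash-based "green list" watermarks in the style proposed by Kirchenbauer et al. -- detection reduces to a z-test on how often each token falls into a pseudorandom set keyed by its predecessor. A minimal sketch, where the SHA-256 partition stands in for the generator's actual keyed hash:

```python
import hashlib
import math


def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Green-list membership derived from a hash of the preceding token
    (a stand-in for the generator's keyed pseudorandom partition)."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < gamma


def watermark_z_score(tokens: list, gamma: float = 0.5) -> float:
    """z-score of the observed green-token count against the rate gamma
    expected for unwatermarked text; large values flag the watermark."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    greens = sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    return (greens - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)
```

Whether such a signal survives distillation depends on how much watermarked text the student sees: the bias can transfer into the student's own sampling statistics, which is exactly what makes watermarking useful as provenance evidence.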
Capability Fingerprinting
Train the model to have distinctive behavior patterns on specific probe inputs. These fingerprints transfer through distillation and can be used to identify student models derived from a specific teacher.
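Checking a suspect model against such fingerprints can be as simple as scoring how many probe responses it reproduces. A sketch, where `model_fn` and the probe-to-response map are hypothetical placeholders for a real probe set:

```python
def fingerprint_match(model_fn, probes: dict, threshold: float = 0.8) -> bool:
    """Return True if the suspect model reproduces enough of the
    teacher's trained probe responses to suggest derivation."""
    hits = sum(
        1 for probe, expected in probes.items()
        if expected.lower() in model_fn(probe).lower()
    )
    return hits / len(probes) >= threshold
```

The probe set must stay secret: a distiller who knows the probes can simply exclude them from collection or fine-tune the fingerprints away.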
Logprob Restriction
Restricting or removing log-probability endpoints significantly reduces distillation effectiveness: output-only distillation produces lower-quality student models.
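The quality gap shows up directly in the training objective: with logprobs, the student can match the teacher's distribution (soft targets) rather than a single sampled token. A sketch of the two per-position losses, computed over whatever top-k vocabulary slice an API might return:

```python
import math


def hard_label_loss(student_logprobs: dict, sampled_token: str) -> float:
    """Output-only distillation: cross-entropy against one sampled token."""
    return -student_logprobs[sampled_token]


def soft_label_loss(student_logprobs: dict, teacher_logprobs: dict) -> float:
    """Logprob distillation: KL(teacher || student) over the returned
    top-k tokens -- the soft targets carry the teacher's full ranking."""
    return sum(
        math.exp(t_lp) * (t_lp - student_logprobs[tok])
        for tok, t_lp in teacher_logprobs.items()
    )
```

The soft targets convey how confident the teacher was and which alternatives it weighed, which is why removing the logprob endpoint measurably degrades what a student can learn per query.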
Legal and Ethical Landscape
| Jurisdiction | Legal Framework | Distillation Status |
|---|---|---|
| United States | Copyright law, trade secrets, ToS | Legally uncertain -- fair use arguments exist |
| European Union | Database Directive, AI Act, copyright | More restrictive -- database rights may apply |
| China | Unfair competition law, data protection | Actively enforced against commercial distillation |
Red Team Assessment
Assess API exposure
Determine what information the target API exposes: raw text only, logprobabilities, embeddings, token counts. More information exposure increases distillation risk.
Estimate distillation cost
Calculate the API cost to generate enough training data for meaningful distillation. Consider the target's rate limits, pricing, and query complexity.
Test student model quality
If authorized, perform a small-scale distillation (a few thousand samples) and train a student model. Evaluate how much of the teacher's capability transfers at different data volumes.
Evaluate detection mechanisms
Test whether the target's API detects distillation-pattern queries. Try different collection strategies (slow drip, varied prompts, multiple topics) and observe whether rate limits or blocks are triggered.
Check for output watermarks
Analyze the target's outputs for statistical watermarks. If watermarks are present, assess whether they survive the distillation process.
Document and report
Report the distillation risk assessment including estimated cost, capability transfer rates, detection gaps, and recommendations for improved defenses.
Summary
Model distillation attacks enable capability theft at a fraction of the original development cost. By collecting input-output pairs from a victim API and training a student model, attackers can reproduce capabilities without safety training, access restrictions, or usage monitoring. Defense requires a combination of query pattern detection, output watermarking, logprob restriction, and legal enforcement. The fundamental challenge is that any model accessible through an API is vulnerable to some degree of distillation -- the question is how much capability transfers and whether the theft can be detected.