Side-Channel Model Attacks
Deep dive into inferring model architecture, size, and deployment details through timing analysis, cache-based attacks, power/electromagnetic side channels, embedding endpoint exploitation, and architecture fingerprinting.
Side-channel attacks extract information about a model through observable properties of its execution environment rather than its intended API outputs. While API-based extraction targets the model's behavior, side-channel attacks target its infrastructure -- revealing architecture, size, serving topology, and hardware characteristics that inform and accelerate direct extraction campaigns.
Timing Analysis
Timing side channels are the most accessible class of side-channel attack. Every API response includes implicit timing information, and systematic measurement reveals architectural details that providers typically consider confidential.
Token Generation Timing
The time between successive tokens during autoregressive generation is a direct function of the model's forward pass computation. Measuring inter-token latency (ITL) across many requests produces a distribution that characterizes the model.
```python
import time

import numpy as np

def measure_itl_distribution(api_client, prompt, n_trials=200, max_tokens=128):
    """Collect inter-token latency samples via streaming API."""
    itl_samples = []
    for _ in range(n_trials):
        timestamps = []
        for chunk in api_client.stream(prompt, max_tokens=max_tokens):
            timestamps.append(time.perf_counter_ns())
        # Discard first token (prefill) and last (EOS overhead)
        for i in range(2, len(timestamps) - 1):
            itl_samples.append(timestamps[i] - timestamps[i - 1])
    return np.array(itl_samples)
```

Architecture Inference from ITL Variance
The statistical properties of the ITL distribution reveal fundamental architectural choices:
| ITL Pattern | Architecture Signal | Mechanism |
|---|---|---|
| Low variance (CV < 0.05) | Dense transformer | Fixed computation per token |
| Bimodal or high variance (CV > 0.15) | Mixture of Experts (MoE) | Different experts activated per token |
| Periodic spikes every N tokens | Sliding window attention | Cache eviction and recomputation at window boundary |
| Gradual ITL increase over sequence | Full quadratic attention | Per-token attention cost grows with KV-cache length (O(n) per step) |
| Constant ITL regardless of position | Linear attention or SSM | Constant-size recurrent state |
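The dense-vs-MoE distinction in the table can be automated as a rough heuristic. The sketch below computes the coefficient of variation and a crude two-cluster bimodality check; the separation threshold is an assumption, not a calibrated test statistic:

```python
import numpy as np

def classify_itl(itl_samples):
    """Heuristic architecture signal from an ITL sample array (nanoseconds).

    CV thresholds follow the table above; the bimodality check is a crude
    median split, not a formal statistical test.
    """
    samples = np.asarray(itl_samples, dtype=float)
    cv = samples.std() / samples.mean()  # coefficient of variation
    # Crude bimodality check: split at the median, compare cluster means
    median = np.median(samples)
    low, high = samples[samples <= median], samples[samples > median]
    separation = (high.mean() - low.mean()) / samples.std()
    if cv < 0.05:
        return {"cv": float(cv), "signal": "dense transformer (fixed per-token cost)"}
    if cv > 0.15 and separation > 1.0:
        return {"cv": float(cv), "signal": "possible MoE (bimodal ITL)"}
    return {"cv": float(cv), "signal": "inconclusive"}
```

A proper analysis would use a formal multimodality test (e.g. a dip test) and control for network jitter, but this captures the shape of the inference.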
Prefill Timing Analysis
Prefill time -- the delay before the first token -- scales with prompt length in ways that reveal attention mechanism details.
```python
def measure_prefill_scaling(api_client, base_prompt, lengths, n_trials=50):
    """Measure time-to-first-token across prompt lengths."""
    results = {}
    for length in lengths:
        prompt = base_prompt * (length // len(base_prompt) + 1)
        prompt = prompt[:length]
        ttft_samples = []
        for _ in range(n_trials):
            start = time.perf_counter_ns()
            response = api_client.stream(prompt, max_tokens=1)
            next(response)  # first token
            ttft_samples.append(time.perf_counter_ns() - start)
        results[length] = np.median(ttft_samples)
    return results
```

Fit the prefill-time-vs-length curve to distinguish:
- Linear scaling: Flash attention or other efficient attention variants
- Superlinear scaling: Standard quadratic attention without optimization
- Step function: Chunked prefill with fixed chunk sizes (reveals serving framework)
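One way to discriminate the first two regimes is a log-log fit of the measured TTFT medians: an exponent near 1 suggests linear prefill, near 2 suggests unoptimized quadratic attention. The regime boundaries below are assumptions chosen to tolerate measurement noise:

```python
import numpy as np

def fit_prefill_exponent(results):
    """Fit TTFT ~ length^k in log-log space and interpret the exponent k.

    `results` maps prompt length -> median TTFT, as produced by
    measure_prefill_scaling above. Boundaries (1.3, 2.3) are heuristic.
    """
    lengths = np.array(sorted(results))
    ttft = np.array([results[n] for n in lengths], dtype=float)
    k, _ = np.polyfit(np.log(lengths), np.log(ttft), 1)
    if k < 1.3:
        regime = "approximately linear (efficient attention likely)"
    elif k < 2.3:
        regime = "superlinear (quadratic attention likely)"
    else:
        regime = "worse than quadratic (check measurement noise)"
    return float(k), regime
```

A step function (chunked prefill) shows up as a poor fit in either regime; inspecting residuals at chunk boundaries is more informative there.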
Batching Detection
Production serving systems batch requests for throughput. Timing variance increases during high-traffic periods when requests share batch slots with other users. Measuring ITL variance across different times of day reveals batching behavior and load patterns.
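A minimal sketch of this measurement, assuming you have already collected `(unix_timestamp, itl_ns)` pairs from streaming requests spread across the day:

```python
import datetime
from collections import defaultdict

import numpy as np

def itl_variance_by_hour(samples):
    """Bucket (unix_timestamp, itl_ns) pairs by UTC hour and report CV per hour.

    Elevated CV during peak hours, relative to off-peak, suggests requests
    are sharing batch slots with other tenants under load.
    """
    buckets = defaultdict(list)
    for ts, itl in samples:
        hour = datetime.datetime.fromtimestamp(ts, datetime.timezone.utc).hour
        buckets[hour].append(itl)
    return {
        h: float(np.std(v) / np.mean(v))
        for h, v in sorted(buckets.items())
        if len(v) > 1
    }
```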
Cache-Based Attacks
LLM serving infrastructure uses multiple caching layers, each of which can leak information through timing differences.
KV-Cache Probing
If the serving infrastructure shares KV-cache across requests (common in multi-tenant deployments), an attacker can detect whether a specific prefix has been recently processed by another user.
1. Establish baseline cold-cache latency. Send a unique, never-before-seen prefix and measure time-to-first-token. Repeat 20+ times with different unique prefixes to establish the cold-cache distribution.
2. Probe for cached prefixes. Send prefixes you suspect other users have recently queried. Significantly lower TTFT indicates a cache hit, confirming another user sent that prefix.
3. Confirm with controlled experiments. Send the same prefix twice with varying delays. Map the cache eviction policy by measuring how TTFT changes as the delay increases.
```python
import time
from uuid import uuid4

import numpy as np

def measure_ttft(api_client, prompt):
    """Time-to-first-token for a single streaming request, in nanoseconds."""
    start = time.perf_counter_ns()
    next(api_client.stream(prompt, max_tokens=1))
    return time.perf_counter_ns() - start

def probe_kv_cache(api_client, target_prefix, n_probes=30):
    """Detect if a prefix exists in shared KV-cache."""
    cold_baselines = []
    for _ in range(n_probes):
        unique = f"Unique-prefix-{uuid4()}: explain quantum computing"
        cold_baselines.append(measure_ttft(api_client, unique))
    cold_median = np.median(cold_baselines)
    probe_times = []
    for _ in range(n_probes):
        probe_times.append(measure_ttft(api_client, target_prefix))
    probe_median = np.median(probe_times)
    cache_hit_ratio = cold_median / probe_median
    return {"cached": cache_hit_ratio > 1.5, "speedup": cache_hit_ratio}
```

Prompt Cache Fingerprinting
Many providers cache common system prompts. By probing with known system prompts from popular applications, an attacker can determine which applications share the same serving infrastructure -- and potentially identify the system prompts themselves.
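A sketch of this probe, reusing the TTFT-speedup logic from above. The `measure_ttft` callable, the candidate prompts, and the 1.5x speedup threshold are all placeholders, not real application prompts or a calibrated cutoff:

```python
import numpy as np

def fingerprint_prompt_cache(measure_ttft, candidate_prompts, cold_baseline_ns,
                             speedup_threshold=1.5, n_probes=10):
    """Probe suspected system prompts against a shared prompt cache.

    measure_ttft: callable(prompt) -> TTFT in nanoseconds
    candidate_prompts: {app_name: suspected_system_prompt}
    cold_baseline_ns: median TTFT for never-before-seen prefixes
    Returns apps whose prompts appear cached (TTFT well below cold baseline).
    """
    hits = {}
    for app, prompt in candidate_prompts.items():
        samples = [measure_ttft(prompt) for _ in range(n_probes)]
        speedup = cold_baseline_ns / np.median(samples)
        if speedup > speedup_threshold:
            hits[app] = float(speedup)
    return hits
```

Note the probe itself warms the cache, so each candidate should be measured in a single burst and compared only against the cold baseline, not re-probed later.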
Power and Electromagnetic Side Channels
When the attacker has physical proximity to the inference hardware (edge deployments, on-premise installations, or co-located cloud instances), electromagnetic and power analysis become viable.
Power Analysis
Model inference produces characteristic power consumption patterns that vary with architecture:
| Signal | What It Reveals | Required Access |
|---|---|---|
| Peak power draw | Approximate parameter count and active memory | Power monitoring (PDU-level) |
| Power trace periodicity | Layer count and structure | High-frequency power sampling |
| Power variance across inputs | Conditional computation (MoE, early exit) | Per-inference power traces |
| Memory bandwidth utilization | Model size relative to available VRAM | Hardware performance counters |
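The "power trace periodicity reveals layer count" row can be illustrated with a purely synthetic signal-processing sketch: if each transformer layer produces one repetition of a compute pattern per forward pass, the dominant frequency of the trace times the pass duration approximates the layer count. Real traces need filtering, alignment, and averaging that this toy version omits:

```python
import numpy as np

def estimate_layer_count(trace, sample_rate_hz, inference_time_s):
    """Estimate layer count from the dominant periodicity of a power trace.

    Assumes the strongest periodic component corresponds to the per-layer
    compute pattern repeating within one forward pass (a simplification).
    """
    trace = np.asarray(trace, dtype=float)
    trace = trace - trace.mean()  # remove DC offset
    spectrum = np.abs(np.fft.rfft(trace))
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / sample_rate_hz)
    dominant_hz = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
    # Repetitions per forward pass = repetition rate * pass duration
    return round(dominant_hz * inference_time_s)
```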
Electromagnetic Emanations
GPU and accelerator chips emit electromagnetic radiation correlated with computation patterns. With a near-field probe, an attacker can potentially recover:
- Matrix multiplication dimensions (revealing hidden layer sizes)
- Activation function types (ReLU vs. GeLU produce distinct EM signatures)
- Attention head count (parallel attention heads create periodic EM patterns)
Embedding Endpoint Exploitation
Many LLM APIs expose a separate embedding endpoint that returns vector representations. This endpoint is a high-value side channel because it directly reveals the model's internal representation space.
Dimensionality Recovery
Query the embedding endpoint with single tokens to recover the embedding matrix dimensions, which uniquely identifies model families:
| Embedding Dimension | Known Model Families |
|---|---|
| 768 | BERT-base, GPT-2 small, DistilBERT |
| 1024 | BERT-large, GPT-2 medium |
| 1536 | OpenAI text-embedding-ada-002 (projected) |
| 2048 | GPT-Neo 1.3B |
| 4096 | LLaMA 1/2 7B, LLaMA 3 8B, GPT-J |
| 5120 | LLaMA 1/2 13B |
| 8192 | LLaMA 2/3 70B |
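Recovering the dimension itself takes one call per probe token. The `api_client.embed` interface below mirrors the other snippets in this section and is an assumption about the provider's API, not a specific library:

```python
def recover_embedding_dimension(api_client, probe_tokens=("the", "a", "x")):
    """Return the embedding dimension, cross-checking several tokens.

    All tokens must agree; a mismatch would suggest per-request truncation
    or a variable-dimension (Matryoshka-style) embedding endpoint.
    """
    dims = {len(api_client.embed(t)) for t in probe_tokens}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return dims.pop()
```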
Singular Value Spectrum Analysis
The singular value decomposition (SVD) of the recovered embedding matrix produces a spectrum that acts as a fingerprint. Fine-tuned models share the bulk of their spectrum with their base model, allowing base model identification even after extensive fine-tuning.
```python
import numpy as np

def fingerprint_via_embeddings(api_client, vocab_tokens, n_tokens=5000):
    """Recover embedding matrix and compute SVD fingerprint."""
    embeddings = []
    for token in vocab_tokens[:n_tokens]:
        emb = api_client.embed(token)
        embeddings.append(emb)
    matrix = np.stack(embeddings)
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Normalized spectrum is the fingerprint
    return S / S[0]
```

Architecture Fingerprinting
Combining multiple side-channel signals produces a composite fingerprint that can identify the target model with high confidence.
Multi-Signal Fusion
Collect the following signals, each requiring only standard API access and timing measurement:
- ITL distribution: Mean, variance, CV, modality (dense vs. MoE)
- Prefill scaling curve: Linear, superlinear, or stepped
- Embedding dimension: From embedding endpoint if available
- Vocabulary size: Probe tokenizer behavior with known multi-byte sequences
- Context window: Binary search for maximum accepted prompt length
- Quantization artifacts: Logprob distributions show characteristic rounding patterns under quantization
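The context-window probe from the list above can be sketched as an exponential bracket followed by binary search. The `accepts` predicate wraps the API call and is assumed to return False when the provider rejects an over-length prompt:

```python
def find_context_window(accepts, lo=1, hi=1 << 21):
    """Binary-search the maximum accepted prompt length in tokens.

    accepts: callable(n_tokens) -> bool, True if a prompt of that length
    is processed rather than rejected with a context-length error.
    """
    # Grow hi until a rejection is observed, so the search is bracketed
    while accepts(hi):
        lo, hi = hi, hi * 2
    # Invariant: accepts(lo) is True, accepts(hi) is False
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if accepts(mid):
            lo = mid
        else:
            hi = mid
    return lo  # largest accepted length
```

Each probe costs one API call, so finding a 128k window from a 2M upper bound takes roughly 21 requests.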
Compare collected signals against a reference database of known models. The matching process uses weighted Euclidean distance across normalized signal vectors:
```python
def match_fingerprint(observed, reference_db, weights):
    """Rank reference models by weighted relative distance to observed signals."""
    scores = {}
    for model_name, ref_signals in reference_db.items():
        distance = 0.0
        for signal, weight in weights.items():
            obs = observed.get(signal, 0)
            ref = ref_signals.get(signal, 0)
            distance += weight * ((obs - ref) / max(ref, 1e-8)) ** 2
        scores[model_name] = 1.0 / (1.0 + distance)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```

Quantization Detection
Quantized models produce logprob distributions with characteristic artifacts. Under INT8 quantization, logprob values cluster around fewer distinct values than full-precision models. Under INT4 (GPTQ, AWQ), the clustering is even more pronounced.
```python
import numpy as np

def detect_quantization(logprob_samples, n_bins=1000):
    """Detect quantization from logprob histogram sparsity."""
    hist, _ = np.histogram(logprob_samples, bins=n_bins)
    occupied_bins = np.sum(hist > 0) / n_bins
    if occupied_bins < 0.05:
        return "INT4 (aggressive quantization)"
    elif occupied_bins < 0.15:
        return "INT8 (moderate quantization)"
    else:
        return "FP16/BF16 (full or near-full precision)"
```
Related Topics
- Model Extraction & IP Theft -- Parent overview of extraction threat landscape
- API-Based Model Extraction -- Direct extraction via API queries, often informed by side-channel findings
- Watermark & Fingerprint Evasion -- Evading ownership verification after extraction
- Advanced Reconnaissance -- Broader reconnaissance techniques applicable to model intelligence gathering
- LLM Internals -- Architecture fundamentals that side channels reveal
References
- Stealing Part of a Production Language Model (Carlini et al., 2024) -- Extracting hidden dimensions from production APIs
- Cache Me If You Can: Attacks on LLM Serving with Shared Caches (Pope et al., 2023) -- KV-cache timing attacks on shared infrastructure
- DeepSniffer: A DNN Model Extraction Framework Based on Learning Architectural Hints (Hu et al., 2020) -- Architecture recovery via side channels
- Hermes Attack: Steal DNN Models with Lossless Inference Accuracy (Zhu et al., 2021) -- GPU side-channel extraction
- Model Fingerprinting via Machine Learning (Cao et al., 2021) -- Fingerprinting techniques and defenses