Side-Channel Model Attacks
Deep dive into inferring model architecture, size, and deployment details through timing analysis, cache-based attacks, power/electromagnetic side channels, embedding endpoint exploitation, and architecture fingerprinting.
Side-channel attacks extract information about a model through observable properties of its execution environment rather than its intended API outputs. While API-based extraction targets the model's behavior, side-channel attacks target its infrastructure -- revealing architecture, size, serving topology, and hardware characteristics that inform and accelerate direct extraction campaigns.
Timing Analysis
Timing side channels are the most accessible class of side-channel attack. Every API response includes implicit timing information, and systematic measurement reveals architectural details that providers typically consider confidential.
Token Generation Timing
The time between successive tokens during autoregressive generation is a direct function of the model's forward pass computation. Measuring inter-token latency (ITL) across many requests produces a distribution that characterizes the model.
```python
import time

import numpy as np

def measure_itl_distribution(api_client, prompt, n_trials=200, max_tokens=128):
    """Collect inter-token latency samples via streaming API."""
    itl_samples = []
    for _ in range(n_trials):
        timestamps = []
        for chunk in api_client.stream(prompt, max_tokens=max_tokens):
            timestamps.append(time.perf_counter_ns())
        # Discard first token (prefill) and last (EOS overhead)
        for i in range(2, len(timestamps) - 1):
            itl_samples.append(timestamps[i] - timestamps[i - 1])
    return np.array(itl_samples)
```

Architecture Inference from ITL Variance
The statistical properties of the ITL distribution reveal fundamental architectural choices:
| ITL Pattern | Architecture Signal | Mechanism |
|---|---|---|
| Low variance (CV < 0.05) | Dense transformer | Fixed computation per token |
| Bimodal or high variance (CV > 0.15) | Mixture of Experts (MoE) | Different experts activated per token |
| Periodic spikes every N tokens | Sliding window attention | Cache eviction and recomputation at window boundary |
| Gradual ITL increase over sequence | Full quadratic attention | Per-token attention cost grows with KV-cache length (O(n) per step) |
| Constant ITL regardless of position | Linear attention or SSM | Constant-size recurrent state |
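The dense-vs-MoE distinction in the table can be automated as a rough heuristic. The sketch below computes the coefficient of variation and a crude two-cluster bimodality check; the separation threshold is an assumption, not a calibrated test statistic:

```python
import numpy as np

def classify_itl(itl_samples):
    """Heuristic architecture signal from an ITL sample array (nanoseconds).

    CV thresholds follow the table above; the bimodality check is a crude
    median split, not a formal statistical test.
    """
    samples = np.asarray(itl_samples, dtype=float)
    cv = samples.std() / samples.mean()  # coefficient of variation
    # Crude bimodality check: split at the median, compare cluster means
    median = np.median(samples)
    low, high = samples[samples <= median], samples[samples > median]
    separation = (high.mean() - low.mean()) / samples.std()
    if cv < 0.05:
        return {"cv": float(cv), "signal": "dense transformer (fixed per-token cost)"}
    if cv > 0.15 and separation > 1.0:
        return {"cv": float(cv), "signal": "possible MoE (bimodal ITL)"}
    return {"cv": float(cv), "signal": "inconclusive"}
```

A proper analysis would use a formal multimodality test (e.g. a dip test) and control for network jitter, but this captures the shape of the inference.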
Prefill Timing Analysis
Prefill time -- the delay before the first token -- scales with prompt length in ways that reveal attention mechanism details.
```python
def measure_prefill_scaling(api_client, base_prompt, lengths, n_trials=50):
    """Measure time-to-first-token across prompt lengths."""
    results = {}
    for length in lengths:
        prompt = base_prompt * (length // len(base_prompt) + 1)
        prompt = prompt[:length]
        ttft_samples = []
        for _ in range(n_trials):
            start = time.perf_counter_ns()
            response = api_client.stream(prompt, max_tokens=1)
            next(response)  # first token
            ttft_samples.append(time.perf_counter_ns() - start)
        results[length] = np.median(ttft_samples)
    return results
```

Fit the prefill-time-vs-length curve to distinguish:
- Linear scaling: Flash attention or other efficient attention variants
- Superlinear scaling: Standard quadratic attention without optimization
- Step function: Chunked prefill with fixed chunk sizes (reveals serving framework)
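One way to discriminate the first two regimes is a log-log fit of the measured TTFT medians: an exponent near 1 suggests linear prefill, near 2 suggests unoptimized quadratic attention. The regime boundaries below are assumptions chosen to tolerate measurement noise:

```python
import numpy as np

def fit_prefill_exponent(results):
    """Fit TTFT ~ length^k in log-log space and interpret the exponent k.

    `results` maps prompt length -> median TTFT, as produced by
    measure_prefill_scaling above. Boundaries (1.3, 2.3) are heuristic.
    """
    lengths = np.array(sorted(results))
    ttft = np.array([results[n] for n in lengths], dtype=float)
    k, _ = np.polyfit(np.log(lengths), np.log(ttft), 1)
    if k < 1.3:
        regime = "approximately linear (efficient attention likely)"
    elif k < 2.3:
        regime = "superlinear (quadratic attention likely)"
    else:
        regime = "worse than quadratic (check measurement noise)"
    return float(k), regime
```

A step function (chunked prefill) shows up as a poor fit in either regime; inspecting residuals at chunk boundaries is more informative there.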
Batching Detection
Production serving systems batch requests for throughput. Timing variance increases during high-traffic periods when requests share batch slots with other users. Measuring ITL variance across different times of day reveals batching behavior and load patterns.
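A minimal sketch of this measurement, assuming you have already collected `(unix_timestamp, itl_ns)` pairs from streaming requests spread across the day:

```python
import datetime
from collections import defaultdict

import numpy as np

def itl_variance_by_hour(samples):
    """Bucket (unix_timestamp, itl_ns) pairs by UTC hour and report CV per hour.

    Elevated CV during peak hours, relative to off-peak, suggests requests
    are sharing batch slots with other tenants under load.
    """
    buckets = defaultdict(list)
    for ts, itl in samples:
        hour = datetime.datetime.fromtimestamp(ts, datetime.timezone.utc).hour
        buckets[hour].append(itl)
    return {
        h: float(np.std(v) / np.mean(v))
        for h, v in sorted(buckets.items())
        if len(v) > 1
    }
```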
Cache-Based Attacks
LLM serving infrastructure uses multiple caching layers, each of which can leak information through timing differences.
KV-Cache Probing
If the serving infrastructure shares KV-cache across requests (common in multi-tenant deployments), an attacker can detect whether a specific prefix has been recently processed by another user.
1. Establish baseline cold-cache latency. Send a unique, never-before-seen prefix and measure time-to-first-token. Repeat 20+ times with different unique prefixes to establish the cold-cache distribution.
2. Probe for cached prefixes. Send prefixes you suspect other users have recently queried. Significantly lower TTFT indicates a cache hit, confirming another user sent that prefix.
3. Confirm with controlled experiments. Send the same prefix twice with varying delays. Map the cache eviction policy by measuring how TTFT changes as the delay increases.
```python
import time
from uuid import uuid4

import numpy as np

def measure_ttft(api_client, prompt):
    """Time-to-first-token for a single streaming request, in nanoseconds."""
    start = time.perf_counter_ns()
    next(api_client.stream(prompt, max_tokens=1))
    return time.perf_counter_ns() - start

def probe_kv_cache(api_client, target_prefix, n_probes=30):
    """Detect if a prefix exists in shared KV-cache."""
    cold_baselines = []
    for _ in range(n_probes):
        unique = f"Unique-prefix-{uuid4()}: explain quantum computing"
        cold_baselines.append(measure_ttft(api_client, unique))
    cold_median = np.median(cold_baselines)
    probe_times = []
    for _ in range(n_probes):
        probe_times.append(measure_ttft(api_client, target_prefix))
    probe_median = np.median(probe_times)
    cache_hit_ratio = cold_median / probe_median
    return {"cached": cache_hit_ratio > 1.5, "speedup": cache_hit_ratio}
```

Prompt Cache Fingerprinting
Many providers cache common system prompts. By probing with known system prompts from popular applications, an attacker can determine which applications share the same serving infrastructure -- and potentially identify the system prompts themselves.
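A sketch of this probe, reusing the TTFT-speedup logic from above. The `measure_ttft` callable, the candidate prompts, and the 1.5x speedup threshold are all placeholders, not real application prompts or a calibrated cutoff:

```python
import numpy as np

def fingerprint_prompt_cache(measure_ttft, candidate_prompts, cold_baseline_ns,
                             speedup_threshold=1.5, n_probes=10):
    """Probe suspected system prompts against a shared prompt cache.

    measure_ttft: callable(prompt) -> TTFT in nanoseconds
    candidate_prompts: {app_name: suspected_system_prompt}
    cold_baseline_ns: median TTFT for never-before-seen prefixes
    Returns apps whose prompts appear cached (TTFT well below cold baseline).
    """
    hits = {}
    for app, prompt in candidate_prompts.items():
        samples = [measure_ttft(prompt) for _ in range(n_probes)]
        speedup = cold_baseline_ns / np.median(samples)
        if speedup > speedup_threshold:
            hits[app] = float(speedup)
    return hits
```

Note the probe itself warms the cache, so each candidate should be measured in a single burst and compared only against the cold baseline, not re-probed later.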
Power and Electromagnetic Side Channels
When the attacker has physical proximity to the inference hardware (edge deployments, on-premise installations, or co-located cloud instances), electromagnetic and power analysis become viable.
Power Analysis
Model inference produces characteristic power consumption patterns that vary with architecture:
| Signal | What It Reveals | Required Access |
|---|---|---|
| Peak power draw | Approximate parameter count and active memory | Power monitoring (PDU-level) |
| Power trace periodicity | Layer count and structure | High-frequency power sampling |
| Power variance across inputs | Conditional computation (MoE, early exit) | Per-inference power traces |
| Memory bandwidth utilization | Model size relative to available VRAM | Hardware performance counters |
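The "power trace periodicity reveals layer count" row can be illustrated with a purely synthetic signal-processing sketch: if each transformer layer produces one repetition of a compute pattern per forward pass, the dominant frequency of the trace times the pass duration approximates the layer count. Real traces need filtering, alignment, and averaging that this toy version omits:

```python
import numpy as np

def estimate_layer_count(trace, sample_rate_hz, inference_time_s):
    """Estimate layer count from the dominant periodicity of a power trace.

    Assumes the strongest periodic component corresponds to the per-layer
    compute pattern repeating within one forward pass (a simplification).
    """
    trace = np.asarray(trace, dtype=float)
    trace = trace - trace.mean()  # remove DC offset
    spectrum = np.abs(np.fft.rfft(trace))
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / sample_rate_hz)
    dominant_hz = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
    # Repetitions per forward pass = repetition rate * pass duration
    return round(dominant_hz * inference_time_s)
```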
Electromagnetic Emanations
GPU and accelerator chips emit electromagnetic radiation correlated with computation patterns. With a near-field probe, an attacker can potentially recover:
- Matrix multiplication dimensions (revealing hidden layer sizes)
- Activation function types (ReLU vs. GeLU produce distinct EM signatures)
- Attention head count (parallel attention heads create periodic EM patterns)
Embedding Endpoint Exploitation
Many LLM APIs expose a separate embedding endpoint that returns vector representations. This endpoint is a high-value side channel because it directly reveals the model's internal representation space.
Dimensionality Recovery
Query the embedding endpoint with single tokens to recover the embedding matrix dimensions, which uniquely identifies model families:
| Embedding Dimension | Known Model Families |
|---|---|
| 768 | BERT-base, GPT-2 small, DistilBERT |
| 1024 | BERT-large, GPT-2 medium |
| 1536 | OpenAI text-embedding-ada-002 (projected) |
| 2048 | GPT-Neo 1.3B |
| 4096 | LLaMA 1/2 7B, LLaMA 3 8B, GPT-J |
| 5120 | LLaMA 1/2 13B |
| 8192 | LLaMA 2/3 70B |
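Recovering the dimension itself takes one call per probe token. The `api_client.embed` interface below mirrors the other snippets in this section and is an assumption about the provider's API, not a specific library:

```python
def recover_embedding_dimension(api_client, probe_tokens=("the", "a", "x")):
    """Return the embedding dimension, cross-checking several tokens.

    All tokens must agree; a mismatch would suggest per-request truncation
    or a variable-dimension (Matryoshka-style) embedding endpoint.
    """
    dims = {len(api_client.embed(t)) for t in probe_tokens}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return dims.pop()
```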
Singular Value Spectrum Analysis
The singular value decomposition (SVD) of the recovered embedding matrix produces a spectrum that acts as a fingerprint. Fine-tuned models share the bulk of their spectrum with their base model, allowing base model identification even after extensive fine-tuning.
```python
import numpy as np

def fingerprint_via_embeddings(api_client, vocab_tokens, n_tokens=5000):
    """Recover embedding matrix and compute SVD fingerprint."""
    embeddings = []
    for token in vocab_tokens[:n_tokens]:
        emb = api_client.embed(token)
        embeddings.append(emb)
    matrix = np.stack(embeddings)
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Normalized spectrum is the fingerprint
    return S / S[0]
```

Architecture Fingerprinting
Combining multiple side-channel signals produces a composite fingerprint that can identify the target model with high confidence.
Multi-Signal Fusion
Collect the following signals, each requiring only standard API access and timing measurement:
- ITL distribution: Mean, variance, CV, modality (dense vs. MoE)
- Prefill scaling curve: Linear, superlinear, or stepped
- Embedding dimension: From embedding endpoint if available
- Vocabulary size: Probe tokenizer behavior with known multi-byte sequences
- Context window: Binary search for maximum accepted prompt length
- Quantization artifacts: Logprob distributions show characteristic rounding patterns under quantization
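The context-window probe from the list above can be sketched as an exponential bracket followed by binary search. The `accepts` predicate wraps the API call and is assumed to return False when the provider rejects an over-length prompt:

```python
def find_context_window(accepts, lo=1, hi=1 << 21):
    """Binary-search the maximum accepted prompt length in tokens.

    accepts: callable(n_tokens) -> bool, True if a prompt of that length
    is processed rather than rejected with a context-length error.
    """
    # Grow hi until a rejection is observed, so the search is bracketed
    while accepts(hi):
        lo, hi = hi, hi * 2
    # Invariant: accepts(lo) is True, accepts(hi) is False
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if accepts(mid):
            lo = mid
        else:
            hi = mid
    return lo  # largest accepted length
```

Each probe costs one API call, so finding a 128k window from a 2M upper bound takes roughly 21 requests.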
Compare collected signals against a reference database of known models. The matching process uses weighted Euclidean distance across normalized signal vectors:
```python
def match_fingerprint(observed, reference_db, weights):
    """Rank reference models by weighted relative distance to observed signals."""
    scores = {}
    for model_name, ref_signals in reference_db.items():
        distance = 0.0
        for signal, weight in weights.items():
            obs = observed.get(signal, 0)
            ref = ref_signals.get(signal, 0)
            distance += weight * ((obs - ref) / max(ref, 1e-8)) ** 2
        scores[model_name] = 1.0 / (1.0 + distance)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```

Quantization Detection
Quantized models produce logprob distributions with characteristic artifacts. Under INT8 quantization, logprob values cluster around fewer distinct values than full-precision models. Under INT4 (GPTQ, AWQ), the clustering is even more pronounced.
```python
import numpy as np

def detect_quantization(logprob_samples, n_bins=1000):
    """Detect quantization from logprob histogram sparsity."""
    hist, _ = np.histogram(logprob_samples, bins=n_bins)
    occupied_bins = np.sum(hist > 0) / n_bins
    if occupied_bins < 0.05:
        return "INT4 (aggressive quantization)"
    elif occupied_bins < 0.15:
        return "INT8 (moderate quantization)"
    else:
        return "FP16/BF16 (full or near-full precision)"
```
Related Topics
- Model Extraction & IP Theft -- Parent overview of extraction threat landscape
- API-Based Model Extraction -- Direct extraction via API queries, often informed by side-channel findings
- Watermark & Fingerprint Evasion -- Evading ownership verification after extraction
- Advanced Reconnaissance -- Broader reconnaissance techniques applicable to model intelligence gathering
- LLM Internals -- Architecture fundamentals that side channels reveal
References
- Stealing Part of a Production Language Model (Carlini et al., 2024) -- Extracting hidden dimensions from production APIs
- Cache Me If You Can: Attacks on LLM Serving with Shared Caches (Pope et al., 2023) -- KV-cache timing attacks on shared infrastructure
- DeepSniffer: A DNN Model Extraction Framework Based on Learning Architectural Hints (Hu et al., 2020) -- Architecture recovery via side channels
- Hermes Attack: Steal DNN Models with Lossless Inference Accuracy (Zhu et al., 2021) -- GPU side-channel extraction
- Model Fingerprinting via Machine Learning (Cao et al., 2021) -- Fingerprinting techniques and defenses