KV Cache Poisoning & Exploitation
How KV cache works in transformer inference, cache poisoning across requests in shared deployments, prefix caching attacks, and cross-tenant data leakage.
The KV cache is a fundamental optimization in transformer inference. By caching attention key and value tensors from previous tokens, the model avoids recomputing them during autoregressive generation. In multi-tenant deployments, KV cache sharing creates a cross-tenant attack surface that operates below the application layer.
How KV Cache Works
During autoregressive generation, each new token attends to all previous tokens. Without caching, this requires recomputing key and value projections for the entire sequence at each step:
```python
# Without KV cache: O(n^2) total projection compute for n tokens
for t in range(seq_len):
    # Recompute K, V for ALL tokens 0..t at every step
    K = X[:, :t+1, :] @ W_k   # [batch, t+1, d_k]
    V = X[:, :t+1, :] @ W_v   # [batch, t+1, d_v]
    Q = X[:, t:t+1, :] @ W_q  # [batch, 1, d_k]
    output = attention(Q, K, V)
```
```python
# With KV cache: O(n) total projection compute
kv_cache = {}
for t in range(seq_len):
    # Only compute K, V for the NEW token
    k_new = X[:, t:t+1, :] @ W_k
    v_new = X[:, t:t+1, :] @ W_v
    if 'K' in kv_cache:
        kv_cache['K'] = torch.cat([kv_cache['K'], k_new], dim=1)
        kv_cache['V'] = torch.cat([kv_cache['V'], v_new], dim=1)
    else:
        kv_cache['K'], kv_cache['V'] = k_new, v_new
    Q = X[:, t:t+1, :] @ W_q
    output = attention(Q, kv_cache['K'], kv_cache['V'])
```
Cache Memory Requirements
| Model Size | Layers | Heads | d_k | Cache per Token | 4K Context Cache |
|---|---|---|---|---|---|
| 7B | 32 | 32 | 128 | 256 KB | 1 GB |
| 70B | 80 | 64 | 128 | 1.3 MB | 5.2 GB |
| 405B | 126 | 128 | 128 | 4.1 MB | 16.4 GB |
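The per-token figures above follow from the standard formula: each layer stores two tensors (K and V), each n_kv_heads × d_head elements per token. A minimal sketch, assuming an 8-bit cache element (which reproduces the 7B row; fp16 doubles these figures, and grouped-query attention shrinks them by caching fewer KV heads than attention heads):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             d_head: int, bytes_per_elem: int = 1) -> int:
    """KV cache footprint per token: 2 tensors (K and V) per layer,
    each n_kv_heads * d_head elements of bytes_per_elem each."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem

# 7B-class model: 32 layers, 32 KV heads, d_head 128, 8-bit elements
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, d_head=128)
print(per_token // 1024, "KB per token")        # 256 KB per token
print(per_token * 4096 / 2**30, "GB at 4K ctx") # 1.0 GB at 4K ctx
```

The 4K-context column is simply the per-token figure multiplied by the context length, which is why long contexts dominate GPU memory in multi-tenant serving.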
Attack Vector 1: Prefix Cache Poisoning
Prefix caching (used by vLLM, TGI, and most production inference frameworks) stores KV states for common prefixes and reuses them across requests. This creates shared state between requests.
The Attack
If an attacker can influence the cached prefix state, all subsequent requests that reuse that prefix inherit the poisoned attention context:
1. Identify the shared prefix -- Determine which system prompt or prefix is shared across requests (often the system prompt is identical for all users of a service).
2. Craft a poisoning request -- Send a request that, during its processing, modifies the cached KV state for the shared prefix. This requires the inference framework to improperly handle cache invalidation.
3. Subsequent requests inherit the poisoned state -- Other users' requests that reuse the cached prefix now have attention context that includes the attacker's influence.
```python
# Conceptual: how prefix caching works in vLLM-style systems
class PrefixCache:
    def __init__(self):
        self.cache = {}  # hash(prefix_tokens) -> kv_states

    def get_or_compute(self, prefix_tokens, model):
        key = hash(tuple(prefix_tokens))
        if key not in self.cache:
            # Compute and cache KV states for this prefix
            self.cache[key] = model.compute_kv(prefix_tokens)
        return self.cache[key]

# VULNERABILITY: if cache entries are mutable references,
# a request that modifies kv_states in-place will poison
# all future requests sharing that prefix
```
Attack Vector 2: Cross-Tenant KV Cache Leakage
In multi-tenant deployments where multiple users share GPU memory, KV cache reuse can leak information between tenants.
Information Leakage Channels
| Channel | Mechanism | Information Leaked |
|---|---|---|
| Cache hit timing | Shared prefixes produce faster responses | Whether another tenant used the same prefix |
| Memory reuse | Uncleared GPU memory from previous tenant | Fragments of previous KV states |
| Capacity contention | One tenant's long context evicts another's cache | Timing-based inference of other tenant's activity |
Timing Side-Channel Attack
```python
import asyncio
import time

import httpx

async def probe_prefix_cache(client: httpx.AsyncClient,
                             api_url: str, test_prefix: str) -> float:
    """Measure response latency to detect cached prefixes.
    A cache hit (shared prefix) produces noticeably lower latency."""
    start = time.perf_counter()
    await client.post(api_url, json={
        "prompt": test_prefix + " Continue.",
        "max_tokens": 1,  # minimize generation time
    })
    return time.perf_counter() - start

# Compare latencies for candidate system prompts.
# Lower latency = likely a cached prefix = another tenant's system prompt
async def probe_candidates(api_url: str):
    candidates = [
        "You are a helpful assistant for AcmeCorp...",
        "You are a financial advisor. Never disclose...",
        "You are a medical chatbot. Always recommend...",
    ]
    async with httpx.AsyncClient() as client:
        for prefix in candidates:
            latency = await probe_prefix_cache(client, api_url, prefix)
            print(f"Latency: {latency:.4f}s - {prefix[:50]}...")
```
Attack Vector 3: PagedAttention Exploitation
PagedAttention (used in vLLM) manages KV cache memory with a page table analogous to OS virtual memory. This enables memory sharing but introduces a page-level attack surface:
Page Table Manipulation
- Copy-on-write bypass -- If the page table does not properly implement COW semantics, writes from one sequence can affect shared pages
- Page eviction attacks -- Force eviction of critical cache pages by generating many concurrent long sequences, causing cache thrashing for other tenants
- Fragment analysis -- Allocated but uninitialized pages may contain KV residues from previous requests
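The copy-on-write concern in the first bullet can be made concrete with a toy block table (illustrative classes, not vLLM's actual implementation): if a sequence writes to a shared page without copying it first, every other sequence mapping that page sees the write.

```python
# Toy page table illustrating the copy-on-write (COW) concern.
# Page / SequenceBlocks are illustrative names, not vLLM's real API.
class Page:
    def __init__(self, size=4):
        self.slots = [None] * size  # stand-in for per-token KV entries
        self.ref_count = 1          # number of sequences mapping this page

class SequenceBlocks:
    def __init__(self, pages):
        self.pages = pages

def write_unsafe(seq, page_idx, slot, value):
    # BUG: writes through to a page that other sequences may share
    seq.pages[page_idx].slots[slot] = value

def write_cow(seq, page_idx, slot, value):
    page = seq.pages[page_idx]
    if page.ref_count > 1:              # shared page: copy before writing
        page.ref_count -= 1
        copy = Page(len(page.slots))
        copy.slots = list(page.slots)
        seq.pages[page_idx] = page = copy
    page.slots[slot] = value

# Two sequences share a cached prefix page
shared = Page(); shared.ref_count = 2
a, b = SequenceBlocks([shared]), SequenceBlocks([shared])
write_unsafe(a, 0, 0, "poison")
print(b.pages[0].slots[0])  # "poison" -- leaked into sequence b

shared2 = Page(); shared2.ref_count = 2
c, d = SequenceBlocks([shared2]), SequenceBlocks([shared2])
write_cow(c, 0, 0, "local")
print(d.pages[0].slots[0])  # None -- d's view is unaffected
```

With correct COW semantics, the shared page is duplicated before the first write, so one tenant's generation can never mutate another tenant's cached prefix.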
```python
# Detect potential KV cache residue from previous requests
# by analyzing attention patterns for anomalous historical context.
# get_attention_weights and compute_kl_divergence are hypothetical
# helpers, not library functions.
def detect_cache_residue(model, clean_prompt, suspicious_prompt):
    """Compare attention patterns between clean and potentially
    contaminated cache states."""
    clean_attn = get_attention_weights(model, clean_prompt, use_cache=False)
    cached_attn = get_attention_weights(model, suspicious_prompt, use_cache=True)
    # Residue manifests as attention to positions beyond the current prompt
    divergence = compute_kl_divergence(clean_attn, cached_attn)
    return divergence
```
Defense: Cache Isolation Architecture
Isolation Levels
| Level | Mechanism | Performance Cost | Security |
|---|---|---|---|
| No isolation | Shared cache, shared pages | Baseline | None |
| Prefix isolation | Separate cache per system prompt | 10-20% memory overhead | Medium |
| Tenant isolation | Separate cache pool per tenant | 30-50% memory overhead | High |
| Request isolation | No cache reuse between requests | 2-5x latency increase | Maximum |
Implementation Checklist
- Immutable cache entries -- Store cached KV states as read-only tensors; never allow in-place modification
- Cache key includes tenant ID -- Prevent cross-tenant cache hits even for identical prefixes
- Memory zeroing -- Clear GPU memory pages before reallocation to prevent residue leakage
- Cache hit monitoring -- Log cache hit rates per tenant to detect probing attacks
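The first two checklist items can be sketched together (a hypothetical wrapper, not a specific framework's API): the cache key hashes the tenant ID alongside the prefix tokens, and computed KV states are frozen before being stored so later requests cannot mutate them in place.

```python
import hashlib

class TenantPrefixCache:
    """Prefix cache with tenant-scoped keys and immutable entries.
    Hypothetical sketch; real frameworks key on token-block hashes."""
    def __init__(self):
        self._cache = {}

    def _key(self, tenant_id: str, prefix_tokens) -> str:
        # Tenant ID is part of the key: identical prefixes from
        # different tenants never share a cache entry.
        h = hashlib.sha256()
        h.update(tenant_id.encode())
        h.update(b"|")
        h.update(str(list(prefix_tokens)).encode())
        return h.hexdigest()

    def get_or_compute(self, tenant_id, prefix_tokens, compute_kv):
        key = self._key(tenant_id, prefix_tokens)
        if key not in self._cache:
            # Freeze the computed states (tuple of tuples here; with
            # real tensors, store read-only copies) so no request can
            # modify the shared entry in place.
            kv = compute_kv(prefix_tokens)
            self._cache[key] = tuple(tuple(layer) for layer in kv)
        return self._cache[key]

cache = TenantPrefixCache()
fake_kv = lambda toks: [[0.1, 0.2], [0.3, 0.4]]  # stand-in for model.compute_kv
a = cache.get_or_compute("tenant-a", [1, 2, 3], fake_kv)
b = cache.get_or_compute("tenant-b", [1, 2, 3], fake_kv)
print(a is b)  # False: same prefix, different tenants, separate entries
```

Tenant-scoped keys trade some hit rate for isolation, which is the "Tenant isolation" row in the table above; the frozen entries address the mutable-reference vulnerability from Attack Vector 1.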
Related Topics
- Model Architecture Attack Vectors -- Architecture attack surface overview
- Context Window Exploitation -- Context-level attacks that interact with KV cache
- Inference Optimization Attacks -- Batching and speculative decoding attacks
- API Security -- Application-layer defenses for LLM APIs
References
- Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023) -- PagedAttention / vLLM
- SGLang: Efficient Execution of Structured Language Model Programs (Zheng et al., 2023) -- RadixAttention prefix caching