KV Cache & Prompt Caching Attacks
How KV cache poisoning, prefix caching exploitation, cache timing side channels, and multi-tenant isolation failures create attack vectors in LLM serving infrastructure.
Overview
Key-Value (KV) caching is a fundamental optimization in transformer-based LLM inference. During autoregressive generation, each new token requires attending to all previous tokens. Without caching, every generation step would recompute the key and value projections for all prior tokens, making total projection work O(n^2) over a sequence of n tokens and prohibitively expensive for long sequences. KV caching stores the computed key-value pairs so each token's projections are calculated only once, leaving O(n) attention work per generated token.
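The asymptotics can be illustrated with a toy operation count (pure bookkeeping, no real model):

```python
def projection_ops(n_tokens: int, cached: bool) -> int:
    """Count K/V projection computations needed to generate n_tokens."""
    if cached:
        # Each token's K/V is projected once and stored for reuse
        return n_tokens
    # Without a cache, every step recomputes K/V for all prior tokens
    return sum(step for step in range(1, n_tokens + 1))

print(projection_ops(4096, cached=False))  # 8390656 (~n^2 / 2)
print(projection_ops(4096, cached=True))   # 4096
```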
Prompt caching extends this optimization across requests. When multiple API calls share a common prefix (such as a system prompt), the KV cache entries for that prefix can be computed once and reused across all requests that share it. This is a critical cost optimization for API providers: a system prompt that appears in millions of requests per day needs to be processed through the model's layers only once. Providers including OpenAI, Anthropic, and Google have deployed various forms of prompt caching to reduce inference costs by 50-90% for requests with shared prefixes.
However, caching introduces shared state into what should be an isolated computation. When two requests share cached KV entries, there is an implicit information channel between them. If the cache is not properly isolated, this channel can be exploited to leak information (timing side channels), influence other users' outputs (cache poisoning), or infer details about the serving infrastructure (cache probing). These attacks target the infrastructure layer rather than the model itself, placing them in a blind spot for safety teams focused on prompt-level and model-level attacks.
The risk is particularly acute in multi-tenant serving environments where multiple customers share GPU resources and, potentially, KV cache memory. Cost pressures incentivize maximal cache sharing; security requires strict isolation. This tension creates a persistent attack surface that grows as providers optimize for efficiency.
How It Works
Understand the KV cache architecture
In a transformer with L layers and H attention heads, the KV cache stores two tensors per layer per head for each token position. For a model with 32 layers, 32 heads, and a head dimension of 128 running in FP16:
```
Cache size per token = 2 (K and V) x 32 layers x 32 heads x 128 dim x 2 bytes
                     = 524,288 bytes ≈ 0.5 MB per token

For a 4096-token context:
Total KV cache = 4096 x 0.5 MB ≈ 2 GB per request
```

This memory pressure drives cache sharing. A single GPU with 80 GB of memory can serve only ~40 concurrent requests at full context length. Prefix caching reduces this by sharing the KV entries for common prefixes across concurrent requests.
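The arithmetic above can be checked with a short helper (the model dimensions are the example's, not any particular model's):

```python
def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_param: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer per head per token."""
    return 2 * n_layers * n_heads * head_dim * bytes_per_param * n_tokens

per_token = kv_cache_bytes(32, 32, 128, n_tokens=1)      # 524,288 bytes = 0.5 MB
full_context = kv_cache_bytes(32, 32, 128, n_tokens=4096)
print(per_token, full_context // 2**30)                  # 524288 2 (GiB)
```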
Cache timing side channels
When a request hits the prompt cache (its prefix matches cached KV entries), the time-to-first-token (TTFT) is significantly shorter because the prefix does not need to be processed through the model. This timing difference is observable by the API caller.
```python
import time

def measure_ttft(client, system_prompt, user_message):
    """Measure time-to-first-token to detect cache hits."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        max_tokens=1,
        stream=True,
    )
    for chunk in response:
        if chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start
            break
    return ttft

# Probe whether a specific system prompt is cached
# (i.e., whether another user is using this system prompt)
candidate_prompts = [
    "You are a helpful financial advisor for AcmeCorp...",
    "You are a medical assistant for HealthCo...",
    "You are a legal advisor for LawFirm...",
]

for prompt in candidate_prompts:
    ttft = measure_ttft(client, prompt, "Hello")
    print(f"TTFT: {ttft:.3f}s — {'CACHE HIT' if ttft < 0.5 else 'CACHE MISS'}")
```

A cache hit reveals that another user (or the same provider) is actively using that system prompt, leaking information about the provider's customer base and their system configurations.
Prefix probing for system prompt inference
By systematically testing candidate system prompt prefixes and measuring TTFT, an attacker can reconstruct another user's system prompt character by character (or token by token). This is analogous to a padding oracle attack in cryptography.
```python
def probe_system_prompt_prefix(client, known_prefix, candidate_tokens):
    """
    Given a known prefix of a system prompt, determine the next token
    by testing candidates and measuring cache timing.
    """
    results = []
    for token in candidate_tokens:
        test_prefix = known_prefix + token
        # Measure TTFT multiple times for statistical significance
        timings = [
            measure_ttft(client, test_prefix, "test")
            for _ in range(10)
        ]
        avg_ttft = sum(timings) / len(timings)
        results.append((token, avg_ttft))

    # The candidate that produces the shortest TTFT likely extends
    # the cached prefix — it matches more of the cached KV entries
    results.sort(key=lambda x: x[1])
    most_likely_next = results[0][0]
    return most_likely_next
```

This attack requires that the target system prompt is actively being cached (i.e., the target system is actively receiving requests) and that the attacker can make requests through the same serving infrastructure.
KV cache poisoning in shared environments
In multi-tenant deployments where cache entries are shared without proper isolation, an attacker may be able to inject malicious KV entries that affect other users' requests.
The attack targets the prefix matching mechanism. If the serving infrastructure identifies cache entries by prefix hash rather than by full content verification, hash collisions could cause one user's cached prefix to be served for a different user's request. While hash collisions are rare, targeted collision attacks against weak hash functions are feasible.
More practically, if the cache key includes only the system prompt and not a tenant identifier, then two tenants using the same system prompt share KV cache entries. If one tenant has manipulated the serving infrastructure (e.g., through adversarial batching), the corrupted cache entries affect both tenants.
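The difference between prefix-only and tenant-scoped cache keys can be sketched in a few lines (the key-derivation scheme here is illustrative, not taken from any specific serving framework):

```python
import hashlib

def cache_key_shared(system_prompt: str) -> str:
    """Flawed: key derived from prefix content only."""
    return hashlib.sha256(system_prompt.encode()).hexdigest()

def cache_key_isolated(tenant_id: str, system_prompt: str) -> str:
    """Safer: tenant ID bound into the key, so tenants never share entries."""
    return hashlib.sha256(f"{tenant_id}\x00{system_prompt}".encode()).hexdigest()

prompt = "You are a helpful assistant for AcmeCorp."
# Two tenants with the same prompt collide under prefix-only keying...
assert cache_key_shared(prompt) == cache_key_shared(prompt)
# ...but get distinct entries when the tenant ID is part of the key
assert cache_key_isolated("tenant-a", prompt) != cache_key_isolated("tenant-b", prompt)
```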
Attack flow:

1. Attacker identifies target system prompt (via timing side channel)
2. Attacker sends request with same system prompt + adversarial suffix
3. If the cache key is computed on the shared prefix, the attacker's suffix processing may influence cached entries for the prefix
4. Subsequent requests from the target that hit the cache receive subtly corrupted KV values

Cache eviction attacks for denial of service
An attacker can deliberately evict cached entries belonging to other users by flooding the cache with unique prefixes, forcing cache replacement. This is a denial-of-service attack against the caching infrastructure that degrades performance for targeted users.
```python
import random
import string

def cache_eviction_attack(client, n_requests=10000):
    """
    Flood the prompt cache with unique prefixes to evict
    other users' cached entries.
    """
    for i in range(n_requests):
        # Generate a unique system prompt that won't match any cache entry
        random_suffix = ''.join(
            random.choices(string.ascii_letters, k=100)
        )
        unique_prompt = f"You are assistant #{i}. ID: {random_suffix}"
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": unique_prompt},
                {"role": "user", "content": "Hi"},
            ],
            max_tokens=1,
        )
        # Each unique prefix allocates new cache entries,
        # potentially evicting other users' cached prefixes
```
Attack Examples
Example 1: System Prompt Extraction via Cache Timing
A security researcher targets an application built on a commercial LLM API. The application uses a detailed system prompt that includes proprietary instructions, customer data schemas, and tool definitions. By making API calls with candidate system prompt prefixes through the same provider and measuring TTFT, the researcher reconstructs the first 500 tokens of the system prompt over approximately 50,000 API calls. The reconstructed prompt reveals tool definitions, allowed actions, and data schemas that constitute intellectual property and provide attack surface information for further exploitation.
This attack was demonstrated in principle by researchers in 2024 who showed that Anthropic's prompt caching feature, which offers a 90% cost reduction for cached prefixes, creates a measurable timing differential that can be used for prefix inference. The attack's feasibility depends on the provider's cache architecture — specifically, whether cache entries are shared across customers and whether timing information is observable.
Example 2: Multi-Tenant Cache Isolation Failure
In a self-hosted deployment using vLLM or TensorRT-LLM with prefix caching enabled, two customers share the same GPU cluster. Customer A uses a system prompt that includes sensitive access control rules. Customer B discovers that by using the same system prompt prefix and observing generation behavior, they can infer whether the KV cache entries from Customer A's requests are being reused. When cache isolation is not enforced at the tenant level, shared cache entries create an implicit information channel between tenants.
Example 3: Cache-Aware Adversarial Batching
An attacker who has access to batch inference endpoints crafts a batch of requests designed to fill specific KV cache positions with adversarial values. If the serving framework uses PagedAttention (as in vLLM) and does not properly isolate page tables between requests in a batch, the adversarial KV values from one request in the batch may influence attention computation for other requests processed in the same batch. This attack is theoretically possible but has not been demonstrated at scale.
Detection & Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Tenant-level cache isolation | Include tenant ID in cache keys; never share KV entries across tenants | High — eliminates cross-tenant information leakage but reduces cache efficiency |
| Timing noise injection | Add random delay to time-to-first-token to mask cache hit/miss signals | Medium — makes timing attacks noisier but does not eliminate them with sufficient samples |
| Cache key integrity verification | Verify full prefix content matches (not just hash) when serving cached KV entries | High — prevents collision-based poisoning but adds latency |
| Rate limiting on unique prefixes | Limit the number of unique system prompts per API key within a time window | Medium — slows cache probing and eviction attacks |
| TTFT normalization | Serve all responses with a minimum TTFT that matches cache-miss latency | High — eliminates timing signal entirely but sacrifices the latency benefit of caching |
| Cache partitioning | Allocate dedicated cache partitions per tenant with guaranteed capacity | High — prevents eviction attacks but requires more memory |
| Monitoring cache access patterns | Alert on anomalous cache access patterns (systematic prefix probing, high unique prefix rates) | Medium — detects attacks in progress but does not prevent initial probing |
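The TTFT-normalization row above can be sketched as a wrapper around first-token generation (the `min_ttft` threshold and wrapper are illustrative assumptions, not any provider's implementation):

```python
import time

def serve_with_normalized_ttft(generate_first_token, min_ttft: float = 0.8):
    """
    Delay the first token until at least min_ttft seconds have elapsed,
    so cache hits and cache misses are indistinguishable to the caller.
    """
    start = time.perf_counter()
    token = generate_first_token()  # fast on cache hit, slow on miss
    elapsed = time.perf_counter() - start
    if elapsed < min_ttft:
        time.sleep(min_ttft - elapsed)  # pad cache hits up to miss latency
    return token

# A near-instant cache hit still takes at least min_ttft to return
t0 = time.perf_counter()
serve_with_normalized_ttft(lambda: "Hello", min_ttft=0.1)
assert time.perf_counter() - t0 >= 0.1
```

The cost of this defense is exactly the latency benefit caching was meant to provide, which is why providers more often settle for timing noise or rate limits.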
Key Considerations
Efficiency and security are directly at odds. The entire value of prompt caching comes from sharing computation across requests. Every security measure that restricts sharing — tenant isolation, cache partitioning, TTFT normalization — reduces the efficiency benefit. Providers must explicitly decide where on the efficiency-security tradeoff they sit and document this for customers.
Side channels are inherent to shared infrastructure. Timing side channels are not a bug in the cache implementation; they are a fundamental property of shared computation. Any shared resource — cache, GPU, network — creates potential side channels. Eliminating all side channels requires dedicated infrastructure per tenant, which many providers cannot afford.
Self-hosted deployments are at higher risk. Organizations running their own LLM serving infrastructure (using frameworks like vLLM, TensorRT-LLM, or text-generation-inference) often enable prefix caching for cost savings without implementing tenant isolation. These deployments may share KV cache across all requests on a GPU, creating a larger attack surface than managed API services which typically implement some degree of isolation.
The PagedAttention paradigm introduces new risks. Modern serving frameworks use PagedAttention to manage KV cache memory efficiently through virtual memory-like page tables. While PagedAttention dramatically improves memory utilization, its memory management model (shared physical pages, copy-on-write) introduces the same classes of vulnerabilities that plagued operating system virtual memory — including confused deputy attacks and TOCTOU (time-of-check-to-time-of-use) races on page table entries.
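The copy-on-write hazard can be shown with a toy page table (a simplified model for illustration, not vLLM's actual block manager):

```python
from dataclasses import dataclass

@dataclass
class PhysicalPage:
    data: list
    refcount: int = 1

class PageTable:
    """Toy KV-cache page table with copy-on-write sharing."""
    def __init__(self):
        self.pages: list[PhysicalPage] = []

    def share(self, page: PhysicalPage) -> None:
        page.refcount += 1  # two sequences now map the same physical page
        self.pages.append(page)

    def write(self, idx: int, token_kv) -> None:
        page = self.pages[idx]
        if page.refcount > 1:
            # Copy-on-write: skipping this check would let one request
            # mutate KV entries still mapped by another request
            page.refcount -= 1
            page = PhysicalPage(data=list(page.data))
            self.pages[idx] = page
        page.data.append(token_kv)

shared = PhysicalPage(data=["kv0"])
victim, attacker = PageTable(), PageTable()
victim.pages.append(shared)
attacker.share(shared)
attacker.write(0, "kv_adv")
assert victim.pages[0].data == ["kv0"]  # victim's page is untouched
```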
References
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023) — PagedAttention KV cache management in vLLM
- Anthropic, "Prompt Caching" (2024) — Prompt caching feature documentation and cache behavior
- Pope et al., "Efficiently Scaling Transformer Inference" (MLSys 2023) — KV cache optimization strategies and memory analysis