Model Caching Security
Security risks of caching in AI inference — covering KV cache attacks, prompt caching exploitation, shared cache information leakage, cache poisoning, and secure caching configurations.
Caching is essential for efficient AI inference. The KV cache stores attention state during autoregressive generation. Prompt caching stores precomputed states for common prefixes. Response caching stores complete outputs for repeated queries. Each caching mechanism trades storage for computation, but each also creates opportunities for information leakage, cache poisoning, and cross-tenant data exposure.
KV Cache Fundamentals
How the KV Cache Works
During autoregressive generation, each new token attends to all previous tokens through key-value pairs computed by the attention mechanism. Without caching, every token would require recomputing attention for the entire sequence. The KV cache stores these key-value pairs so they are computed once and reused for subsequent tokens.
The KV cache contains rich semantic information about the conversation. The key and value tensors encode the model's representation of every token in the context — including the system prompt, user messages, retrieved context, and generated output. An attacker who can read the KV cache can reconstruct significant portions of the conversation.
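The mechanism above can be sketched in a few lines. This is a toy single-head attention loop, not any real model's implementation; the random key/value/query vectors stand in for projections a real model would compute from hidden states.

```python
import numpy as np

def attention_step(q, K_cache, V_cache):
    """Single-head attention for one new token against all cached keys/values."""
    scores = K_cache @ q / np.sqrt(q.shape[0])   # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache

d = 8
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for step in range(4):
    # Hypothetical per-token projections; a real model derives these from hidden states.
    k_new, v_new, q = rng.normal(size=(3, d))
    # Append this token's (k, v) once; later steps reuse them instead of recomputing.
    K_cache = np.vstack([K_cache, k_new])
    V_cache = np.vstack([V_cache, v_new])
    out = attention_step(q, K_cache, V_cache)

print(K_cache.shape)  # the cache grows by one (k, v) pair per generated token
```

The security-relevant point is visible in the loop: every token ever processed leaves a persistent (k, v) pair in memory, so the cache is a running transcript of the context in tensor form.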
KV Cache Size and Memory Pressure
For a large language model, the KV cache consumes significant GPU memory. A 70B-parameter model serving a 4096-token context might use 4-8 GB of KV cache per request, depending on layer count, attention-head configuration, and numeric precision. In high-throughput serving, KV cache management is often the bottleneck, leading to memory management optimizations that may have security implications.
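The size follows directly from the model shape. The formula below is standard (two tensors, K and V, per layer per head per token); the specific layer and head counts are illustrative assumptions for a 70B-class model, not figures from any particular model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative figures for a 70B-class model with full multi-head attention (fp16).
full_mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)
# The same model with grouped-query attention (8 KV heads) needs far less memory.
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA: {full_mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
```

Grouped-query attention cuts the cache roughly in proportion to the reduction in KV heads, which is one reason modern serving stacks can pack more concurrent requests onto a GPU.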
KV Cache Attacks
Cross-Request Cache Leakage
In serving frameworks that manage KV cache aggressively for memory efficiency, cache memory from one request may be allocated to another request without proper clearing. If the cache memory contains attention state from a previous user's conversation, the new user's generation may be influenced by residual state.
This is not a direct data exposure — the new user does not see the previous user's text. But the residual attention state can subtly influence the model's generation, potentially causing it to produce outputs that reflect topics or patterns from the previous conversation.
Testing for cross-request cache leakage requires sending sequences of requests and analyzing whether the model's outputs show unexpected correlation with previous requests from different users.
KV Cache Extraction
An attacker who gains access to GPU memory (through GPU driver vulnerabilities, container escape, or legitimate access in a multi-tenant environment) can read KV cache contents. Extracting the key and value tensors and analyzing them can reveal the conversation content, including system prompts that are otherwise hidden from the user.
Research has demonstrated that KV cache values can be decoded to recover significant portions of the original text, especially when combined with knowledge of the model architecture and tokenizer.
PagedAttention Security
vLLM's PagedAttention mechanism manages KV cache in fixed-size pages, similar to OS virtual memory. Pages are allocated, freed, and potentially reused across requests. The security question is whether freed pages are properly cleared before reuse.
By default, PagedAttention optimizes for performance, not security. Pages may be reused without clearing, creating the same residual data risk as uninitialized memory in traditional systems. Organizations running vLLM in multi-tenant environments should verify that their configuration clears KV cache pages before reuse.
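The mitigation is conceptually simple: scrub pages on free, before they can be handed to another request. The sketch below is a toy page pool, not vLLM's actual block manager, and the page geometry is an assumption for illustration.

```python
import numpy as np

PAGE_TOKENS = 16   # tokens per KV cache page (illustrative)
KV_DIM = 8         # toy per-token KV width (illustrative)

class ScrubbingPagePool:
    """Toy KV-cache page pool that zeroes pages on free, so a page later
    reused by another request carries no residual attention state."""

    def __init__(self, num_pages):
        self.storage = np.zeros((num_pages, PAGE_TOKENS, KV_DIM))
        self.free_list = list(range(num_pages))

    def alloc(self):
        return self.free_list.pop()

    def free(self, page_id):
        self.storage[page_id].fill(0.0)   # scrub before the page can be reused
        self.free_list.append(page_id)

pool = ScrubbingPagePool(num_pages=4)
p = pool.alloc()
pool.storage[p] = 1.0             # tenant A's request writes attention state
pool.free(p)                      # page is scrubbed on release
q = pool.alloc()                  # tenant B's request receives the same page...
assert not pool.storage[q].any()  # ...with no residual data
```

The cost of the memset on free is small relative to attention compute; the alternative (scrub-on-alloc) has the same effect but leaves sensitive data resident in freed pages for longer.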
Prompt Caching
How Prompt Caching Works
Prompt caching stores precomputed KV cache states for common prompt prefixes. When many requests share the same system prompt (as in most applications), the system prompt's KV cache is computed once and reused for all requests. This significantly reduces the computation needed for each request.
API providers like Anthropic and OpenAI offer prompt caching as a feature that reduces cost and latency. Self-hosted serving frameworks implement similar mechanisms.
Prompt Cache Security Risks
Shared prefix exposure: If prompt caching is shared across users, users effectively share their system prompt's attention state. While this is usually the intended behavior (all users of the same application share the same system prompt), misconfigurations could cause one application's prompt cache to be used by another application's requests.
Cache probing: An attacker can determine whether a specific prompt prefix is cached by measuring request latency. Cached prefixes produce faster responses than uncached ones. This timing side channel reveals information about what prompts other users are sending — or at least what system prompts are in use.
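The probing technique amounts to comparing latency distributions. The sketch below runs against a simulated server (the `serve` function and the cached prefix are stand-ins, not a real API); against a live endpoint the attacker would time HTTP requests instead.

```python
import time
import statistics

CACHED_PREFIXES = {"You are a helpful banking assistant."}  # simulated server state

def serve(prompt):
    """Simulated inference server: a cached prefix skips the prefill work."""
    prefix = prompt.split("\n")[0]
    time.sleep(0.001 if prefix in CACHED_PREFIXES else 0.02)
    return "ok"

def probe(prefix, trials=5):
    """Median latency for a candidate prefix; markedly faster means likely cached."""
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        serve(prefix + "\nHello")
        times.append(time.perf_counter() - start)
    return statistics.median(times)

hot = probe("You are a helpful banking assistant.")
cold = probe("You are a pirate.")
print(hot < cold)  # the timing gap reveals which system prompts are cached
```

Using the median over several trials filters out scheduling noise; real probes face far noisier network latency and need more trials, but the principle is identical.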
Cache poisoning: If an attacker can influence what is cached (by being the first to send a specific prefix), they may be able to cache a modified prefix that subtly differs from the legitimate one. Subsequent requests that match the cached prefix will use the attacker's cached state.
Response Caching
Semantic Response Caching
Some applications cache complete model responses for similar queries to reduce inference costs. When a new query is similar enough to a previously cached query, the cached response is returned without invoking the model.
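A minimal semantic cache looks like the sketch below. The bag-of-words "embedding" is a deliberately crude stand-in for a real sentence-embedding model, and the 0.8 threshold is an assumption; both are chosen only to make the similarity-matching behavior visible.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; real systems use a sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.entries = []          # list of (embedding, response) pairs
        self.threshold = threshold

    def lookup(self, query):
        q = embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response    # cache hit: the model is never invoked
        return None

    def store(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.store("what is the wire transfer limit", "The limit is $10,000.")
# A near-duplicate query, possibly from a *different* user, hits the same entry:
print(cache.lookup("what is the wire transfer limit today"))
```

Note that nothing in the lookup path checks who stored the entry, which is exactly the gap the attacks below exploit.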
Cache key manipulation: If the cache uses semantic similarity for key matching, an attacker can craft queries that are semantically similar to target queries but request different information. If the cache returns a response for a different user's query, the attacker receives that response.
Cache poisoning: An attacker can populate the cache with responses to queries they expect other users to make. If the cached response contains misleading or malicious information, subsequent users who trigger cache hits receive the attacker's content rather than a fresh model response.
Stale response risks: Cached responses do not reflect model updates, configuration changes, or context changes that occurred after caching. A response cached before a safety update may contain content that the updated model would refuse to generate.
Secure Caching Configuration
KV Cache Isolation
For multi-tenant deployments, implement KV cache isolation between tenants. Use separate KV cache pools for different tenants. Clear KV cache pages before reallocation to a different tenant. Monitor KV cache usage per tenant for anomalous patterns.
Prompt Cache Access Control
Implement access controls on prompt caching. Scope prompt caches to specific applications or API keys. Prevent cross-application prompt cache sharing. Audit prompt cache hits and misses to detect probing.
Response Cache Security
If using response caching, implement per-user cache isolation (do not share cached responses across users). Set appropriate cache TTL to limit stale response risk. Validate cached responses against current safety policies before returning them. Monitor cache hit rates for patterns indicating cache manipulation.
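Per-user isolation and TTL enforcement can both be expressed in the cache key and entry structure. The sketch below is one way to do it, with hypothetical user and model-version identifiers; binding the model version into the key also forces a natural invalidation when the model or its safety configuration changes.

```python
import hashlib
import time

def cache_key(user_id, model_version, prompt):
    """Scope cache entries to a user and model version so responses never
    cross user boundaries and go stale when the model changes."""
    material = f"{user_id}:{model_version}:{prompt}".encode()
    return hashlib.sha256(material).hexdigest()

class ScopedResponseCache:
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}            # key -> (expires_at, response)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(key)
        if entry is None or entry[0] < now:
            return None            # miss, or expired entry treated as a miss
        return entry[1]

    def put(self, key, response, now=None):
        now = time.time() if now is None else now
        self.store[key] = (now + self.ttl, response)

cache = ScopedResponseCache(ttl_seconds=300)
k_alice = cache_key("alice", "v2", "reset my password")
cache.put(k_alice, "Here are the steps...")
# The same prompt from another user produces a different key: no cross-user hit.
k_bob = cache_key("bob", "v2", "reset my password")
print(cache.get(k_bob))  # None
```

Per-user scoping trades away most of the hit-rate benefit of response caching; whether that trade is acceptable depends on how much repeated traffic comes from individual users.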
Cache Monitoring
Monitor all caching layers for security-relevant events. Track cache hit ratios per user to identify probing. Alert on unusual cache population patterns that may indicate poisoning. Monitor cache memory usage for anomalies that may indicate extraction attempts. Log cache eviction events that may reveal timing information.
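One of the signals above, per-user hit-ratio tracking, can be sketched as follows. The thresholds are illustrative assumptions, and this flags unusually low hit rates (a prefix-scanning prober mostly misses); a production monitor would watch for anomalies in both directions.

```python
from collections import defaultdict

class CacheHitMonitor:
    """Tracks per-user hit/miss counts; a user issuing many lookups with an
    unusually low hit rate may be scanning for cached prefixes."""

    def __init__(self, min_lookups=20, alert_below=0.1):
        self.stats = defaultdict(lambda: [0, 0])   # user -> [hits, lookups]
        self.min_lookups = min_lookups             # ignore users with few lookups
        self.alert_below = alert_below             # hit-ratio alert threshold

    def record(self, user, hit):
        s = self.stats[user]
        s[0] += int(hit)
        s[1] += 1

    def suspicious_users(self):
        return [u for u, (hits, total) in self.stats.items()
                if total >= self.min_lookups and hits / total < self.alert_below]

mon = CacheHitMonitor()
for _ in range(25):
    mon.record("prober", hit=False)    # scanning many uncached prefixes
for _ in range(25):
    mon.record("normal", hit=True)     # ordinary repeated traffic
print(mon.suspicious_users())  # ['prober']
```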
Caching is a necessary performance optimization for AI inference, but it creates information-sharing mechanisms that must be secured. Every cached piece of data is a potential information leakage vector, and every cache lookup is a potential timing side channel. Design caching with security as a constraint, not an afterthought.