KV Cache & Prompt Caching Attacks
How KV cache poisoning, prefix caching exploitation, cache timing side channels, and multi-tenant isolation failures create attack vectors in LLM serving infrastructure.
Overview
Key-Value (KV) caching is a fundamental optimization in transformer-based LLM inference. During autoregressive generation, each new token must attend to all previous tokens. Without caching, this means recomputing the key and value projections for every prior token at every generation step, an O(n^2) operation that becomes prohibitively expensive for long sequences. KV caching stores these computed key-value pairs so each token's projections are calculated only once, reducing generation to O(n) work per token.
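The mechanism can be illustrated with a minimal single-head sketch in plain numpy (hypothetical shapes and names, not any serving framework's actual implementation): each step appends the new token's key and value to the cache and attends over all cached positions, so per-step work is linear in the sequence length.

```python
import numpy as np

def attention_step(q, K_cache, V_cache, k_new, v_new):
    """One autoregressive step: append the new token's K/V to the cache,
    then attend over all cached positions.
    Shapes: q, k_new, v_new are (d,); caches are (t, d)."""
    K_cache = np.vstack([K_cache, k_new])   # (t+1, d)
    V_cache = np.vstack([V_cache, v_new])   # (t+1, d)
    scores = K_cache @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over cached positions
    out = weights @ V_cache                 # (d,)
    return out, K_cache, V_cache

# With the cache, each step does O(t) work; without it, every prior
# token's projections would be recomputed at every step (O(n^2) total).
d = 8
K, V = np.empty((0, d)), np.empty((0, d))
rng = np.random.default_rng(0)
for _ in range(5):
    q, k, v = rng.normal(size=(3, d))
    out, K, V = attention_step(q, K, V, k, v)
print(K.shape)  # (5, 8)
```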
Prompt caching extends this optimization across requests. When multiple API calls share a common prefix (such as a system prompt), the KV cache entries for that prefix can be computed once and reused across all requests that share it. This is a critical cost optimization for API providers: a system prompt that appears in millions of requests per day needs to be processed through the model's layers only once. Providers including OpenAI, Anthropic, and Google have deployed various forms of prompt caching to reduce inference costs by 50-90% for requests with shared prefixes.
However, caching introduces shared state into what should be an isolated computation. When two requests share cached KV entries, there is an implicit information channel between them. If the cache is not properly isolated, this channel can be exploited to leak information (timing side channels), influence other users' outputs (cache poisoning), or infer details about the serving infrastructure (cache probing). These attacks target the infrastructure layer rather than the model itself, placing them in a blind spot for security teams focused on prompt-level and model-level attacks.
The risk is particularly acute in multi-tenant serving environments where multiple customers share GPU resources and, potentially, KV cache memory. Cost pressures incentivize maximal cache sharing; security requires strict isolation. This tension creates a persistent attack surface that grows as providers optimize for efficiency.
How It Works
Understanding the KV cache architecture
In a transformer with L layers and H attention heads, the KV cache stores two tensors per layer per head for each token position. For a model with 32 layers, 32 heads, and a head dimension of 128 running in FP16:
```
Cache size per token = 2 (K+V) x 32 layers x 32 heads x 128 dim x 2 bytes
                     = 524,288 bytes ≈ 0.5 MB per token

For a 4096-token context:
Total KV cache = 4096 x 0.5 MB ≈ 2 GB per request
```

This memory pressure drives cache sharing. A single GPU with 80 GB of memory can serve only ~40 concurrent requests at full context length. Prefix caching reduces this pressure by sharing the KV entries for common prefixes across concurrent requests.
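The arithmetic above can be packaged into a small helper (an illustrative sketch; the function name and defaults are assumptions, not a real framework's API):

```python
def kv_cache_bytes_per_token(n_layers, n_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: K and V tensors per layer per head,
    dtype_bytes=2 for FP16."""
    return 2 * n_layers * n_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(32, 32, 128)
print(per_token)                    # 524288 bytes, i.e. 0.5 MiB per token
print(4096 * per_token / 2**30)     # 2.0 GiB for a 4096-token context
```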
Cache timing side channels
When a request hits the prompt cache (its prefix matches cached KV entries), the time-to-first-token (TTFT) is significantly shorter because the prefix does not need to be processed through the model. This timing difference is observable by the API caller.
```python
import time

def measure_ttft(client, system_prompt, user_message):
    """Measure time-to-first-token to detect cache hits."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        max_tokens=1,
        stream=True,
    )
    for chunk in response:
        if chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start
            break
    return ttft

# Probe whether a specific system prompt is cached
# (i.e., whether another user is using this system prompt)
candidate_prompts = [
    "You are a helpful financial advisor for AcmeCorp...",
    "You are a medical assistant for HealthCo...",
    "You are a legal advisor for LawFirm...",
]

for prompt in candidate_prompts:
    ttft = measure_ttft(client, prompt, "Hello")
    print(f"TTFT: {ttft:.3f}s — {'CACHE HIT' if ttft < 0.5 else 'CACHE MISS'}")
```

A cache hit reveals that another user (or the same provider) is actively using that system prompt, leaking information about the provider's customer base and their system configurations.
Prefix probing for system prompt inference
By systematically testing candidate system prompt prefixes and measuring TTFT, an attacker can reconstruct another user's system prompt character by character (or token by token). This is analogous to a padding oracle attack in cryptography.
```python
def probe_system_prompt_prefix(client, known_prefix, candidate_tokens):
    """
    Given a known prefix of a system prompt, determine the next token
    by testing candidates and measuring cache timing.
    """
    results = []
    for token in candidate_tokens:
        test_prefix = known_prefix + token
        # Measure TTFT multiple times for statistical significance
        timings = [
            measure_ttft(client, test_prefix, "test")
            for _ in range(10)
        ]
        avg_ttft = sum(timings) / len(timings)
        results.append((token, avg_ttft))

    # The candidate that produces the shortest TTFT likely extends
    # the cached prefix — it matches more of the cached KV entries
    results.sort(key=lambda x: x[1])
    most_likely_next = results[0][0]
    return most_likely_next
```

This attack requires that the target system prompt is actively being cached (i.e., the target system is actively receiving requests) and that the attacker can make requests through the same serving infrastructure.
KV cache poisoning in shared environments
In multi-tenant deployments where cache entries are shared without proper isolation, an attacker may be able to inject malicious KV entries that affect other users' requests.
The attack targets the prefix matching mechanism. If the serving infrastructure identifies cache entries by prefix hash rather than by full content verification, hash collisions could cause one user's cached prefix to be served for a different user's request. While hash collisions are rare, targeted collision attacks against weak hash functions are feasible.
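The difference between hash-only lookup and full-content verification can be sketched with a toy prefix cache (all names here are hypothetical; real serving frameworks implement this differently). With content verification, a hash collision falls back to recomputation instead of serving another user's cached entries.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache for illustration only.
    verify_content=True stores the full prefix alongside each entry and
    rejects hash collisions; with verify_content=False, a colliding
    prefix would silently receive another request's cached KV entries."""
    def __init__(self, verify_content=True):
        self.entries = {}
        self.verify_content = verify_content

    def _key(self, prefix_tokens):
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

    def put(self, prefix_tokens, kv_entries):
        self.entries[self._key(prefix_tokens)] = (tuple(prefix_tokens), kv_entries)

    def get(self, prefix_tokens):
        hit = self.entries.get(self._key(prefix_tokens))
        if hit is None:
            return None
        stored_prefix, kv = hit
        if self.verify_content and stored_prefix != tuple(prefix_tokens):
            # Hash collision detected: recompute rather than serve stale KV
            return None
        return kv

cache = PrefixCache()
cache.put([1, 2, 3], "kv-for-123")
print(cache.get([1, 2, 3]))  # kv-for-123
print(cache.get([4, 5, 6]))  # None
```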
More practically, if the cache key includes only the system prompt and not a tenant identifier, then two tenants using the same system prompt share KV cache entries. If one tenant has manipulated the serving infrastructure (e.g., through adversarial batching), the corrupted cache entries affect both tenants.
Attack flow:

1. Attacker identifies the target system prompt (via timing side channel)
2. Attacker sends a request with the same system prompt plus an adversarial suffix
3. If the cache key is computed on the shared prefix, the attacker's suffix processing may influence cached entries for the prefix
4. Subsequent requests from the target that hit the cache receive subtly corrupted KV values

Cache eviction attacks for denial of service
An attacker can deliberately evict cached entries belonging to other users by flooding the cache with unique prefixes, forcing cache replacement. This is a denial-of-service attack against the caching infrastructure that degrades performance for targeted users.
```python
import random
import string

def cache_eviction_attack(client, n_requests=10000):
    """
    Flood the prompt cache with unique prefixes to evict
    other users' cached entries.
    """
    for i in range(n_requests):
        # Generate a unique system prompt that won't match any cache
        random_suffix = ''.join(
            random.choices(string.ascii_letters, k=100)
        )
        unique_prompt = f"You are assistant #{i}. ID: {random_suffix}"
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": unique_prompt},
                {"role": "user", "content": "Hi"},
            ],
            max_tokens=1,
        )
        # Each unique prefix allocates new cache entries,
        # potentially evicting other users' cached prefixes
```
Attack Examples
Example 1: System Prompt Extraction via Cache Timing
A security researcher targets an application built on a commercial LLM API. The application uses a detailed system prompt that includes proprietary instructions, customer data schemas, and tool definitions. By making API calls with candidate system prompt prefixes through the same provider and measuring TTFT, the researcher reconstructs the first 500 tokens of the system prompt over approximately 50,000 API calls. The reconstructed prompt reveals tool definitions, allowed actions, and data schemas that constitute intellectual property and provide attack surface information for further exploitation.
This attack was demonstrated in principle by researchers in 2024 who showed that Anthropic's prompt caching feature, which offers a 90% cost reduction for cached prefixes, creates a measurable timing differential that can be used for prefix inference. The attack's feasibility depends on the provider's cache architecture — specifically, whether cache entries are shared across customers and whether timing information is observable.
Example 2: Multi-Tenant Cache Isolation Failure
In a self-hosted deployment using vLLM or TensorRT-LLM with prefix caching enabled, two customers share the same GPU cluster. Customer A uses a system prompt that includes sensitive access control rules. Customer B discovers that by using the same system prompt prefix and observing generation behavior, they can infer whether the KV cache entries from Customer A's requests are being reused. When cache isolation is not enforced at the tenant level, shared cache entries create an implicit information channel between tenants.
Example 3: Cache-Aware Adversarial Batching
An attacker with access to batch inference endpoints crafts a batch of requests designed to fill specific KV cache positions with adversarial values. If the serving framework uses PagedAttention (as in vLLM) and does not properly isolate page tables between requests in a batch, the adversarial KV values from one request in the batch may influence attention computation for other requests processed in the same batch. This attack is theoretically possible but has not been demonstrated at scale.
Detection and Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Tenant-level cache isolation | Include tenant ID in cache keys; never share KV entries across tenants | High — eliminates cross-tenant information leakage but reduces cache efficiency |
| Timing noise injection | Add random delay to time-to-first-token to mask cache hit/miss signals | Medium — makes timing attacks noisier but does not eliminate them with sufficient samples |
| Cache key integrity verification | Verify full prefix content matches (not just hash) when serving cached KV entries | High — prevents collision-based poisoning but adds latency |
| Rate limiting on unique prefixes | Limit the number of unique system prompts per API key within a time window | Medium — slows cache probing and eviction attacks |
| TTFT normalization | Serve all responses with a minimum TTFT that matches cache-miss latency | High — eliminates timing signal entirely but sacrifices the latency benefit of caching |
| Cache partitioning | Allocate dedicated cache partitions per tenant with guaranteed capacity | High — prevents eviction attacks but requires more memory |
| Monitor cache access patterns | Alert on anomalous cache access patterns (systematic prefix probing, high unique prefix rates) | Medium — detects attacks in progress but does not prevent initial probing |
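Tenant-level cache isolation, the first mitigation in the table, amounts to making the tenant identity part of the cache key. A minimal sketch (the function name and key format are assumptions, not any framework's API):

```python
import hashlib

def cache_key(tenant_id, prefix_tokens):
    """Tenant-scoped cache key: identical prefixes from different tenants
    map to different entries, so KV entries are never shared across
    tenants. (Illustrative sketch, not a real serving framework's API.)"""
    h = hashlib.sha256()
    h.update(tenant_id.encode())
    h.update(b"\x00")  # separator so tenant ID and prefix cannot blur
    h.update(repr(prefix_tokens).encode())
    return h.hexdigest()

same_prefix = [101, 2023, 318]
# Same system prompt, different tenants: distinct cache entries
print(cache_key("tenant-a", same_prefix) == cache_key("tenant-b", same_prefix))  # False
```

The cost is exactly the efficiency loss noted in the table: a popular system prompt used by N tenants is now computed and stored N times.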
Key Considerations
Efficiency and security are directly at odds. The entire value of prompt caching comes from sharing computation across requests. Every security measure that restricts sharing — tenant isolation, cache partitioning, TTFT normalization — reduces the efficiency benefit. Providers must explicitly decide where they sit on the efficiency-security tradeoff and document this for customers.
Side channels are inherent to shared infrastructure. Timing side channels are not a bug in the cache implementation; they are a fundamental property of shared computation. Any shared resource — cache, GPU, network — creates potential side channels. Eliminating all side channels requires dedicated infrastructure per tenant, which many providers cannot afford.
Self-hosted deployments are at higher risk. Organizations running their own LLM serving infrastructure (using frameworks like vLLM, TensorRT-LLM, or text-generation-inference) often enable prefix caching for cost savings without implementing tenant isolation. These deployments may share the KV cache across all requests on a GPU, creating a larger attack surface than managed API services, which typically implement some degree of isolation.
The PagedAttention paradigm introduces new risks. Modern serving frameworks use PagedAttention to manage KV cache memory efficiently through virtual memory-like page tables. While PagedAttention dramatically improves memory utilization, its memory management model (shared physical pages, copy-on-write) introduces the same classes of vulnerabilities that plagued operating system virtual memory — including confused deputy attacks and TOCTOU (time-of-check-to-time-of-use) races on page table entries.
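The copy-on-write behavior described above can be modeled with a toy refcounted page pool (an illustrative sketch with hypothetical names, not vLLM's actual implementation). The isolation property hinges on the refcount check in the write path; a bug that skips it would let one sequence mutate a page another sequence is still reading.

```python
class PagedKVCache:
    """Toy model of PagedAttention-style block sharing: sequences that
    share a prefix point at the same physical pages (refcounted), and a
    write to a shared page triggers copy-on-write."""
    def __init__(self):
        self.pages = []      # physical pages: mutable lists of KV slots
        self.refcount = []

    def alloc(self, data):
        self.pages.append(list(data))
        self.refcount.append(1)
        return len(self.pages) - 1

    def share(self, page_id):
        """A second sequence maps the same physical page."""
        self.refcount[page_id] += 1
        return page_id

    def write(self, page_id, slot, value):
        """Copy-on-write: never mutate a page another sequence maps."""
        if self.refcount[page_id] > 1:
            self.refcount[page_id] -= 1
            page_id = self.alloc(self.pages[page_id])  # private copy
        self.pages[page_id][slot] = value
        return page_id

cache = PagedKVCache()
p = cache.alloc([0, 0, 0, 0])        # sequence A's prefix page
shared = cache.share(p)              # sequence B reuses the same page
new_p = cache.write(shared, 0, 99)   # B's write copies; A's page untouched
print(new_p != p, cache.pages[p][0]) # True 0
```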
References
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023) — PagedAttention KV cache management in vLLM
- Anthropic, "Prompt Caching" (2024) — Prompt caching feature documentation and cache behavior
- Pope et al., "Efficiently Scaling Transformer Inference" (MLSys 2023) — KV cache optimization strategies and memory analysis
How can attackers use cache timing to extract information about another user's system prompt?