KV Cache Poisoning & Exploitation
How KV cache works in transformer inference, cache poisoning across requests in shared deployments, prefix caching attacks, and cross-tenant data leakage.
The KV cache is a fundamental optimization in transformer inference. By caching attention key and value tensors from previous tokens, the model avoids recomputing them during autoregressive generation. In multi-tenant deployments, KV cache sharing creates a cross-tenant attack surface that operates below the application layer.
How KV Cache Works
During autoregressive generation, each new token attends to all previous tokens. Without caching, this requires recomputing key and value projections for the entire sequence at each step:
```python
# Without KV cache: O(n^2) total projection compute for n tokens
for t in range(seq_len):
    # Recompute K, V for ALL tokens 0..t at every step
    K = X[:, :t+1, :] @ W_k   # [batch, t+1, d_k]
    V = X[:, :t+1, :] @ W_v   # [batch, t+1, d_v]
    Q = X[:, t:t+1, :] @ W_q  # [batch, 1, d_k]
    output = attention(Q, K, V)
```
```python
# With KV cache: O(n) total projection compute
kv_cache = {}
for t in range(seq_len):
    # Only compute K, V for the NEW token
    k_new = X[:, t:t+1, :] @ W_k
    v_new = X[:, t:t+1, :] @ W_v
    if 'K' in kv_cache:
        kv_cache['K'] = torch.cat([kv_cache['K'], k_new], dim=1)
        kv_cache['V'] = torch.cat([kv_cache['V'], v_new], dim=1)
    else:
        kv_cache['K'], kv_cache['V'] = k_new, v_new
    Q = X[:, t:t+1, :] @ W_q
    output = attention(Q, kv_cache['K'], kv_cache['V'])
```
Cache Memory Requirements
| Model Size | Layers | Heads | d_k | Cache per Token | 4K Context Cache |
|---|---|---|---|---|---|
| 7B | 32 | 32 | 128 | 256 KB | 1 GB |
| 70B | 80 | 64 | 128 | 1.3 MB | 5.2 GB |
| 405B | 126 | 128 | 128 | 4.1 MB | 16.4 GB |
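The per-token figures above follow from the standard formula: each layer stores two tensors (K and V), each n_kv_heads × d_head elements per token. A minimal sketch, assuming an 8-bit cache element (which reproduces the 7B row; fp16 doubles these figures, and grouped-query attention shrinks them by caching fewer KV heads than attention heads):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             d_head: int, bytes_per_elem: int = 1) -> int:
    """KV cache footprint per token: 2 tensors (K and V) per layer,
    each n_kv_heads * d_head elements of bytes_per_elem each."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem

# 7B-class model: 32 layers, 32 KV heads, d_head 128, 8-bit elements
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, d_head=128)
print(per_token // 1024, "KB per token")        # 256 KB per token
print(per_token * 4096 / 2**30, "GB at 4K ctx") # 1.0 GB at 4K ctx
```

The 4K-context column is simply the per-token figure multiplied by the context length, which is why long contexts dominate GPU memory in multi-tenant serving.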
Attack Vector 1: Prefix Cache Poisoning
Prefix caching (used by vLLM, TGI, and most production inference frameworks) stores KV states for common prefixes and reuses them across requests. This creates shared state between requests.
The Attack
If an attacker can influence the cached prefix state, all subsequent requests that reuse that prefix inherit the poisoned attention context:
1. Identify the shared prefix -- Determine which system prompt or prefix is shared across requests (often the system prompt is identical for all users of a service).
2. Craft a poisoning request -- Send a request that, during its processing, modifies the cached KV state for the shared prefix. This requires the inference framework to improperly handle cache invalidation.
3. Subsequent requests inherit the poisoned state -- Other users' requests that reuse the cached prefix now have attention context that includes the attacker's influence.
```python
# Conceptual: how prefix caching works in vLLM-style systems
class PrefixCache:
    def __init__(self):
        self.cache = {}  # hash(prefix_tokens) -> kv_states

    def get_or_compute(self, prefix_tokens, model):
        key = hash(tuple(prefix_tokens))
        if key not in self.cache:
            # Compute and cache KV states for this prefix
            self.cache[key] = model.compute_kv(prefix_tokens)
        return self.cache[key]

# VULNERABILITY: if cache entries are mutable references,
# a request that modifies kv_states in-place will poison
# all future requests sharing that prefix
```
Attack Vector 2: Cross-Tenant KV Cache Leakage
In multi-tenant deployments where multiple users share GPU memory, KV cache reuse can leak information between tenants.
Information Leakage Channels
| Channel | Mechanism | Information Leaked |
|---|---|---|
| Cache hit timing | Shared prefixes produce faster responses | Whether another tenant used the same prefix |
| Memory reuse | Uncleared GPU memory from previous tenant | Fragments of previous KV states |
| Capacity contention | One tenant's long context evicts another's cache | Timing-based inference of other tenant's activity |
Timing Side-Channel Attack
```python
import asyncio
import time

import httpx

async def probe_prefix_cache(client: httpx.AsyncClient,
                             api_url: str, test_prefix: str) -> float:
    """Measure response latency to detect cached prefixes.
    A cache hit (shared prefix) produces noticeably lower latency."""
    start = time.perf_counter()
    await client.post(api_url, json={
        "prompt": test_prefix + " Continue.",
        "max_tokens": 1,  # minimize generation time
    })
    return time.perf_counter() - start

# Compare latencies for candidate system prompts.
# Lower latency = likely a cached prefix = another tenant's system prompt
async def probe_candidates(api_url: str):
    candidates = [
        "You are a helpful assistant for AcmeCorp...",
        "You are a financial advisor. Never disclose...",
        "You are a medical chatbot. Always recommend...",
    ]
    async with httpx.AsyncClient() as client:
        for prefix in candidates:
            latency = await probe_prefix_cache(client, api_url, prefix)
            print(f"Latency: {latency:.4f}s - {prefix[:50]}...")
```
Attack Vector 3: PagedAttention Exploitation
PagedAttention (used in vLLM) manages KV cache memory with a page table analogous to OS virtual memory. This enables memory sharing but introduces a page-level attack surface:
Page Table Manipulation
- Copy-on-write bypass -- If the page table does not properly implement COW semantics, writes from one sequence can affect shared pages
- Page eviction attacks -- Force eviction of critical cache pages by generating many concurrent long sequences, causing cache thrashing for other tenants
- Fragment analysis -- Allocated but uninitialized pages may contain KV residues from previous requests
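The copy-on-write concern in the first bullet can be made concrete with a toy block table (illustrative classes, not vLLM's actual implementation): if a sequence writes to a shared page without copying it first, every other sequence mapping that page sees the write.

```python
# Toy page table illustrating the copy-on-write (COW) concern.
# Page / SequenceBlocks are illustrative names, not vLLM's real API.
class Page:
    def __init__(self, size=4):
        self.slots = [None] * size  # stand-in for per-token KV entries
        self.ref_count = 1          # number of sequences mapping this page

class SequenceBlocks:
    def __init__(self, pages):
        self.pages = pages

def write_unsafe(seq, page_idx, slot, value):
    # BUG: writes through to a page that other sequences may share
    seq.pages[page_idx].slots[slot] = value

def write_cow(seq, page_idx, slot, value):
    page = seq.pages[page_idx]
    if page.ref_count > 1:              # shared page: copy before writing
        page.ref_count -= 1
        copy = Page(len(page.slots))
        copy.slots = list(page.slots)
        seq.pages[page_idx] = page = copy
    page.slots[slot] = value

# Two sequences share a cached prefix page
shared = Page(); shared.ref_count = 2
a, b = SequenceBlocks([shared]), SequenceBlocks([shared])
write_unsafe(a, 0, 0, "poison")
print(b.pages[0].slots[0])  # "poison" -- leaked into sequence b

shared2 = Page(); shared2.ref_count = 2
c, d = SequenceBlocks([shared2]), SequenceBlocks([shared2])
write_cow(c, 0, 0, "local")
print(d.pages[0].slots[0])  # None -- d's view is unaffected
```

With correct COW semantics, the shared page is duplicated before the first write, so one tenant's generation can never mutate another tenant's cached prefix.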
```python
# Detect potential KV cache residue from previous requests
# by analyzing attention patterns for anomalous historical context.
# get_attention_weights and compute_kl_divergence are hypothetical
# helpers, not library functions.
def detect_cache_residue(model, clean_prompt, suspicious_prompt):
    """Compare attention patterns between clean and potentially
    contaminated cache states."""
    clean_attn = get_attention_weights(model, clean_prompt, use_cache=False)
    cached_attn = get_attention_weights(model, suspicious_prompt, use_cache=True)
    # Residue manifests as attention to positions beyond the current prompt
    divergence = compute_kl_divergence(clean_attn, cached_attn)
    return divergence
```
Defense: Cache Isolation Architecture
Isolation Levels
| Level | Mechanism | Performance Cost | Security |
|---|---|---|---|
| No isolation | Shared cache, shared pages | Baseline | None |
| Prefix isolation | Separate cache per system prompt | 10-20% memory overhead | Medium |
| Tenant isolation | Separate cache pool per tenant | 30-50% memory overhead | High |
| Request isolation | No cache reuse between requests | 2-5x latency increase | Maximum |
Implementation Checklist
- Immutable cache entries -- Store cached KV states as read-only tensors; never allow in-place modification
- Cache key includes tenant ID -- Prevent cross-tenant cache hits even for identical prefixes
- Memory zeroing -- Clear GPU memory pages before reallocation to prevent residue leakage
- Cache hit monitoring -- Log cache hit rates per tenant to detect probing attacks
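The first two checklist items can be sketched together (a hypothetical wrapper, not a specific framework's API): the cache key hashes the tenant ID alongside the prefix tokens, and computed KV states are frozen before being stored so later requests cannot mutate them in place.

```python
import hashlib

class TenantPrefixCache:
    """Prefix cache with tenant-scoped keys and immutable entries.
    Hypothetical sketch; real frameworks key on token-block hashes."""
    def __init__(self):
        self._cache = {}

    def _key(self, tenant_id: str, prefix_tokens) -> str:
        # Tenant ID is part of the key: identical prefixes from
        # different tenants never share a cache entry.
        h = hashlib.sha256()
        h.update(tenant_id.encode())
        h.update(b"|")
        h.update(str(list(prefix_tokens)).encode())
        return h.hexdigest()

    def get_or_compute(self, tenant_id, prefix_tokens, compute_kv):
        key = self._key(tenant_id, prefix_tokens)
        if key not in self._cache:
            # Freeze the computed states (tuple of tuples here; with
            # real tensors, store read-only copies) so no request can
            # modify the shared entry in place.
            kv = compute_kv(prefix_tokens)
            self._cache[key] = tuple(tuple(layer) for layer in kv)
        return self._cache[key]

cache = TenantPrefixCache()
fake_kv = lambda toks: [[0.1, 0.2], [0.3, 0.4]]  # stand-in for model.compute_kv
a = cache.get_or_compute("tenant-a", [1, 2, 3], fake_kv)
b = cache.get_or_compute("tenant-b", [1, 2, 3], fake_kv)
print(a is b)  # False: same prefix, different tenants, separate entries
```

Tenant-scoped keys trade some hit rate for isolation, which is the "Tenant isolation" row in the table above; the frozen entries address the mutable-reference vulnerability from Attack Vector 1.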
Related Topics
- Model Architecture Attack Vectors -- Architecture attack surface overview
- Context Window Exploitation -- Context-level attacks that interact with KV cache
- Inference Optimization Attacks -- Batching and speculative decoding attacks
- API Security -- Application-layer defenses for LLM APIs
References
- Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023) -- PagedAttention / vLLM
- SGLang: Efficient Execution of Structured Language Model Programs (Zheng et al., 2023) -- RadixAttention prefix caching