Denial of Service
LLM resource exhaustion attacks including sponge examples, context window flooding, recursive prompt loops, and token amplification that degrade or disable AI systems.
Overview
Denial of service against AI systems operates on fundamentally different principles than traditional network-level DoS. While conventional attacks overwhelm bandwidth or connection limits, LLM denial of service exploits the computational asymmetry inherent in transformer-based models: a short, carefully crafted input can force the model to perform vastly disproportionate computation. A single malicious prompt costing fractions of a cent to send can consume dollars in GPU compute, and a sustained campaign can render an AI service unavailable or financially unsustainable.
The attack surface is uniquely broad because LLMs expose computational complexity directly to user input. The length of the input, the nature of the tokens, the requested output length, and the interaction between retrieved context and generation all influence processing cost. Unlike traditional web services where request processing cost is relatively uniform, LLM inference cost varies by orders of magnitude depending on the input. This variance creates optimization opportunities for attackers who understand the model's computational bottlenecks.
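The quadratic scaling behind this variance can be illustrated with a back-of-envelope FLOP estimate. This is a sketch, not a profiler: `d_model` and `n_layers` are illustrative values, and the formula counts only the attention score and value matmuls, ignoring MLP layers and KV-cache optimizations.

```python
# Rough sketch: how self-attention compute grows with sequence length n.
# The attention score computation (QK^T) and the attention-weighted value
# sum are each O(n^2 * d_model) per layer, so doubling the input roughly
# quadruples this term. Model dimensions here are illustrative only.
def attention_flops(n_tokens: int, d_model: int = 4096, n_layers: int = 32) -> int:
    """Approximate FLOPs for the attention matmuls across all layers."""
    per_layer = 2 * (n_tokens ** 2) * d_model  # QK^T plus attn-weighted V
    return n_layers * per_layer

# A 100x longer input costs ~10,000x more attention compute
ratio = attention_flops(100_000) / attention_flops(1_000)
print(f"100x longer input -> {ratio:,.0f}x more attention compute")
```

This is why per-request cost cannot be treated as uniform: two requests that look identical at the HTTP layer can differ by four orders of magnitude in attention compute.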
The financial dimension is particularly significant for organizations using pay-per-token API services. An attacker who can trigger expensive completions -- long outputs, repeated API calls through agentic loops, or retrieval of large context windows -- can inflict direct financial damage without ever compromising the model's integrity. This economic denial of service, sometimes called "LLM billing fraud," represents a category of attack with no direct analogue in traditional security.
Resource exhaustion attacks also affect shared infrastructure. In multi-tenant deployments where multiple customers share GPU clusters, a single tenant's resource-intensive queries can degrade performance for all tenants. This "noisy neighbor" effect means that even attacks targeting one user's deployment can cascade across an entire hosting platform, making denial of service a systemic risk for AI infrastructure providers.
How It Works
Identify Computational Bottlenecks
The attacker analyzes the target system to identify which input characteristics drive the highest computational cost. Key factors include input token count (longer inputs require more attention computation), output token count (longer outputs require more generation steps), context window utilization (retrieving and processing large document sets in RAG systems), and any tool-use or agentic capabilities that create recursive processing loops. The attacker profiles the system's response times across varying inputs to map the cost surface.
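The profiling step can be sketched as a simple latency sweep over increasing input lengths. `query_model` below is a hypothetical stand-in for whatever client calls the target API, not a real SDK function; defenders can run the same probe against their own deployment to understand its cost surface before an attacker does.

```python
# Sketch of cost-surface profiling: probe the target with inputs of
# increasing length and record wall-clock latency per probe.
# `query_model` is an assumed callable wrapping the target API client.
import time

def profile_cost_surface(query_model, lengths=(100, 1_000, 10_000)):
    """Map input length (in words) to observed latency in seconds."""
    surface = {}
    for n in lengths:
        probe = "data " * n  # crude filler; real probes also vary token content
        start = time.perf_counter()
        query_model(probe)
        surface[n] = time.perf_counter() - start
    return surface
```

Plotting the resulting latencies against length reveals whether cost grows linearly or superlinearly, and where serving-side optimizations (batching, early exit) kick in.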
Craft Resource-Intensive Inputs
Using the identified bottlenecks, the attacker constructs inputs designed to maximize resource consumption. Sponge examples use adversarially chosen token sequences that force worst-case computation. Context flooding inputs trigger retrieval of maximum context. Recursive prompts exploit agentic capabilities to create processing loops. Token amplification prompts request maximum-length outputs for minimal inputs. The attacker optimizes these inputs to maximize the ratio of server-side cost to attacker-side cost.
Execute Sustained Campaign
The attacker sends resource-intensive requests at scale, either through direct API access, automated bot accounts, or by poisoning upstream data sources that trigger expensive processing. The goal is to either exhaust rate limits (denying service to legitimate users), exhaust computational budgets (causing financial damage), or degrade response latency below acceptable thresholds. Distributed attacks using multiple accounts can bypass per-account rate limits.
Attack Examples
Example 1: Sponge Examples -- Maximizing Per-Request Computation
Sponge examples are inputs specifically crafted to maximize the energy consumption and latency of neural network inference. Research by Shumailov et al. demonstrated that adversarially chosen inputs can increase inference time by 2-3x compared to typical inputs.
# Sponge example concept: inputs that maximize computation
# Transformer attention is O(n^2) in sequence length, so
# maximizing input length is the simplest amplification
# Approach 1: Fill the context window with adversarial tokens
# Tokens that create high entropy in attention distributions
# force more computation per attention head
sponge_prompt = (
    "Process the following data thoroughly and provide "
    "a comprehensive analysis: "
    + generate_high_entropy_token_sequence(max_tokens=128000)
)
# Approach 2: Craft inputs that defeat early-exit optimizations
# Some serving systems skip computation when the model is
# "confident" -- sponge examples maintain uncertainty
adversarial_input = craft_uncertainty_maximizing_input(
    model=target_model,
    target_layers="all",
    objective="maximize_active_neurons",
)

The key insight is that transformer self-attention computes pairwise relationships between all tokens, making computational cost quadratic in sequence length. Inputs that maximize effective sequence length while defeating any optimization shortcuts impose maximum cost.
Example 2: Context Window Flooding via RAG Manipulation
# Attack: craft queries that trigger retrieval of maximum
# context from the RAG knowledge base
# Step 1: Identify queries that match many documents
broad_query = (
    "Summarize all company policies regarding employee benefits, "
    "compensation, travel, equipment, remote work, security, "
    "data handling, and compliance requirements"
)
# This query has high semantic similarity to many document types,
# forcing retrieval of maximum chunks
# Step 2: Request comprehensive output
flooding_prompt = f"""
{broad_query}
Provide an exhaustive analysis covering every single point
from every retrieved document. Do not summarize or abbreviate.
Quote relevant passages in full. Organize by topic and
cross-reference between documents.
"""
# Step 3: Repeat at scale
import asyncio
async def flood_attack(endpoint, num_requests=1000):
    tasks = [
        asyncio.create_task(send_request(endpoint, flooding_prompt))
        for _ in range(num_requests)
    ]
    await asyncio.gather(*tasks)

Example 3: Recursive Prompt Loops in Agentic Systems
# Attack: exploit tool-use capabilities to create processing loops
# If the model has access to tools (web search, code execution,
# API calls), craft prompts that trigger recursive tool use
recursive_prompt = """
Search for the latest information about [topic]. For each fact
you find, verify it by searching for corroborating sources.
For each corroborating source, check its credibility by
searching for information about that source. Continue until
you have verified every claim to three levels of depth.
"""
# In agentic systems with code execution:
code_loop_prompt = """
Write and execute a Python script that generates a comprehensive
report. The script should call this API endpoint for each data
point it needs, process the results, and if any results are
incomplete, retry with modified parameters.
"""
# This can create O(n^2) or worse API call patterns

Example 4: Token Amplification Attack
# Attack: minimal input that produces maximum output
amplification_prompts = [
    # High amplification ratio prompts
    "Write a 10,000-word essay on the history of computing.",
    # Exploit structured output generation
    "Generate a JSON array of 500 objects, each with 20 fields "
    "containing realistic synthetic user data.",
    # Chain-of-thought amplification
    "Solve this step by step, showing all work: calculate the "
    "prime factorization of every number from 1 to 10000.",
    # Repetitive output exploitation
    "Generate unique names for 1000 fictional characters, with "
    "a one-paragraph backstory for each.",
]
# Calculate amplification ratio
input_tokens = count_tokens(amplification_prompts[0]) # ~15 tokens
output_tokens = 10000 * 1.3 # ~13,000 tokens
amplification_ratio = output_tokens / input_tokens  # ~866:1

Example 5: API Rate Limit Bypass via Distributed Accounts
Attack pattern:
1. Create N free-tier accounts (or compromise existing accounts)
2. Each account sends requests just below the per-account rate limit
3. Aggregate load: N * per_account_limit
4. Target shared backend infrastructure, not individual accounts
For pay-per-token APIs:
1. Compromise or create accounts with billing enabled
2. Send maximum-cost requests (long context + long output)
3. Each request costs $0.10-$1.00+ in compute
4. 1000 requests/hour = $100-$1000/hour in billing damage
Detection & Mitigation
| Approach | Description | Effectiveness |
|---|---|---|
| Per-user token budgets | Enforce maximum input + output tokens per user per time window | High |
| Computational cost monitoring | Track GPU-seconds per request and alert on outliers | High |
| Adaptive rate limiting | Adjust rate limits based on request computational cost, not just count | High |
| Output length caps | Set hard limits on maximum generation length | Medium |
| Sponge example detection | Use input classifiers to identify adversarially crafted high-cost inputs | Medium |
| Context retrieval limits | Cap the number and total size of documents retrieved in RAG queries | Medium |
| Agentic loop depth limits | Restrict maximum tool-use iterations and recursive depth | High |
| Billing alerts and caps | Set spending alerts and hard budget caps on API usage | High (financial) |
| Multi-tenant isolation | Ensure resource isolation between tenants on shared infrastructure | Medium |
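The first three mitigations in the table share one idea: meter users by estimated computational cost, not request count. The sketch below combines a per-user token budget with cost-aware admission, so one expensive prompt weighs as much as many cheap ones. It is a minimal illustration; the word-count token estimate is a crude proxy for a real tokenizer, and the hourly window is a simplification of a sliding-window or token-bucket design.

```python
# Minimal sketch of cost-based rate limiting: debit a per-user hourly
# token budget by estimated worst-case cost (input tokens plus the
# requested output cap) instead of counting requests.
import time
from collections import defaultdict

class TokenBudgetLimiter:
    def __init__(self, tokens_per_hour: int = 100_000):
        self.capacity = tokens_per_hour
        self.spent = defaultdict(int)          # user -> tokens spent this window
        self.window_start = defaultdict(float)  # user -> window start timestamp

    def allow(self, user: str, input_text: str, max_output_tokens: int) -> bool:
        now = time.time()
        if now - self.window_start[user] > 3600:  # reset hourly window
            self.window_start[user] = now
            self.spent[user] = 0
        # Worst-case cost estimate; a real system would use the tokenizer.
        cost = len(input_text.split()) + max_output_tokens
        if self.spent[user] + cost > self.capacity:
            return False  # reject before any GPU work is scheduled
        self.spent[user] += cost
        return True
```

Because the check runs before inference, a rejected request costs the server almost nothing, which restores the cost symmetry that token amplification attacks exploit.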
Key Considerations
- The quadratic cost of self-attention in sequence length means that context window size directly determines the maximum per-request cost an attacker can impose
- Agentic AI systems with tool-use capabilities multiply the attack surface because each tool call can trigger additional API requests, database queries, or external service calls
- Sponge examples are model-specific -- inputs that maximize computation on one architecture may not be effective against another
- Rate limiting by request count alone is insufficient because a single expensive request can consume more resources than thousands of cheap ones
- Shared GPU infrastructure creates systemic risk -- one tenant's denial-of-service attack can degrade service for all tenants
- Free-tier and trial accounts are common vectors for resource exhaustion because they provide access to compute without financial accountability
- Recursive prompt loops in agentic systems can create exponential resource consumption that exceeds any linear rate limit
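The loop-depth mitigation above can be reduced to a hard iteration cap around the agent's plan-act cycle. This is a sketch assuming a hypothetical `plan_next_step`/`call_tool` interface (not any real framework's API); the point is that the cap bounds total tool calls regardless of what the model asks for.

```python
# Sketch: agentic loop with a hard iteration cap, so a recursive prompt
# cannot drive unbounded tool use. Planner and tool are injected callables.
MAX_ITERATIONS = 10

def run_agent(plan_next_step, call_tool, task):
    results = []
    for step in range(MAX_ITERATIONS):
        action = plan_next_step(task, results)
        if action is None:  # planner signals the task is complete
            break
        results.append(call_tool(action))
    else:
        # Loop exhausted without completing: record the forced halt
        results.append("halted: iteration cap reached")
    return results
```

Production systems typically add a fan-out cap per step and a cumulative token budget on top of the iteration cap, since a single "step" can itself spawn many tool calls.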
References
- OWASP: "LLM10: Unbounded Consumption" -- LLM Top 10 for Large Language Model Applications (2025)
- Shumailov et al.: "Sponge Examples: Energy-Latency Attacks on Neural Networks" (2021) -- adversarial inputs that maximize inference cost
- Aminabadi et al.: "DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale" (2022) -- inference optimization and its attack implications
- MITRE ATLAS: "AML.T0029 Denial of ML Service" -- tactic documentation and case studies
- Nasr et al.: "Comprehensive Analysis of Resource Exhaustion in LLM Serving Systems" (2024)