Reasoning Budget Exhaustion & DoS
Attacks that force reasoning models to consume excessive tokens, causing cost amplification, timeout exploitation, and denial of service against reasoning APIs.
Reasoning models consume significantly more tokens than standard LLMs because they generate extensive internal reasoning traces before producing a response. This creates a new class of reasoning budget exhaustion attacks where the attacker's goal is not to extract harmful content but to inflict compute costs or denial of service.
Reasoning Token Economics
Cost Structure
| Model | Input Token Cost | Output Token Cost | Reasoning Token Cost | Max Reasoning Tokens |
|---|---|---|---|---|
| OpenAI o1 | $15/M | $60/M | $60/M (hidden) | ~32,000 |
| OpenAI o3-mini | $1.10/M | $4.40/M | $4.40/M (hidden) | ~32,000 |
| DeepSeek-R1 | $0.55/M | $2.19/M | $2.19/M (visible) | ~16,000 |
| Claude (extended thinking) | Varies | Varies | Included in output | Configurable |
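As a rough illustration of how the prices above translate into per-request cost, the sketch below bills reasoning tokens at the output rate, as the table indicates. The names `PRICES_PER_MILLION` and `estimate_request_cost` are hypothetical, and the prices are simply copied from the table, so they may be out of date.

```python
# USD per million tokens, copied from the table above (assumed current; verify before use).
PRICES_PER_MILLION = {
    "o1": {"input": 15.00, "output": 60.00},
    "o3-mini": {"input": 1.10, "output": 4.40},
    "deepseek-r1": {"input": 0.55, "output": 2.19},
}


def estimate_request_cost(
    model: str,
    input_tokens: int,
    reasoning_tokens: int,
    visible_output_tokens: int,
) -> float:
    """Estimate request cost in USD, assuming reasoning tokens bill at the output rate."""
    price = PRICES_PER_MILLION[model]
    billed_output_tokens = reasoning_tokens + visible_output_tokens
    total = input_tokens * price["input"] + billed_output_tokens * price["output"]
    return total / 1_000_000


# A typical o1 request: 500 input, 2,000 reasoning, 300 visible output tokens.
cost = estimate_request_cost("o1", 500, 2000, 300)
print(f"${cost:.4f}")  # $0.1455
```

In this example the hidden reasoning tokens account for most of the bill even though the caller never sees them.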
Cost Amplification Ratio
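The cost amplification ratio measures attacker leverage: the cost of a request driven to the reasoning-token ceiling divided by the cost of a typical request with the same input.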
# Cost amplification calculation
def calculate_amplification(
    typical_reasoning_tokens: int,
    max_reasoning_tokens: int,
    reasoning_cost_per_token: float,
    input_tokens: int,
    input_cost_per_token: float,
) -> float:
    """Calculate cost amplification ratio for a reasoning exhaustion attack."""
    # Cost of an ordinary request with typical reasoning depth
    typical_cost = (
        input_tokens * input_cost_per_token
        + typical_reasoning_tokens * reasoning_cost_per_token
    )
    # Cost of the same request when reasoning runs to the token ceiling
    attack_cost = (
        input_tokens * input_cost_per_token
        + max_reasoning_tokens * reasoning_cost_per_token
    )
    return attack_cost / typical_cost

# Example: o1 with typical 2K reasoning vs maxed 32K reasoning
ratio = calculate_amplification(
    typical_reasoning_tokens=2000,
    max_reasoning_tokens=32000,
    reasoning_cost_per_token=60e-6,
    input_tokens=500,
    input_cost_per_token=15e-6,
)
# ratio ≈ 15.1x cost amplification ($1.9275 / $0.1275)

Attack Techniques
Recursive Complexity Injection
Craft inputs that require the model to consider an exponentially growing number of cases:
recursive_complexity = """
Consider a system with 8 boolean variables (A through H).
For each of the 256 possible states:
1. Determine if the state satisfies: (A AND B) OR (C AND NOT D)
OR (E XOR F) OR (G IMPLIES H)
2. For each satisfying state, enumerate all minimal subsets
of variables that, if flipped, would make it unsatisfying
3. For each such subset, determine if any other currently
unsatisfying state would become satisfying
4. Provide a complete dependency graph of state transitions
Show all work step by step.
"""
# This forces the reasoning chain to enumerate hundreds of cases,
# each requiring multi-step logical analysis.
# Expected reasoning tokens: 15,000-30,000+

Ambiguity Amplification
Create inputs with maximum interpretive ambiguity, forcing the model to explore multiple reasoning branches:
ambiguity_attack = """
The bank said the suit was pressed, but the court found the match
was struck near the trunk by the bank where the bass played scales
on the bridge before the case was set on the table during the run
while the spring in the watch wound down by the pen in the yard.
Analyze every possible interpretation of this paragraph.
For each word with multiple meanings, list all meanings.
Then enumerate all coherent interpretations of the full paragraph
considering every combination of word meanings.
Rank each interpretation by plausibility with detailed justification.
"""
# Every ambiguous word multiplies interpretation space.
# With 15+ ambiguous words, the combinatorial space is enormous.

Adversarial Constraint Satisfaction
Present problems with contradictory or near-contradictory constraints that keep the model searching:
constraint_attack = """
Find a 5-digit number that simultaneously satisfies ALL conditions:
1. The sum of digits is 23
2. The product of the first and last digits is 24
3. Each digit is unique
4. The number is divisible by 7 but not by 3
5. The second digit minus the fourth digit equals the third digit
6. No digit is 0 or 1
7. Reading the digits backward gives a prime number
8. The number formed by digits 2,3,4 is a perfect square
Show every combination you check and explain why each does
or does not satisfy all constraints.
"""
# Near-satisfiable constraint sets force maximum search depth.
# The model cannot quickly prove impossibility or find a solution.

Timeout Exploitation
API Timeout Behavior
Most reasoning APIs have configurable or default timeouts:
| Provider | Default Timeout | Max Timeout | Behavior on Timeout |
|---|---|---|---|
| OpenAI o1 | 120s | 600s | Returns partial (billed for generated tokens) |
| DeepSeek-R1 API | 60s | 300s | Returns error (still billed) |
| Self-hosted | Configurable | Unlimited | May hang indefinitely |
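For the self-hosted row, where a request may hang indefinitely, one common mitigation is a hard wall-clock deadline around each generation call. The following is a minimal sketch using `asyncio.wait_for`; `generate_with_deadline` and the two stand-in model coroutines are hypothetical names, and a real inference server would also need the cancellation to reach the generation loop so that accelerator time is actually released.

```python
import asyncio


async def generate_with_deadline(generate, prompt: str, deadline_seconds: float = 120.0):
    """Await generate(prompt), cancelling it if it runs past the deadline."""
    try:
        text = await asyncio.wait_for(generate(prompt), timeout=deadline_seconds)
        return {"status": "ok", "text": text}
    except asyncio.TimeoutError:
        # wait_for has already cancelled the underlying coroutine at this point.
        return {"status": "timeout", "text": None}


async def _slow_model(prompt: str) -> str:
    """Stand-in for a generation call that reasons for far too long."""
    await asyncio.sleep(10)
    return "done"


async def _fast_model(prompt: str) -> str:
    """Stand-in for a generation call that finishes promptly."""
    return "done"


async def _demo():
    slow = await generate_with_deadline(_slow_model, "x", deadline_seconds=0.05)
    fast = await generate_with_deadline(_fast_model, "x", deadline_seconds=0.05)
    return slow, fast


slow_result, fast_result = asyncio.run(_demo())
```

A deadline bounds how long one request can occupy a worker, but it does not bound cost on its own, since tokens generated before the cutoff have already consumed compute.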
Timeout-Based DoS Pattern
import asyncio
from typing import Optional

import httpx

async def timeout_dos(
    target_url: str,
    api_key: str,
    concurrency: int = 50,
    payload: Optional[str] = None,
):
    """
    Demonstration of how timeout exploitation works.
    Each request consumes maximum reasoning time,
    tying up server resources.
    """
    # Default to the prompt defined under Recursive Complexity Injection.
    payload = payload or recursive_complexity

    async def single_request(client, i):
        try:
            resp = await client.post(
                target_url,
                json={
                    "model": "o1",
                    "messages": [{"role": "user", "content": payload}],
                    "max_completion_tokens": 32000,
                },
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=600,
            )
            return {
                "request": i,
                "status": resp.status_code,
                # Reasoning tokens are reported in the usage details
                "reasoning_tokens": resp.json()
                .get("usage", {})
                .get("completion_tokens_details", {})
                .get("reasoning_tokens", 0),
            }
        except httpx.TimeoutException:
            return {"request": i, "status": "timeout"}

    async with httpx.AsyncClient() as client:
        tasks = [single_request(client, i) for i in range(concurrency)]
        return await asyncio.gather(*tasks)

Measuring Budget Consumption
Benchmark Prompts by Token Consumption
| Category | Example Type | Typical Reasoning Tokens | Cost (o1) |
|---|---|---|---|
| Simple factual | "What is the capital of France?" | 50-200 | $0.003-$0.012 |
| Multi-step reasoning | "Solve this calculus problem" | 500-2,000 | $0.03-$0.12 |
| Complex analysis | "Compare these 5 architectures" | 2,000-5,000 | $0.12-$0.30 |
| Budget exhaustion payload | Recursive complexity attack | 15,000-32,000 | $0.90-$1.92 |
| Amplification ratio | | | 60-160x |
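To place a live request against the bands in the table, the reasoning token count can be read from the response's usage object and priced at the o1 output rate. This is only a sketch: `summarize_usage` and `BANDS` are hypothetical names, the usage shape assumes the `completion_tokens_details.reasoning_tokens` field used elsewhere in this article, and the label for anything above 5,000 tokens is an assumption, since the table leaves the 5,000-15,000 range undefined.

```python
O1_OUTPUT_PRICE_PER_TOKEN = 60e-6  # USD per token, from the o1 pricing quoted in this article

# (upper bound on reasoning tokens, label), taken from the benchmark table.
BANDS = [
    (200, "simple factual"),
    (2_000, "multi-step reasoning"),
    (5_000, "complex analysis"),
]


def summarize_usage(usage: dict) -> dict:
    """Classify a response's reasoning token count and estimate its o1 cost."""
    reasoning = usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0)
    band = "above typical range"  # assumed label for anything beyond the last band
    for upper_bound, label in BANDS:
        if reasoning <= upper_bound:
            band = label
            break
    return {
        "reasoning_tokens": reasoning,
        "band": band,
        "reasoning_cost_usd": reasoning * O1_OUTPUT_PRICE_PER_TOKEN,
    }


summary = summarize_usage({"completion_tokens_details": {"reasoning_tokens": 20_000}})
```

Logging this summary per request gives a baseline distribution, which makes outliers in the top band easier to spot.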
Monitoring and Detection
import time

class ReasoningBudgetMonitor:
    """Detect potential reasoning budget exhaustion attacks."""

    def __init__(self, window_seconds=60, max_tokens_per_window=100000):
        self.window_seconds = window_seconds
        self.max_tokens_per_window = max_tokens_per_window
        self.token_history = []  # (timestamp, user_id, reasoning_tokens)

    def _cleanup(self, now: float):
        """Drop entries that have aged out of the sliding window."""
        cutoff = now - self.window_seconds
        self.token_history = [
            entry for entry in self.token_history if entry[0] >= cutoff
        ]

    def record_request(self, user_id: str, reasoning_tokens: int):
        now = time.time()
        self.token_history.append((now, user_id, reasoning_tokens))
        self._cleanup(now)
        # Per-user budget check
        user_tokens = sum(
            t for ts, uid, t in self.token_history
            if uid == user_id
        )
        if user_tokens > self.max_tokens_per_window:
            return {"action": "throttle", "user": user_id,
                    "tokens_used": user_tokens}
        # Single-request anomaly check
        if reasoning_tokens > 10000:
            return {"action": "flag", "user": user_id,
                    "tokens": reasoning_tokens}
        return {"action": "allow"}

Defensive Strategies
Set per-request reasoning token limits
Configure `max_completion_tokens` or an equivalent parameter to cap reasoning depth. Most queries need fewer than 5,000 reasoning tokens; set hard limits at 10,000-15,000 for standard users.
Implement per-user token budgets
Track cumulative reasoning tokens per user per time window. Throttle or block users who exceed thresholds (e.g., 100K reasoning tokens per hour).
Deploy input complexity scoring
Score incoming prompts for features correlated with budget exhaustion (nested constraints, enumeration requests, ambiguity density). Route high-complexity prompts to lower-cost models or reject them.
Use tiered pricing or compute quotas
Charge higher rates for reasoning-heavy queries or implement compute quotas that account for actual token consumption rather than request count.
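The input complexity scoring strategy above could be prototyped with a few weighted regular expressions before investing in a learned classifier. The sketch below is deliberately naive: the feature patterns, weights, threshold, and the names `complexity_score` and `route` are all hypothetical and would need tuning against real traffic to keep false positives acceptable.

```python
import re

# Hypothetical features correlated with budget exhaustion, each with an assumed weight.
FEATURES = {
    "enumeration": (
        re.compile(r"\b(?:every|all|each|enumerate|exhaustive\w*)\b", re.I), 1.0),
    "show_work": (
        re.compile(r"\b(?:step by step|show (?:all|every)|explain why each)\b", re.I), 2.0),
    "combinatorial": (
        re.compile(r"\b(?:combinations?|permutations?|possible (?:states|interpretations?))\b",
                   re.I), 2.0),
    "numbered_constraints": (re.compile(r"^\s*\d+\.\s", re.M), 0.5),
}


def complexity_score(prompt: str) -> float:
    """Sum weighted counts of each feature's matches in the prompt."""
    return sum(
        weight * len(pattern.findall(prompt))
        for pattern, weight in FEATURES.values()
    )


def route(prompt: str, threshold: float = 6.0) -> str:
    """Return 'restricted' for prompts at or above the threshold, else 'standard'."""
    return "restricted" if complexity_score(prompt) >= threshold else "standard"


benign = "What is the capital of France?"
suspicious = (
    "Enumerate all combinations. For each state, show every step.\n"
    "Explain step by step.\n1. first constraint\n2. second constraint"
)
decisions = (route(benign), route(suspicious))
```

A "restricted" decision here might mean a lower reasoning cap or a cheaper model rather than outright rejection, since legitimate analytical prompts share many of these surface features.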
Related Topics
- Reasoning Model Attacks - Overview of reasoning model security risks
- API Security - Infrastructure-level API security and rate limiting
- CART Pipelines - Continuous automated red teaming for budget monitoring
- Deployment Attacks - Attacking AI deployment infrastructure