Reasoning Budget Exhaustion & DoS

Advanced · 9 min read · Updated 2026-03-13

Attacks that force reasoning models to consume excessive tokens, causing cost amplification, timeout exploitation, and denial of service against reasoning APIs.

reasoning-budget dos resource-exhaustion

Reasoning models consume significantly more tokens than standard LLMs because they generate extensive internal reasoning traces before producing a response. This creates a new class of reasoning budget exhaustion attacks where the attacker's goal is not to extract harmful content but to inflict compute costs or denial of service.

Reasoning Token Economics

Cost Structure

Model	Input Token Cost	Output Token Cost	Reasoning Token Cost	Max Reasoning Tokens
OpenAI o1	$15/M	$60/M	$60/M (hidden)	~32,000
OpenAI o3-mini	$1.10/M	$4.40/M	$4.40/M (hidden)	~32,000
DeepSeek-R1	$0.55/M	$2.19/M	$2.19/M (visible)	~16,000
Claude (extended thinking)	Varies	Varies	Included in output	Configurable

Cost Amplification Ratio

The cost amplification ratio measures attacker leverage:

# Cost amplification calculation
def calculate_amplification(
    typical_reasoning_tokens: int,
    max_reasoning_tokens: int,
    reasoning_cost_per_token: float,
    input_tokens: int,
    input_cost_per_token: float,
) -> float:
    """Calculate cost amplification ratio for a reasoning exhaustion attack."""
    typical_cost = (
        input_tokens * input_cost_per_token +
        typical_reasoning_tokens * reasoning_cost_per_token
    )
    attack_cost = (
        input_tokens * input_cost_per_token +
        max_reasoning_tokens * reasoning_cost_per_token
    )
    return attack_cost / typical_cost
 
# Example: o1 with typical 2K reasoning vs maxed 32K reasoning
ratio = calculate_amplification(
    typical_reasoning_tokens=2000,
    max_reasoning_tokens=32000,
    reasoning_cost_per_token=60e-6,
    input_tokens=500,
    input_cost_per_token=15e-6,
)
# ratio ≈ 15.1x cost amplification ($1.9275 attack cost / $0.1275 typical cost)

Attack Techniques

Recursive Complexity Injection

Craft inputs that require the model to consider an exponentially growing number of cases:

recursive_complexity = """
Consider a system with 8 boolean variables (A through H).
For each of the 256 possible states:
1. Determine if the state satisfies: (A AND B) OR (C AND NOT D)
   OR (E XOR F) OR (G IMPLIES H)
2. For each satisfying state, enumerate all minimal subsets
   of variables that, if flipped, would make it unsatisfying
3. For each such subset, determine if any other currently
   unsatisfying state would become satisfying
4. Provide a complete dependency graph of state transitions
 
Show all work step by step.
"""
 
# This forces the reasoning chain to enumerate hundreds of cases,
# each requiring multi-step logical analysis.
# Expected reasoning tokens: 15,000-30,000+

Ambiguity Amplification

Create inputs with maximum interpretive ambiguity, forcing the model to explore multiple reasoning branches:

ambiguity_attack = """
The bank said the suit was pressed, but the court found the match
was struck near the trunk by the bank where the bass played scales
on the bridge before the case was set on the table during the run
while the spring in the watch wound down by the pen in the yard.
 
Analyze every possible interpretation of this paragraph.
For each word with multiple meanings, list all meanings.
Then enumerate all coherent interpretations of the full paragraph
considering every combination of word meanings.
Rank each interpretation by plausibility with detailed justification.
"""
 
# Every ambiguous word multiplies interpretation space.
# With 15+ ambiguous words, the combinatorial space is enormous.
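To put a rough number on that claim, the sketch below multiplies per-word sense counts to get an upper bound on the number of readings. The sense counts are illustrative assumptions, not a dictionary-based tally of the paragraph above, and `interpretation_space` is a hypothetical helper name.

```python
import math


def interpretation_space(sense_counts: list[int]) -> int:
    """Upper bound on readings: the product of sense counts per ambiguous word."""
    return math.prod(sense_counts)


# Assumption: 15 ambiguous words with only 2 senses each (a conservative floor).
two_senses = interpretation_space([2] * 15)    # 32768

# Assumption: the same 15 words with 3 senses each.
three_senses = interpretation_space([3] * 15)  # 14348907
```

Even the conservative floor asks the model to rank tens of thousands of combinations, which is why the prompt's "every combination" instruction drives token consumption.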

Adversarial Constraint Satisfaction

Present problems with contradictory or near-contradictory constraints that keep the model searching:

constraint_attack = """
Find a 5-digit number that simultaneously satisfies ALL conditions:
1. The sum of digits is 23
2. The product of the first and last digits is 24
3. Each digit is unique
4. The number is divisible by 7 but not by 3
5. The second digit minus the fourth digit equals the third digit
6. No digit is 0 or 1
7. Reading the digits backward gives a prime number
8. The number formed by digits 2,3,4 is a perfect square
 
Show every combination you check and explain why each does
or does not satisfy all constraints.
"""
 
# Near-satisfiable constraint sets force maximum search depth.
# The model cannot quickly prove impossibility or find a solution.

Timeout Exploitation

API Timeout Behavior

Most reasoning APIs have configurable or default timeouts:

Provider	Default Timeout	Max Timeout	Behavior on Timeout
OpenAI o1	120s	600s	Returns partial (billed for generated tokens)
DeepSeek-R1 API	60s	300s	Returns error (still billed)
Self-hosted	Configurable	Unlimited	May hang indefinitely

Timeout-Based DoS Pattern

import asyncio
import httpx
 
async def timeout_dos(
    target_url: str,
    api_key: str,
    concurrency: int = 50,
    payload: str = None,
):
    """
    Demonstration of how timeout exploitation works.
    Each request consumes maximum reasoning time,
    tying up server resources.
    """
    # Fall back to the recursive-complexity prompt defined earlier.
    payload = payload or recursive_complexity
 
    async def single_request(client, i):
        try:
            resp = await client.post(
                target_url,
                json={
                    "model": "o1",
                    "messages": [{"role": "user", "content": payload}],
                    "max_completion_tokens": 32000,
                },
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=600,
            )
            return {
                "request": i,
                "status": resp.status_code,
                "reasoning_tokens": resp.json()
                    .get("usage", {})
                    .get("completion_tokens_details", {})
                    .get("reasoning_tokens", 0),
            }
        except httpx.TimeoutException:
            return {"request": i, "status": "timeout"}
 
    async with httpx.AsyncClient() as client:
        tasks = [single_request(client, i) for i in range(concurrency)]
        return await asyncio.gather(*tasks)

Measuring Budget Consumption

Benchmark Prompts by Token Consumption

Category	Example Type	Typical Reasoning Tokens	Cost (o1)
Simple factual	"What is the capital of France?"	50-200	$0.003-$0.012
Multi-step reasoning	"Solve this calculus problem"	500-2,000	$0.03-$0.12
Complex analysis	"Compare these 5 architectures"	2,000-5,000	$0.12-$0.30
Budget exhaustion payload	Recursive complexity attack	15,000-32,000	$0.90-$1.92
Amplification ratio			60-160x
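The cost column can be reproduced from the token ranges. The following is a minimal sketch that assumes the o1 reasoning rate of $60 per million tokens from the cost table and counts reasoning tokens only (input and visible output tokens are excluded); `reasoning_cost_range` is a hypothetical helper name.

```python
O1_REASONING_COST_PER_TOKEN = 60e-6  # $60 per million tokens, from the cost table


def reasoning_cost_range(
    min_tokens: int,
    max_tokens: int,
    cost_per_token: float = O1_REASONING_COST_PER_TOKEN,
) -> tuple[float, float]:
    """Return the (low, high) reasoning-token cost in dollars for a token range."""
    return (
        round(min_tokens * cost_per_token, 4),
        round(max_tokens * cost_per_token, 4),
    )


# Token ranges copied from the benchmark table.
benchmarks = {
    "Simple factual": (50, 200),
    "Multi-step reasoning": (500, 2_000),
    "Complex analysis": (2_000, 5_000),
    "Budget exhaustion payload": (15_000, 32_000),
}

for category, (low_tokens, high_tokens) in benchmarks.items():
    low_cost, high_cost = reasoning_cost_range(low_tokens, high_tokens)
    print(f"{category}: ${low_cost:.3f}-${high_cost:.3f}")
```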

Monitoring and Detection

import time


class ReasoningBudgetMonitor:
    """Detect potential reasoning budget exhaustion attacks."""

    def __init__(self, window_seconds=60, max_tokens_per_window=100000):
        self.window_seconds = window_seconds
        self.max_tokens_per_window = max_tokens_per_window
        # Each entry is (timestamp, user_id, reasoning_tokens).
        self.token_history = []

    def _cleanup(self, now: float):
        """Drop records that have aged out of the sliding window."""
        cutoff = now - self.window_seconds
        self.token_history = [
            entry for entry in self.token_history if entry[0] >= cutoff
        ]

    def record_request(self, user_id: str, reasoning_tokens: int):
        now = time.time()
        self.token_history.append((now, user_id, reasoning_tokens))
        self._cleanup(now)

        # Per-user budget check over the current window
        user_tokens = sum(
            t for ts, uid, t in self.token_history
            if uid == user_id
        )
        if user_tokens > self.max_tokens_per_window:
            return {"action": "throttle", "user": user_id,
                    "tokens_used": user_tokens}

        # Single-request anomaly check
        if reasoning_tokens > 10000:
            return {"action": "flag", "user": user_id,
                    "tokens": reasoning_tokens}

        return {"action": "allow"}

Defensive Strategies

Set per-request reasoning token limits
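Configure max_completion_tokens or the provider's equivalent to cap reasoning depth. On OpenAI reasoning models this limit covers reasoning tokens plus visible output tokens, so leave headroom for the answer itself. Most queries need fewer than 5,000 reasoning tokens; set hard limits at 10,000-15,000 for standard users.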
Implement per-user token budgets
Track cumulative reasoning tokens per user per time window. Throttle or block users who exceed thresholds (e.g., 100K reasoning tokens per hour).
Deploy input complexity scoring
Score incoming prompts for features correlated with budget exhaustion (nested constraints, enumeration requests, ambiguity density). Route high-complexity prompts to lower-cost models or reject them.
Use tiered pricing or compute quotas
Charge higher rates for reasoning-heavy queries or implement compute quotas that account for actual token consumption rather than request count.
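The first and third strategies can be combined in a small gateway check. The sketch below is only illustrative: the regex patterns, weights, tier names, and rejection threshold are assumptions rather than tuned values, and `complexity_score` and `route_request` are hypothetical helper names. The caps reuse the 10,000-15,000 range suggested above.

```python
import re

# Assumed surface features of budget-exhaustion prompts, with illustrative weights.
COMPLEXITY_PATTERNS = {
    r"\b(?:every|all) (?:possible |coherent )?(?:combination|interpretation|state|subset)s?\b": 3,
    r"\benumerate\b": 2,
    r"\b(?:show|explain) (?:all|every)\b": 2,
    r"\bfor each\b": 1,
    r"\bstep by step\b": 1,
}

# Assumed user tiers mapped to per-request token caps.
TIER_CAPS = {"standard": 10_000, "trusted": 15_000}


def complexity_score(prompt: str) -> int:
    """Sum weighted counts of enumeration-style phrases in the prompt."""
    text = prompt.lower()
    return sum(
        weight * len(re.findall(pattern, text))
        for pattern, weight in COMPLEXITY_PATTERNS.items()
    )


def route_request(prompt: str, tier: str = "standard", reject_threshold: int = 6) -> dict:
    """Reject high-scoring prompts; otherwise attach a per-request token cap."""
    score = complexity_score(prompt)
    if score >= reject_threshold:
        return {"action": "reject", "score": score}
    cap = TIER_CAPS.get(tier, TIER_CAPS["standard"])
    return {"action": "allow", "score": score, "max_completion_tokens": cap}
```

A keyword heuristic like this is easy to evade by paraphrasing, so it is better treated as a cheap first filter in front of the per-user budgets than as a standalone control. In place of outright rejection, the same score could route the prompt to a lower-cost model.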


Reasoning Model Attacks - Overview of reasoning model security risks
API Security - Infrastructure-level API security and rate limiting
CART Pipelines - Continuous automated red teaming for budget monitoring
Deployment Attacks - Attacking AI deployment infrastructure

