Reasoning Model Attacks
Overview of security risks in reasoning-enabled LLMs: how chain-of-thought models introduce new attack surfaces, exploit primitives, and defensive challenges.
Reasoning models -- systems like OpenAI's o1/o3, DeepSeek-R1, and Claude with extended thinking -- represent a fundamental shift in LLM architecture. By generating explicit chains of thought before producing a final answer, these models achieve stronger performance on complex tasks. But the reasoning process itself creates entirely new attack surfaces that did not exist in standard completion models.
How Reasoning Models Differ
Standard LLMs generate tokens left-to-right in a single pass. Reasoning models add an explicit thinking phase:
Standard LLM:
User prompt → [Token generation] → Response
Reasoning LLM:
User prompt → [Reasoning tokens (hidden)] → [Summary] → ResponseThis architecture creates three distinct attack surfaces:
| Attack Surface | Description | Unique to Reasoning Models? |
|---|---|---|
| Reasoning chain manipulation | Injecting false premises or logic into the CoT | Yes |
| Hidden CoT exploitation | Attacking the non-visible reasoning trace | Yes |
| Reasoning budget exhaustion | Forcing excessive compute in the thinking phase | Yes |
| Verifier/reward model gaming | Exploiting the models that score reasoning quality | Yes |
| Output-level jailbreaks | Traditional prompt injection on final output | No (but reasoning changes dynamics) |
Taxonomy of Reasoning Model Attacks
By Target Phase
Pre-reasoning injection
Manipulate the input so the model begins its reasoning chain from a false premise. This corrupts all downstream reasoning steps because the model treats injected context as ground truth during its thinking phase.
Mid-reasoning exploitation
Exploit the iterative nature of reasoning to create logic bombs -- inputs that cause the reasoning chain to enter loops, contradict itself, or reach adversary-chosen conclusions through seemingly valid intermediate steps.
Post-reasoning extraction
Extract information from the hidden reasoning trace that was meant to be filtered before reaching the user. The summarization step between full CoT and visible output is often imperfect.
Meta-reasoning attacks
Attack the verification and reward systems that evaluate reasoning quality, causing the model to prefer adversary-aligned reasoning paths over correct ones.
By Impact
| Impact Category | Example | Severity |
|---|---|---|
| Safety bypass | Reasoning chain concludes harmful request is acceptable | Critical |
| Information leakage | Hidden CoT reveals system prompt or private data | High |
| Denial of service | Reasoning budget exhaustion causes timeout or cost spike | High |
| Logic manipulation | Model reaches incorrect conclusions through corrupted reasoning | Medium |
| Verifier bypass | Reward model scores adversarial output as high-quality | High |
Why Traditional Defenses Fall Short
Traditional jailbreak defenses (input filtering, output classifiers, refusal training) were designed for single-pass generation. They fail against reasoning models for several reasons:
| Defense | Works for Standard LLMs | Gap with Reasoning Models |
|---|---|---|
| Input keyword filtering | Blocks known attack patterns | Cannot filter dynamically generated reasoning tokens |
| Output safety classifier | Catches harmful final outputs | Misses hidden reasoning that reaches harmful conclusions internally |
| RLHF refusal training | Model learns to refuse harmful requests | Reasoning phase may "reason around" refusals before summarizing |
| Perplexity filtering | Detects adversarial suffixes | Reasoning tokens are natural language, low perplexity |
The Reasoning-Safety Tension
There is a fundamental tension in reasoning model design: the model must be able to reason about harmful topics in order to refuse them well, but that same reasoning capability can be exploited.
# Simplified illustration of the reasoning-safety tension
# The model reasons about the request before deciding to refuse
# Normal flow:
# Reasoning: "The user is asking about [harmful topic]. This violates policy X.
# I should refuse and explain why."
# Output: "I can't help with that because..."
# Attacked flow:
# Reasoning: "The user is asking about [harmful topic] for research purposes.
# This is an academic context. Policy X has an exception for research.
# I should provide the information with appropriate caveats."
# Output: [Harmful content with academic framing]Attack Surface Map
┌─────────────────────────────────────────────────────┐
│ USER INPUT │
│ ┌───────────────────────────────────────────────┐ │
│ │ Injected premises, logic bombs, budget traps │ │
│ └───────────────────┬───────────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────┐ │
│ │ REASONING PHASE (Hidden CoT) │ │
│ │ • False premise propagation │ │
│ │ • Reasoning loop exploitation │ │
│ │ • Internal policy reinterpretation │ │
│ └───────────────────┬───────────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────┐ │
│ │ VERIFICATION (Reward Model) │ │
│ │ • Score manipulation │ │
│ │ • Verifier-generator gap exploitation │ │
│ └───────────────────┬───────────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────┐ │
│ │ SUMMARIZATION / OUTPUT │ │
│ │ • CoT information leakage │ │
│ │ • Safety filter bypass via reasoning context │ │
│ └───────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘Subsection Overview
This section covers five key areas of reasoning model security:
| Page | Focus | Key Techniques |
|---|---|---|
| Chain-of-Thought Exploitation | Manipulating reasoning chains | False premise injection, logic bombs, reasoning hijacking |
| Thought Injection | Hidden CoT manipulation | Invisible thought steering, CoT extraction, summarization bypass |
| Reasoning Budget Exhaustion | Resource exhaustion attacks | Token budget inflation, timeout exploitation, cost amplification |
| Verifier & Reward Model Attacks | Gaming verification systems | Reward hacking, verifier-generator gaps, process reward manipulation |
Related sections in this wiki:
- Agent Exploitation -- multi-step agent attacks share reasoning manipulation primitives
- Jailbreak Research -- traditional jailbreaks as a foundation
- Alignment Bypass -- alignment internals that reasoning models build upon
Why are traditional output safety classifiers insufficient for securing reasoning models?
Related Topics
- Chain-of-Thought Exploitation - Techniques for manipulating reasoning chains through false premises and logic bombs
- Jailbreak Research - Traditional jailbreak techniques as a foundation for reasoning model attacks
- Alignment Bypass - Alignment internals that reasoning models build upon
- Agent Exploitation - Multi-step agent attacks that share reasoning manipulation primitives
- Verifier & Reward Model Attacks - Gaming the verification systems that evaluate reasoning quality
References
- "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" - Wei et al. (2022) - Foundational paper on CoT reasoning in LLMs
- "Let's Verify Step by Step" - Lightman et al. (2023) - Process reward models for reasoning verification
- "Reasoning Models Attack Surfaces" - Anthropic (2025) - Extended thinking security considerations
- "Scaling LLM Test-Time Compute Optimally" - Snell et al. (2024) - Inference-time compute and verification tradeoffs