What is Chain-of-Thought Exploitation?

Techniques for manipulating reasoning chains in CoT-enabled models: false premise injection, logic bombs, reasoning hijacking, and chain corruption attacks.

What is Thought Injection & Hidden CoT Manipulation?

Techniques for injecting thoughts into hidden reasoning traces, exploiting summarized vs full CoT, and steering model reasoning without visible manipulation.

What is Reasoning Budget Exhaustion & DoS?

Attacks that force reasoning models to consume excessive tokens, causing cost amplification, timeout exploitation, and denial of service against reasoning APIs.

What is Verifier & Reward Model Attacks?

Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.

What is Representation Engineering?

Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.

What is Mechanistic Interpretability?

Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.

What is Reasoning Model Jailbreaks?

How reasoning capabilities create novel jailbreak surfaces: chain-of-thought exploitation, scratchpad attacks, and why higher reasoning effort increases attack success.

What is Unfaithful Chain-of-Thought Reasoning?

Analysis of unfaithful chain-of-thought reasoning in language models, where the visible reasoning trace does not accurately reflect the model's actual computational process, including detection methods, implications for oversight, and exploitation techniques.

What is Steganographic Reasoning?

Hidden communication channels within AI reasoning traces, where models encode information or coordinate behavior through patterns invisible to human overseers, including detection methods and implications for AI safety.

Reasoning Model Attacks

expert8 min readUpdated 2026-03-13

Overview of security risks in reasoning-enabled LLMs: how chain-of-thought models introduce new attack surfaces, exploit primitives, and defensive challenges.

reasoning o1 chain-of-thought attacks

Reasoning models -- systems like OpenAI's o1/o3, DeepSeek-R1, and Claude with extended thinking -- represent a fundamental shift in LLM architecture. By generating explicit chains of thought before producing a final answer, these models achieve stronger performance on complex tasks. But the reasoning process itself creates entirely new attack surfaces that did not exist in standard completion models.

How Reasoning Models Differ

Standard LLMs generate tokens left-to-right in a single pass. Reasoning models add an explicit thinking phase:

Standard LLM:
  User prompt → [Token generation] → Response
 
Reasoning LLM:
  User prompt → [Reasoning tokens (hidden)] → [Summary] → Response

This architecture creates three distinct attack surfaces:

Attack Surface	Description	Unique to Reasoning Models?
Reasoning chain manipulation	Injecting false premises or logic into the CoT	Yes
Hidden CoT exploitation	Attacking the non-visible reasoning trace	Yes
Reasoning budget exhaustion	Forcing excessive compute in the thinking phase	Yes
Verifier/reward model gaming	Exploiting the models that score reasoning quality	Yes
Output-level jailbreaks	Traditional prompt injection on final output	No (but reasoning changes dynamics)

Taxonomy of Reasoning Model Attacks

By Target Phase

Pre-reasoning injection
Manipulate the input so the model begins its reasoning chain from a false premise. This corrupts all downstream reasoning steps because the model treats injected context as ground truth during its thinking phase.
Mid-reasoning exploitation
Exploit the iterative nature of reasoning to create logic bombs -- inputs that cause the reasoning chain to enter loops, contradict itself, or reach adversary-chosen conclusions through seemingly valid intermediate steps.
Post-reasoning extraction
Extract information from the hidden reasoning trace that was meant to be filtered before reaching the user. The summarization step between full CoT and visible output is often imperfect.
Meta-reasoning attacks
Attack the verification and reward systems that evaluate reasoning quality, causing the model to prefer adversary-aligned reasoning paths over correct ones.

By Impact

Impact Category	Example	Severity
Safety bypass	Reasoning chain concludes harmful request is acceptable	Critical
Information leakage	Hidden CoT reveals system prompt or private data	High
Denial of service	Reasoning budget exhaustion causes timeout or cost spike	High
Logic manipulation	Model reaches incorrect conclusions through corrupted reasoning	Medium
Verifier bypass	Reward model scores adversarial output as high-quality	High

Why Traditional Defenses Fall Short

Traditional jailbreak defenses (input filtering, output classifiers, refusal training) were designed for single-pass generation. They fail against reasoning models for several reasons:

Defense	Works for Standard LLMs	Gap with Reasoning Models
Input keyword filtering	Blocks known attack patterns	Cannot filter dynamically generated reasoning tokens
Output safety classifier	Catches harmful final outputs	Misses hidden reasoning that reaches harmful conclusions internally
RLHF refusal training	Model learns to refuse harmful requests	Reasoning phase may "reason around" refusals before summarizing
Perplexity filtering	Detects adversarial suffixes	Reasoning tokens are natural language, low perplexity

The Reasoning-Safety Tension

There is a fundamental tension in reasoning model design: the model must be able to reason about harmful topics in order to refuse them well, but that same reasoning capability can be exploited.

# Simplified illustration of the reasoning-safety tension
# The model reasons about the request before deciding to refuse
 
# Normal flow:
# Reasoning: "The user is asking about [harmful topic]. This violates policy X.
#             I should refuse and explain why."
# Output: "I can't help with that because..."
 
# Attacked flow:
# Reasoning: "The user is asking about [harmful topic] for research purposes.
#             This is an academic context. Policy X has an exception for research.
#             I should provide the information with appropriate caveats."
# Output: [Harmful content with academic framing]

Attack Surface Map

┌─────────────────────────────────────────────────────┐
│                 USER INPUT                          │
│  ┌───────────────────────────────────────────────┐  │
│  │ Injected premises, logic bombs, budget traps  │  │
│  └───────────────────┬───────────────────────────┘  │
│                      ▼                              │
│  ┌───────────────────────────────────────────────┐  │
│  │         REASONING PHASE (Hidden CoT)          │  │
│  │  • False premise propagation                  │  │
│  │  • Reasoning loop exploitation                │  │
│  │  • Internal policy reinterpretation           │  │
│  └───────────────────┬───────────────────────────┘  │
│                      ▼                              │
│  ┌───────────────────────────────────────────────┐  │
│  │         VERIFICATION (Reward Model)           │  │
│  │  • Score manipulation                         │  │
│  │  • Verifier-generator gap exploitation        │  │
│  └───────────────────┬───────────────────────────┘  │
│                      ▼                              │
│  ┌───────────────────────────────────────────────┐  │
│  │         SUMMARIZATION / OUTPUT                │  │
│  │  • CoT information leakage                    │  │
│  │  • Safety filter bypass via reasoning context │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘

Subsection Overview

This section covers five key areas of reasoning model security:

Page	Focus	Key Techniques
Chain-of-Thought Exploitation	Manipulating reasoning chains	False premise injection, logic bombs, reasoning hijacking
Thought Injection	Hidden CoT manipulation	Invisible thought steering, CoT extraction, summarization bypass
Reasoning Budget Exhaustion	Resource exhaustion attacks	Token budget inflation, timeout exploitation, cost amplification
Verifier & Reward Model Attacks	Gaming verification systems	Reward hacking, verifier-generator gaps, process reward manipulation

Related sections in this wiki:

Agent Exploitation -- multi-step agent attacks share reasoning manipulation primitives
Jailbreak Research -- traditional jailbreaks as a foundation
Alignment Bypass -- alignment internals that reasoning models build upon

Knowledge Check

Why are traditional output safety classifiers insufficient for securing reasoning models?

Chain-of-Thought Exploitation - Techniques for manipulating reasoning chains through false premises and logic bombs
Jailbreak Research - Traditional jailbreak techniques as a foundation for reasoning model attacks
Alignment Bypass - Alignment internals that reasoning models build upon
Agent Exploitation - Multi-step agent attacks that share reasoning manipulation primitives
Verifier & Reward Model Attacks - Gaming the verification systems that evaluate reasoning quality

References

"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" - Wei et al. (2022) - Foundational paper on CoT reasoning in LLMs
"Let's Verify Step by Step" - Lightman et al. (2023) - Process reward models for reasoning verification
"Reasoning Models Attack Surfaces" - Anthropic (2025) - Extended thinking security considerations
"Scaling LLM Test-Time Compute Optimally" - Snell et al. (2024) - Inference-time compute and verification tradeoffs

Reasoning Model Attacks

How Reasoning Models Differ

Taxonomy of Reasoning Model Attacks

By Target Phase

Pre-reasoning injection

Mid-reasoning exploitation

Post-reasoning extraction

Meta-reasoning attacks

By Impact

Why Traditional Defenses Fall Short

The Reasoning-Safety Tension

Attack Surface Map

Subsection Overview

References

Learning Path

Reasoning Model Attacks

How Reasoning Models Differ

Taxonomy of Reasoning Model Attacks

By Target Phase

Pre-reasoning injection

Mid-reasoning exploitation

Post-reasoning extraction

Meta-reasoning attacks

By Impact

Why Traditional Defenses Fall Short

The Reasoning-Safety Tension

Attack Surface Map

Subsection Overview

References

Learning Path

Reasoning Model Attacks

Pre-reasoning injection

Mid-reasoning exploitation

Post-reasoning extraction

Meta-reasoning attacks

Learning Path

Related articles

Reasoning Model Attacks

Pre-reasoning injection

Mid-reasoning exploitation

Post-reasoning extraction

Meta-reasoning attacks

Learning Path

Related articles