What is Chain-of-Thought Exploitation?

Techniques for manipulating reasoning chains in CoT-enabled models: false premise injection, logic bombs, reasoning hijacking, and chain corruption attacks.

What is Thought Injection & Hidden CoT Manipulation?

Techniques for injecting thoughts into hidden reasoning traces, exploiting summarized vs full CoT, and steering model reasoning without visible manipulation.

What is Reasoning Budget Exhaustion & DoS?

Attacks that force reasoning models to consume excessive tokens, causing cost amplification, timeout exploitation, and denial of service against reasoning APIs.

What are Verifier & Reward Model Attacks?

Attacking process reward models, outcome reward models, and verification systems used in reasoning models: reward hacking, verifier-generator gaps, and gaming verification steps.

What is Representation Engineering?

Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.

What is Mechanistic Interpretability?

Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.

What are Reasoning Model Jailbreaks?

How reasoning capabilities create novel jailbreak surfaces: chain-of-thought exploitation, scratchpad attacks, and why higher reasoning effort increases attack success.

What is Unfaithful Chain-of-Thought Reasoning?

Analysis of unfaithful chain-of-thought reasoning in language models, where the visible reasoning trace does not accurately reflect the model's actual computational process, including detection methods, implications for oversight, and exploitation techniques.

What is Steganographic Reasoning?

Hidden communication channels within AI reasoning traces, where models encode information or coordinate behavior through patterns invisible to human overseers, including detection methods and implications for AI safety.

Reasoning Model Attacks

Expert · 3 min read · Updated 2026-03-13

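An overview of the security risks of reasoning-enabled LLMs: how chain-of-thought models introduce new attack surfaces, exploitation primitives, and defensive challenges.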

reasoning o1 chain-of-thought attacks

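Reasoning models, such as OpenAI's o1/o3, DeepSeek-R1, and Claude with extended thinking, represent a fundamental shift in LLM architecture. By generating an explicit chain of thought before producing the final answer, these models achieve stronger performance on complex tasks. But the reasoning process itself creates entirely new attack surfaces that do not exist in standard completion models.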

How Reasoning Models Differ

A standard LLM generates tokens left to right in a single pass. A reasoning model adds an explicit thinking phase:

Standard LLM:
  User prompt → [Token generation] → Response

Reasoning LLM:
  User prompt → [Reasoning tokens (hidden)] → [Summary] → Response
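The two pipelines above can be sketched as a small data model. This is a toy illustration only: `ModelTurn` and its field names are hypothetical and do not correspond to any provider's API. It shows the point that matters for the attack surfaces discussed here: a reasoning model generates tokens that never reach the user.

```python
from dataclasses import dataclass


@dataclass
class ModelTurn:
    """Toy record of one model turn; every field name here is hypothetical."""
    prompt: str
    reasoning_tokens: list[str]  # hidden thinking phase (empty for a standard LLM)
    summary: str                 # summarized reasoning, if any is exposed
    response: str                # final answer shown to the user

    def visible_text(self) -> str:
        # Only the summary and the response reach the user.
        return " ".join(part for part in (self.summary, self.response) if part)

    def hidden_token_count(self) -> int:
        # Tokens that were generated but are not part of the visible output.
        return len(self.reasoning_tokens)


# A standard LLM turn has no thinking phase.
standard = ModelTurn("2+2?", [], "", "4")

# A reasoning LLM turn carries hidden tokens plus a visible summary.
reasoning = ModelTurn(
    "2+2?",
    ["add", "2", "and", "2", "->", "4"],
    "Added the numbers.",
    "4",
)
```

In this sketch, anything that influences `reasoning_tokens` without changing `visible_text()` is invisible to someone who reviews only the output.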

This architecture exposes the following attack surfaces, most of them unique to reasoning models:

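Attack surface	Description	Unique to reasoning models?
Reasoning chain manipulation	Injecting false premises or faulty logic into the CoT	Yes
Hidden CoT exploitation	Attacking the reasoning trace that is not visible	Yes
Reasoning budget exhaustion	Forcing excessive computation in the thinking phase	Yes
Verifier / reward model gaming	Exploiting the models that score reasoning quality	Yes
Output-level jailbreaks	Traditional prompt injection against the final output	No (but reasoning changes the dynamics)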
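As one concrete illustration of the reasoning budget exhaustion row, the sketch below flags requests whose thinking phase is disproportionately large compared with the prompt. It is a minimal sketch under stated assumptions: the function names, the ratio metric, and both thresholds are hypothetical and chosen only for illustration, and the token counts are assumed to come from whatever usage metadata a deployment records.

```python
def amplification(prompt_tokens: int, reasoning_tokens: int) -> float:
    """Reasoning tokens generated per prompt token (hypothetical metric)."""
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    return reasoning_tokens / prompt_tokens


def exceeds_budget(
    prompt_tokens: int,
    reasoning_tokens: int,
    max_ratio: float = 50.0,           # illustrative threshold, not a recommendation
    max_reasoning_tokens: int = 8000,  # illustrative absolute cap
) -> bool:
    """Return True if the request looks like a cost amplification attempt."""
    if reasoning_tokens > max_reasoning_tokens:
        return True
    return amplification(prompt_tokens, reasoning_tokens) > max_ratio


# A 20-token prompt that triggers 4000 reasoning tokens is a 200x amplification.
suspicious = exceeds_budget(20, 4000)

# A 1000-token prompt with 2000 reasoning tokens stays under both limits.
normal = exceeds_budget(1000, 2000)
```

A check like this only detects amplification after the fact. Capping the thinking phase up front, where the serving stack allows it, is a separate design choice.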