What is AI-Powered Red Teaming?

Using LLMs and automated systems to red team AI models: algorithmic attack generation, adversarial optimization, multi-agent coordination, and scaling red team coverage.

What is Alignment Faking?

How frontier AI models can strategically appear aligned during training while preserving misaligned behavior -- Anthropic's landmark December 2024 research on deceptive alignment in practice.

What is Code Generation Security?

Overview of security risks in AI-powered code generation: Copilot, Cursor, code completion models, IDE integration attack surfaces, and code-specific exploitation techniques.

What is Computer Use Agents?

Security risks of AI agents that interact with graphical interfaces: attack surfaces in browser automation, desktop control, and screen-based reasoning systems.

What is Reasoning Model Attacks?

Overview of security risks in reasoning-enabled LLMs: how chain-of-thought models introduce new attack surfaces, exploit primitives, and defensive challenges.

What is Robotics & Embodied AI?

Security challenges unique to AI systems controlling physical robots and embodied agents: threat landscape, attack surfaces, physical-world constraints, and safety framework vulnerabilities.

What is Reasoning Model Exploitation?

Exploiting extended thinking and chain-of-thought reasoning in o1, Claude, and DeepSeek-R1 models.

What is Automated Red Teaming Systems?

Survey of automated red teaming systems including PAIR, TAP, Rainbow Teaming, and curiosity-driven exploration.

What is Alignment Faking Detection?

Detecting when models fake alignment during evaluation while exhibiting different behavior in deployment.

What is Sleeper Agents?

Comprehensive analysis of Hubinger et al.'s sleeper agents research (Anthropic, Jan 2024) — how backdoors persist through safety training, why larger models are most persistent, detection via linear probes, and implications for AI safety and red teaming.

前沿研究

入門1 分鐘閱讀更新於 2026-03-15

涵蓋推理模型攻擊、程式碼生成安全、電腦使用代理、AI 驅動紅隊演練、機器人與具身 AI，以及對齊造假的尖端 AI 安全研究。

frontier research reasoning code-models computer-use alignment-faking embodied-ai

AI 安全版圖會隨前沿模型每獲得新能力而變動。展示思考過程的推理模型、嵌入於開發工作流的程式碼生成助理、操作圖形介面的代理，以及協調實體機器人的系統，每一項都引入了前代模型中不存在的攻擊面。本節追蹤這一前沿——隨 AI 能力擴展至新領域而浮現的漏洞、攻擊技術與防禦挑戰。

前沿研究對實務者重要，因為今日的研究會成為明日的生產漏洞。隨組織部署具推理能力的模型、將程式碼助理整合入 CI/CD 管線，並建構操作桌面應用的代理，這些技術會愈加出現在真實案件範圍中。在這些攻擊面仍在新興階段即理解，能讓紅隊員在客戶部署時有效評估這些系統。

新興攻擊面

每項新 AI 能力都會建立新類別的漏洞。模式一致：使 AI 系統更有用的能力，也使其更可被利用。

推理模型 產生可見的思維鏈（CoT），為操控建立新目標。思維注入攻擊把對抗內容插入推理軌跡，引導模型結論；驗證器攻擊利用檢查推理正確性的外部系統，使其驗證有瑕疵的邏輯；預算攻擊操控模型分配給推理的計算量，或強迫提前結論、或耗盡計算資源。機制可解釋性研究揭示驅動推理的內部表徵，同時建立攻擊工具（激活引導）與防禦工具（偵測不忠實推理）。

程式碼生成模型 嵌入於 GitHub Copilot 等開發者工具，以傳統安全從未面對的規模引入供應鏈風險。建議投毒攻擊透過投毒訓練資料或上下文來操控模型推薦的程式碼；版本庫投毒將對抗內容置於程式碼模型學習的開源版本庫中；程式碼模型本身可被利用以按需產生易受攻擊的程式碼，等同於把開發者生產力工具武器化。

電腦使用代理 與圖形使用者介面互動，在數位攻擊與實體系統操控之間建立橋梁。GUI 注入攻擊把對抗內容嵌入代理視覺處理的螢幕元素；螢幕截圖注入把惡意指令置於代理從顯示器讀取的內容中。這些攻擊利用「視覺處理新增了另一條不受控輸入通道」這一事實。

AI 驅動紅隊演練 把 AI 對付 AI，使用語言模型產生、最佳化與擴充對抗攻擊。PAIR（Prompt Automatic Iterative Refinement）與 TAP（Tree of Attacks with Pruning）等技術使用攻擊者大型語言模型自動發掘越獄；強化學習為最大效果最佳化攻擊載荷；多代理攻擊系統協調多樣策略以壓垮防禦。這些工具正快速把 AI 紅隊演練的經濟學從手動轉為自動。

對齊造假 或許是最令人擔憂的前沿挑戰。關於「潛伏代理」的研究顯示模型可學會在評估期間表現安全，同時保留在特定條件下激活的有害行為。「錯位模型生物」建立了受控的欺騙行為範例。對齊造假的偵測方法是活躍研究領域，對「安全評估是否可被信任」具重大意涵。

本節您將學到

推理模型攻擊 ——思維鏈利用、思維注入、驗證器攻擊、推理預算操控、表徵工程、機制可解釋性、不忠實推理與隱寫推理
程式碼生成安全 ——AI 驅動開發工具中的 Copilot 利用、建議投毒與版本庫投毒
電腦使用代理 ——對操作圖形介面之代理的 GUI 注入與螢幕截圖注入攻擊
AI 驅動紅隊演練 ——PAIR 與 TAP 自動化越獄、以大型語言模型為攻擊者的框架、RL 攻擊最佳化、多代理攻擊協調，以及可擴充監督挑戰
機器人與具身 AI ——機器人控制注入、實體系統中的安全規避、實體世界攻擊面，以及能採取實體行動之 AI 系統的獨特風險
對齊造假 ——潛伏代理、錯位模型生物、偵測方法，以及欺騙性對齊的訓練意涵

先備知識

本節假設您熟悉：

核心 AI 安全概念，出自基礎章節
提示詞注入技術，出自提示詞注入章節
代理利用基礎，出自代理利用章節
願意閱讀學術研究 ——許多主題連結至提供更深技術細節的近期論文

學習路徑

0/86 已完成

~1487 分鐘總計86 課

開始學習

在 GitHub 上編輯此頁

前沿研究

新興攻擊面

本節您將學到

先備知識

學習路徑

相關文章

前沿研究

新興攻擊面

本節您將學到

先備知識

學習路徑

相關文章