What are Quantization Exploits?

Hands-on lab exploring how model quantization degrades safety alignment, with techniques to find and exploit precision-related vulnerabilities.

What is RLHF Reward Hacking?

Hands-on lab exploring how to game reward models used in RLHF alignment, exposing gaps between reward proxy signals and actual safety objectives.

What is GUI Agent Exploitation?

Hands-on lab exploring attack vectors against GUI-based computer use agents that interact with screens through vision and action primitives.

What is Multi-Agent Warfare?

Hands-on lab building and deploying coordinated multi-agent attack systems that divide red teaming tasks across specialized attacking agents.

What is Watermark Removal?

Hands-on lab exploring techniques for detecting and removing statistical watermarks embedded in AI-generated text, and evaluating watermark robustness.

What is Novel Jailbreak Research?

Systematic methodology for discovering new jailbreak techniques against large language models. Learn to identify unexplored attack surfaces, develop novel attack vectors, and validate findings with scientific rigor.

What is Adversarial Suffix Optimization?

Implement GCG-style adversarial suffix attacks that automatically discover token sequences causing language models to comply with harmful requests. Covers gradient-based optimization, transferability analysis, and defense evaluation.

What is Emergent Capability Probing?

Systematically test large language models for undocumented capabilities including hidden knowledge, unreported skills, and behaviors that emerge only under specific conditions. Build a structured probing framework for capability discovery.

What is Alignment Stress Testing?

Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.

What is the Create a Safety Benchmark lab?

Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.

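Expert AI Red Team Labs

Expert, 2 min read, updated 2026-03-13

Advanced labs that tackle cutting-edge AI security challenges, including quantization exploits, reward hacking, agent exploitation, multi-agent attacks, and watermark removal.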

lab expert overview

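Overview

These expert labs go beyond standard prompt injection and jailbreaking. Each lab targets a specific architectural weakness in modern AI systems, ranging from numerical instability introduced by quantization to vulnerabilities that emerge in multi-agent orchestration.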
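As a small illustration of the numerical effect mentioned above, the following sketch measures the round-trip error of symmetric int8 quantization on a list of synthetic weights. It is a minimal example under stated assumptions: the function names (`quantize_int8`, `dequantize`, `max_round_trip_error`) and the Gaussian test weights are hypothetical and are not taken from any lab. It shows only rounding error, not any change in model behavior.

```python
import random


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # The scale is the float step size represented by one integer code.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def max_round_trip_error(values):
    """Return (largest absolute error after quantize + dequantize, scale used)."""
    codes, scale = quantize_int8(values)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored)), scale


# Synthetic "weights" (an assumption for illustration, not real model weights).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1000)]
error, scale = max_round_trip_error(weights)
print(f"no outlier:   scale={scale:.6f} max_error={error:.6f}")

# One large outlier stretches the scale, so every other weight is stored more coarsely.
weights_with_outlier = weights + [1.0]
outlier_error, outlier_scale = max_round_trip_error(weights_with_outlier)
print(f"with outlier: scale={outlier_scale:.6f} max_error={outlier_error:.6f}")
```

The rounding error of each weight is bounded by half the scale, so anything that enlarges the scale (such as a single outlier under per-tensor scaling) reduces the precision of all other values.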

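Prerequisites

Completion of all intermediate labs
Familiarity with PyTorch or JAX for model manipulation
Understanding of the RLHF training pipeline
Access to GPU resources (local or cloud) for several of the labs
Experience with at least one agent framework (LangChain, AutoGen, or similar)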

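Lab Index

Lab	Focus Area	Estimated Time
Quantization Exploits	Safety degradation from model compression	3-4 hours
RLHF Reward Hacking	Gaming reward models to bypass alignment	3-4 hours
GUI Agent Exploitation	Attacking screen-reading computer use agents	2-3 hours
Multi-Agent Warfare	Coordinating multiple attacking agents	3-4 hours
Watermark Removal	Detecting and removing AI-generated text watermarks	2-3 hours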

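What Makes These Expert-Level

Unlike the beginner and intermediate labs, which rely on well-documented attack patterns, the expert labs require you to:

Reason about model internals: understand at the architectural level why an attack works, not just that it works
Chain multiple techniques: combine attack primitives into novel sequences
Adapt to defenses: the targets include recent mitigations that you must bypass
Produce research-quality output: document findings with the rigor expected of a security advisory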

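References

"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" - Hubinger et al. (2024) - Research on persistent architectural vulnerabilities in safety-trained models
"Universal and Transferable Adversarial Attacks on Aligned Language Models" - Zou et al. (2023) - Foundational work on architectural attack surfaces in aligned models
"AI Risk Management Framework" - NIST (2023) - A framework for systematically assessing architectural AI risk
"Scalable Extraction of Training Data from (Production) Language Models" - Nasr et al. (2023) - Demonstrates expert-level extraction that draws on deep knowledge of model internals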

Knowledge Check

Why are expert-level AI vulnerabilities fundamentally different from beginner-level ones?

