# mechanistic-interpretability
14 articles tagged "mechanistic-interpretability"
Mechanistic Interpretability for Red Teaming
Using mechanistic interpretability to discover exploitable circuits and features in neural networks.
Mechanistic Interpretability for Security
Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.
LLM Internals
Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.
Activation Analysis & Hidden State Exploitation
Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.
Safety Neurons and Circuits
Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
Mechanistic Interpretability for Exploit Discovery
Use mechanistic interpretability tools to discover exploitable circuits and features in transformer models.
Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.
Mechanistic Interpretability for Red Teaming
Using mechanistic interpretability to discover exploitable circuits and features in neural networks.
Mechanistic Interpretability for Security
Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.
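LLM Internals
Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.
Activation Analysis & Hidden State Exploitation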
Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.
Safety Neurons and Circuits
Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
Mechanistic Interpretability for Exploit Discovery
Use mechanistic interpretability tools to discover exploitable circuits and features in transformer models.
Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.