# mechanistic-interpretability
7 articlestagged with “mechanistic-interpretability”
Mechanistic Interpretability for Red Teaming
Using mechanistic interpretability to discover exploitable circuits and features in neural networks.
Mechanistic Interpretability for Security
Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.
LLM Internals
Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.
Activation Analysis & Hidden State Exploitation
Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.
Safety Neurons and Circuits
Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
Mechanistic Interpretability for Exploit Discovery
Use mechanistic interpretability tools to discover exploitable circuits and features in transformer models.
Activation Manipulation & Safety Bypass
How identifying and suppressing safety-critical activations, refusal direction vectors, and activation steering techniques can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.