# mechanistic-interpretability

7 articles tagged with “mechanistic-interpretability”

Mechanistic Interpretability for Red Teaming

Using mechanistic interpretability to discover exploitable circuits and features in neural networks.

frontier-research, mechanistic-interpretability, red-teaming, circuits

Mechanistic Interpretability for Security

Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.

mechanistic-interpretability, circuits, features, attention, security

Expert

LLM Internals

Deep technical exploration of LLM internal mechanisms for exploit development, covering activation analysis, alignment bypass primitives, and embedding space exploitation.

internals, activations, alignment, embeddings, mechanistic-interpretability, exploit-development

Beginner

Activation Analysis & Hidden State Exploitation

Reading model internals via hidden state extraction, logprob probing, refusal direction analysis, and activation steering techniques.

activations, hidden-states, probing, information-leakage, mechanistic-interpretability

Expert

Safety Neurons and Circuits

Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.

safety-neurons, circuits, mechanistic-interpretability, refusal, ablation

Expert

Mechanistic Interpretability for Exploit Discovery

Use mechanistic interpretability tools to discover exploitable circuits and features in transformer models.

labs, mechanistic-interpretability, exploit-discovery, expert

Expert

Activation Manipulation & Safety Bypass

How identifying and suppressing safety-critical activations, through refusal direction vectors and activation steering techniques, can bypass safety alignment with near-100% success rates, including the IRIS technique from NAACL 2025.

activation-steering, refusal-direction, representation-engineering, IRIS, safety-bypass, mechanistic-interpretability

Advanced