# circuits

3 articles tagged with “circuits”

Mechanistic Interpretability for Red Teaming

Using mechanistic interpretability to discover exploitable circuits and features in neural networks.

frontier-research, mechanistic-interpretability, red-teaming, circuits

Mechanistic Interpretability for Security

Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.

mechanistic-interpretability, circuits, features, attention, security

Expert

Safety Neurons and Circuits

Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.

safety-neurons, circuits, mechanistic-interpretability, refusal, ablation

Expert