# circuits
3 articles tagged with “circuits”
Mechanistic Interpretability for Red Teaming
Using mechanistic interpretability to discover exploitable circuits and features in neural networks.
frontier-research, mechanistic-interpretability, red-teaming, circuits
Mechanistic Interpretability for Security
Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.
mechanistic-interpretability, circuits, features, attention, security
Safety Neurons and Circuits
Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
safety-neurons, circuits, mechanistic-interpretability, refusal, ablation