# circuits
6 articles tagged "circuits"
Mechanistic Interpretability for Red Teaming
Using mechanistic interpretability to discover exploitable circuits and features in neural networks.
Mechanistic Interpretability for Security
Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.
Safety Neurons and Circuits
Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.
Mechanistic Interpretability for Red Teaming
Using mechanistic interpretability to discover exploitable circuits and features in neural networks.
Mechanistic Interpretability for Security
Understanding model circuits to find vulnerabilities: feature identification, circuit analysis, attention pattern exploitation, and using mechanistic interpretability for offensive and defensive AI security.
Safety Neurons and Circuits
Identifying and analyzing safety-critical model components: refusal neurons, safety circuits, and techniques for locating and manipulating the specific weights responsible for safety behavior.