1 article tagged with “sae”
Using sparse autoencoders and mechanistic interpretability to identify and manipulate safety-relevant features.