1 article tagged with “safety-analysis”
Research on using probing classifiers to analyze safety-relevant representations in language models.