1 article tagged with “probing-classifiers”
Research on using probing classifiers to analyze safety-relevant representations in language models.