1 article tagged with “attack-design”
Using interpretability insights to design more effective and targeted attacks on language models.