1 articletagged with “alignment-research”
Deliberately creating misaligned models for study: methodology, threat model instantiation, experimental frameworks, and what model organisms reveal about AI safety failures.