Content Moderation System Attacks
Attacking AI-powered content moderation systems. Adversarial content that bypasses classifiers, evasion techniques for text and image filters, and the security implications of unreliable moderation at platform scale.
content-moderationtrust-safetybypass