# content-moderation
5 articles tagged with “content-moderation”
## Content Moderation System Attacks
Attacking AI-powered content moderation systems. Adversarial content that bypasses classifiers, evasion techniques for text and image filters, and the security implications of unreliable moderation at platform scale.
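A minimal sketch of the evasion class this article covers: a naive substring blocklist is defeated by invisible Unicode characters, and a normalization pass restores detection. The blocklist terms and helper names are hypothetical, chosen only for illustration.

```python
import unicodedata

BLOCKLIST = {"spam", "scam"}  # hypothetical toy blocklist


def naive_filter(text: str) -> bool:
    """Return True if the text is flagged by simple substring matching."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def normalize(text: str) -> str:
    """Fold compatibility forms (NFKC) and strip format characters
    such as zero-width spaces (Unicode category Cf)."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# A zero-width space inside "scam" slips past the naive substring check...
evasive = "this is a sc\u200bam"
assert naive_filter(evasive) is False
# ...but normalizing the input first restores the match.
assert naive_filter(normalize(evasive)) is True
```

Real classifiers are harder to fool than a substring check, but the same pattern (perturb the surface form, preserve the meaning) drives most text-filter evasion.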
## Media & Content AI Security
Security risks in media AI — covering content moderation attacks, recommendation algorithm manipulation, deepfake generation, synthetic media detection evasion, and editorial AI exploitation.
## Content Moderation AI Platform Assessment
Assess an AI content moderation system for bypass techniques, false negative exploitation, and bias.
## Output Content Classifier
Step-by-step walkthrough for building a classifier to filter harmful LLM outputs, covering taxonomy definition, multi-label classification, threshold calibration, and deployment as a real-time output gate.
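The gate described above can be sketched in a few lines: a multi-label scorer returns one score per taxonomy category, and per-category thresholds (fixed during calibration) decide whether an LLM output is released. The taxonomy, cue lists, and threshold values below are assumptions standing in for a trained classifier.

```python
TAXONOMY = ["toxicity", "pii", "self_harm"]  # assumed taxonomy
THRESHOLDS = {"toxicity": 0.5, "pii": 0.3, "self_harm": 0.2}  # assumed calibrated values


def score_output(text: str) -> dict:
    """Stand-in for a multi-label classifier; returns a score per label."""
    cues = {
        "toxicity": ["idiot", "hate"],
        "pii": ["ssn", "credit card"],
        "self_harm": ["hurt myself"],
    }
    lowered = text.lower()
    return {
        label: min(1.0, sum(0.5 for cue in cues[label] if cue in lowered))
        for label in TAXONOMY
    }


def gate(text: str):
    """Block the output if any category score crosses its threshold."""
    scores = score_output(text)
    flagged = [c for c in TAXONOMY if scores[c] >= THRESHOLDS[c]]
    return (len(flagged) == 0, flagged)


allowed, flagged = gate("My SSN is 123-45-6789, you idiot")
assert allowed is False and set(flagged) == {"toxicity", "pii"}
```

Separate thresholds per category matter because the cost of a false negative differs sharply across labels; a single global cutoff either over-blocks benign text or under-blocks the highest-risk categories.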
## Toxicity Scoring Pipeline
Step-by-step walkthrough for building a toxicity scoring pipeline for LLM output filtering, covering model selection, multi-dimensional scoring, threshold calibration, and production deployment with real-time scoring.
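The threshold-calibration step mentioned above can be illustrated with a small sketch: given scores on a labeled validation set, pick the lowest threshold whose false-positive rate on benign examples stays within a budget. The function name, data, and FPR budget are illustrative, not from the article.

```python
def calibrate_threshold(scores, labels, max_fpr=0.05):
    """Smallest threshold t such that the share of benign examples
    scoring above t is at most max_fpr. Flag when score > t."""
    benign = sorted(s for s, is_toxic in zip(scores, labels) if not is_toxic)
    allowed = int(max_fpr * len(benign))  # benign examples we may mis-flag
    return benign[len(benign) - allowed - 1]


# Synthetic validation set: 20 benign scores and a few toxic ones.
benign_scores = [i / 100 for i in range(20)]  # 0.00 .. 0.19
toxic_scores = [0.8, 0.9, 0.95]
scores = benign_scores + toxic_scores
labels = [False] * 20 + [True] * 3

t = calibrate_threshold(scores, labels, max_fpr=0.10)
assert t == 0.17
# At this threshold every toxic example is flagged and FPR stays at 10%.
assert all(s > t for s in toxic_scores)
assert sum(s > t for s in benign_scores) / 20 <= 0.10
```

Calibrating against an FPR budget rather than eyeballing a cutoff is what makes the threshold defensible in production, and re-running the calibration on fresh labeled traffic catches drift in the score distribution.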