# detection
48 articles tagged with “detection”
Automated AI Incident Triage
Building automated triage systems for AI security incidents using rule-based engines, anomaly detection, and LLM-assisted classification.
Log Analysis for Injection Detection
Analyzing application and model logs to detect prompt injection attacks including pattern matching, anomaly detection, and behavioral indicators.
Backdoor Detection in Fine-Tuned Models
Detecting backdoors in fine-tuned AI models: activation analysis, trigger scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.
Model Tampering Detection (AI Forensics IR)
Detecting unauthorized modifications to model weights, configurations, and serving infrastructure through integrity verification and behavioral analysis.
Monitoring & Detection Assessment
Test your understanding of AI security monitoring, anomaly detection, logging strategies, and incident detection for LLM-based applications with 9 intermediate-level questions.
Capstone: Build an AI Incident Response System
Design and implement an incident response system purpose-built for AI security incidents including prompt injection breaches, model manipulation, and data exfiltration through LLM applications.
Capstone: Build a Prompt Injection Detection Scanner
Build a production-grade prompt injection scanner that combines static analysis, ML classification, and runtime monitoring to detect injection attacks across LLM applications.
Deepfake Incidents and Detection
Analysis of significant deepfake incidents including political disinformation, financial fraud, non-consensual content, and corporate impersonation. Covers detection techniques, defensive technologies, and the evolving adversarial landscape.
Logging and Monitoring for Cloud AI Services
Implementing comprehensive logging and monitoring for cloud AI services including prompt/response capture, anomaly detection, and security-focused observability across AWS, Azure, and GCP.
Defense Challenge: Detection Engineering
Challenge focused on building detection systems for prompt injection, with scoring based on true positive rate and false positive rate.
AI Watermarking and Attacks
Current AI watermarking schemes for model outputs and training data, their security properties, and known attacks that remove, forge, or evade watermarks.
Training Data Watermark Attacks
Attacking and evading watermarking schemes designed to detect training data usage and enforce data licensing compliance.
Watermarking & AI-Generated Text Detection
Statistical watermarking schemes for LLM outputs, AI-generated text detectors, their cryptographic foundations, and systematic techniques for evading or removing watermarks.
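The statistical test at the heart of many green-list watermarking schemes can be sketched in a few lines. This is a minimal illustration, not any specific scheme's implementation: `is_green` stands in for the scheme's pseudorandom green-list membership function, and `gamma` for the expected green fraction under unwatermarked text.

```python
import math

def greenlist_z_score(tokens, is_green, gamma=0.5):
    """One-proportion z-test for a green-list watermark.

    Under the null hypothesis (no watermark), each token lands in the
    green list independently with probability gamma; a large positive
    z-score suggests the text was generated with the watermark on.
    """
    n = len(tokens)
    green = sum(1 for t in tokens if is_green(t))
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

For 100 tokens that are all green with `gamma=0.5`, the score is 10, far beyond any reasonable detection threshold; unwatermarked text hovers near 0.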
Canary Tokens for Injection Detection
Implementing canary token systems that detect prompt injection by monitoring for canary leakage in model outputs.
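The core of a canary-token system is small: plant a high-entropy string in the system prompt or context, then alert if it ever appears in model output. A minimal sketch (function names here are illustrative, not from any particular library):

```python
import secrets

def make_canary(prefix: str = "CANARY") -> str:
    # High-entropy token that will not occur in output by chance.
    return f"{prefix}-{secrets.token_hex(8)}"

def output_leaks_canary(output: str, canaries: list[str]) -> bool:
    # Fire an alert if any planted canary appears verbatim in the output.
    return any(c in output for c in canaries)
```

In practice the canary is embedded alongside an instruction never to repeat it, so any leakage indicates the model was induced to disclose privileged context.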
LLM Honeypot Techniques
Deploying honeypot prompts and trap mechanisms to detect and characterize adversarial probing of LLM systems.
Watermarking LLM Outputs for Provenance
Advanced techniques for watermarking LLM-generated text to establish provenance, including deployment architectures, multi-bit encoding schemes, robustness considerations, and the role of watermarking in AI security and accountability frameworks.
Canary Word Monitoring Systems
Deploying canary words in system prompts and documents to detect and alert on prompt injection and leakage.
User Intent Classification for Safety
Building user intent classifiers that distinguish legitimate requests from adversarial manipulation attempts.
Embedding Watermarking Attacks
Attacking and evading embedding watermarking schemes used for content tracking and intellectual property protection.
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
Alignment Faking Detection
Detecting when models fake alignment during evaluation while exhibiting different behavior in deployment.
Sandbagging Detection in Capability Evaluations
Detecting when AI models deliberately underperform on capability evaluations to appear less capable.
Injection Detection Research
State-of-the-art research in injection detection including perplexity-based methods, classifier approaches, and ensemble techniques.
Lab: Backdoor Detection in Fine-Tuned Models
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
Lab: Neural Network Trojan Detection
Implement and test neural network trojan detection methods including activation clustering and spectral analysis.
LLM Watermark Detection and Removal
Detect and remove statistical watermarks from LLM-generated text while preserving content quality.
Lab: Hallucination Detection Basics
Learn to detect and trigger hallucinations in LLM outputs including factual errors, fabricated citations, and invented APIs.
Lab: Detecting Output Filters
Identify and characterize output filtering mechanisms in LLM applications through systematic response analysis.
CTF: Fine-Tune Detective
Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify compromised models before deployment.
Time Bomb Defusal: Sleeper Agent Detection
Detect and neutralize a hidden sleeper-agent trigger in a fine-tuned model before it activates.
Lab: Deploy Honeypot AI
Build and deploy a decoy AI system designed to detect, analyze, and characterize attackers targeting AI applications. Learn honeypot design, telemetry collection, attacker profiling, and threat intelligence generation for AI-specific threats.
Alignment Faking Detection Lab
Implement detection methods for alignment faking behaviors where models behave differently during evaluation versus deployment.
Lab: AI Watermark Detection & Removal
Hands-on lab exploring techniques for detecting and removing statistical watermarks embedded in AI-generated text, and evaluating watermark robustness.
Guardrail Latency-Based Detection
Use timing side channels to identify and characterize guardrail implementations in LLM applications.
Endpoint Monitoring Strategies
Implementing comprehensive monitoring for model serving endpoints to detect attacks, anomalies, and drift in real-time.
Shadow Model Detection
Detecting and preventing unauthorized shadow model deployments that bypass security controls and compliance requirements.
Shadow AI Detection
Finding unauthorized AI deployments in organizations: detection methods, common shadow AI patterns, and assessment of unmanaged AI risks.
System Prompt Extraction Techniques
Catalog of system prompt extraction methods against LLM-powered applications: direct attacks, indirect techniques, multi-turn strategies, and defensive evasion.
Sleeper Agent Detection Walkthrough
Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.
LLM Watermark Analysis Walkthrough
Walkthrough of detecting and analyzing watermarks in LLM-generated text using statistical methods.
Behavioral Anomaly Detection for LLMs
Implement behavioral anomaly detection that identifies when model outputs deviate from expected safety profiles.
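One simple baseline for behavioral anomaly detection is a z-score test against a window of known-good behavior: score each interaction on some numeric feature (response length, toxicity score, refusal rate) and flag large deviations. A stdlib-only sketch under that assumption:

```python
from statistics import mean, stdev

def anomaly_flags(values, baseline, k=3.0):
    """Flag observations more than k standard deviations from the
    baseline mean. `baseline` is a window of scores from known-good
    traffic; `values` are new observations to screen."""
    m, s = mean(baseline), stdev(baseline)
    return [abs(v - m) > k * s for v in values]
```

Real systems layer richer features (embedding drift, output-class distributions) on top, but the threshold-against-baseline pattern is the same.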
Canary Token Deployment
Step-by-step walkthrough for deploying canary tokens in LLM system prompts and context to detect prompt injection and data exfiltration attempts, covering token generation, placement strategies, monitoring, and alerting.
Hallucination Detection
Step-by-step walkthrough for detecting and flagging hallucinated content in LLM outputs, covering factual grounding checks, self-consistency verification, source attribution validation, and confidence scoring.
Prompt Classifier Training
Step-by-step walkthrough for training a machine learning classifier to detect malicious prompts, covering dataset curation, feature engineering, model selection, training pipeline, evaluation, and deployment as a real-time detection service.
ML-Based Prompt Injection Detection Systems
Walkthrough for building and deploying ML-based prompt injection detection systems, covering training data collection, feature engineering, model architecture selection, threshold tuning, production deployment, and continuous improvement.
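Before reaching for an ML classifier, it helps to see the shape of the problem with a crude rule-based stand-in: score a prompt by how many known injection phrasings it matches. The pattern list below is illustrative only; a trained classifier replaces both the rules and the scoring.

```python
import re

# Hypothetical rule set: phrasings common in injection attempts.
INJECTION_PATTERNS = [
    r"ignore (all |the )?previous instructions",
    r"you are now",
    r"reveal (your|the) system prompt",
]

def injection_score(prompt: str) -> float:
    # Fraction of rules that fire; a stand-in for a classifier's probability.
    hits = sum(bool(re.search(p, prompt, re.IGNORECASE))
               for p in INJECTION_PATTERNS)
    return hits / len(INJECTION_PATTERNS)

def is_suspicious(prompt: str, threshold: float = 0.3) -> bool:
    return injection_score(prompt) >= threshold
```

The threshold tuning, evaluation, and deployment concerns the walkthroughs above describe apply identically once the regex score is swapped for a model's output probability.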
Prompt Injection Honeypot Setup
Deploy honeypot prompts and canary data that detect and characterize prompt injection attempts.
Real-Time Attack Detection System
Build a real-time attack detection system that monitors LLM interactions for adversarial patterns.
Testing Prompt Injection Defenses with Rebuff
Walkthrough for using Rebuff to test and evaluate prompt injection detection capabilities, covering installation, detection pipeline analysis, adversarial evasion testing, custom rule development, and benchmarking detection accuracy.