# detection
48 articles tagged with “detection”
Automated AI Incident Triage
Building automated triage systems for AI security incidents using rule-based engines, anomaly detection, and LLM-assisted classification.
Log Analysis for Injection Detection
Analyzing application and model logs to detect prompt injection attacks including pattern matching, anomaly detection, and behavioral indicators.
Backdoor Detection in Fine-Tuned Models
Detecting backdoors in fine-tuned AI models: activation analysis, trigger scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.
Model Tampering Detection (AI Forensics IR)
Detecting unauthorized modifications to model weights, configurations, and serving infrastructure through integrity verification and behavioral analysis.
Monitoring & Detection Assessment
Test your understanding of AI security monitoring, anomaly detection, logging strategies, and incident detection for LLM-based applications with 9 intermediate-level questions.
Capstone: Build an AI Incident Response System
Design and implement an incident response system purpose-built for AI security incidents including prompt injection breaches, model manipulation, and data exfiltration through LLM applications.
Capstone: Build a Prompt Injection Detection Scanner
Build a production-grade prompt injection scanner that combines static analysis, ML classification, and runtime monitoring to detect injection attacks across LLM applications.
Deepfake Incidents and Detection
Analysis of significant deepfake incidents including political disinformation, financial fraud, non-consensual content, and corporate impersonation. Covers detection techniques, defensive technologies, and the evolving adversarial landscape.
Logging and Monitoring for Cloud AI Services
Implementing comprehensive logging and monitoring for cloud AI services including prompt/response capture, anomaly detection, and security-focused observability across AWS, Azure, and GCP.
Defense Challenge: Detection Engineering
Challenge focused on building detection systems for prompt injection, with scoring based on true positive rate and false positive rate.
AI Watermarking and Attacks
Current AI watermarking schemes for model outputs and training data, their security properties, and known attacks that remove, forge, or evade watermarks.
Training Data Watermark Attacks
Attacking and evading watermarking schemes designed to detect training data usage and enforce data licensing compliance.
Watermarking & AI-Generated Text Detection
Statistical watermarking schemes for LLM outputs, AI-generated text detectors, their cryptographic foundations, and systematic techniques for evading or removing watermarks.
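The statistical test at the heart of many green-list watermarking schemes can be sketched in a few lines. This is a minimal illustration, not any specific scheme's implementation: `is_green` stands in for the scheme's pseudorandom green-list membership function, and `gamma` for the expected green fraction under unwatermarked text.

```python
import math

def greenlist_z_score(tokens, is_green, gamma=0.5):
    """One-proportion z-test for a green-list watermark.

    Under the null hypothesis (no watermark), each token lands in the
    green list independently with probability gamma; a large positive
    z-score suggests the text was generated with the watermark on.
    """
    n = len(tokens)
    green = sum(1 for t in tokens if is_green(t))
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

For 100 tokens that are all green with `gamma=0.5`, the score is 10, far beyond any reasonable detection threshold; unwatermarked text hovers near 0.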
Canary Tokens for Injection Detection
Implementing canary token systems that detect prompt injection by monitoring for canary leakage in model outputs.
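The core of a canary-token system is small: plant a high-entropy string in the system prompt or context, then alert if it ever appears in model output. A minimal sketch (function names here are illustrative, not from any particular library):

```python
import secrets

def make_canary(prefix: str = "CANARY") -> str:
    # High-entropy token that will not occur in output by chance.
    return f"{prefix}-{secrets.token_hex(8)}"

def output_leaks_canary(output: str, canaries: list[str]) -> bool:
    # Fire an alert if any planted canary appears verbatim in the output.
    return any(c in output for c in canaries)
```

In practice the canary is embedded alongside an instruction never to repeat it, so any leakage indicates the model was induced to disclose privileged context.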
LLM Honeypot Techniques
Deploying honeypot prompts and trap mechanisms to detect and characterize adversarial probing of LLM systems.
Watermarking LLM Outputs for Provenance
Advanced techniques for watermarking LLM-generated text to establish provenance, including deployment architectures, multi-bit encoding schemes, robustness considerations, and the role of watermarking in AI security and accountability frameworks.
Canary Word Monitoring Systems
Deploying canary words in system prompts and documents to detect and alert on prompt injection and leakage.
User Intent Classification for Safety
Building user intent classifiers that distinguish legitimate requests from adversarial manipulation attempts.
Embedding Watermarking Attacks
Attacking and evading embedding watermarking schemes used for content tracking and intellectual property protection.
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
Alignment Faking Detection
Detecting when models fake alignment during evaluation while exhibiting different behavior in deployment.
Sandbagging Detection in Capability Evaluations
Detecting when AI models deliberately underperform on capability evaluations to appear less capable.
Injection Detection Research
State-of-the-art research in injection detection including perplexity-based methods, classifier approaches, and ensemble techniques.
Lab: Backdoor Detection in Fine-Tuned Models
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
Lab: Neural Network Trojan Detection
Implement and test neural network trojan detection methods including activation clustering and spectral analysis.
LLM Watermark Detection and Removal
Detect and remove statistical watermarks from LLM-generated text while preserving content quality.
Lab: Hallucination Detection Basics
Learn to detect and trigger hallucinations in LLM outputs including factual errors, fabricated citations, and invented APIs.
Lab: Detecting Output Filters
Identify and characterize output filtering mechanisms in LLM applications through systematic response analysis.
CTF: Fine-Tune Detective
Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify compromised models before deployment.
Time Bomb Defusal: Sleeper Agent Detection
Detect and neutralize a hidden sleeper-agent trigger in a fine-tuned model before it activates.
Lab: Deploy Honeypot AI
Build and deploy a decoy AI system designed to detect, analyze, and characterize attackers targeting AI applications. Learn honeypot design, telemetry collection, attacker profiling, and threat intelligence generation for AI-specific threats.
Alignment Faking Detection Lab
Implement detection methods for alignment faking behaviors where models behave differently during evaluation versus deployment.
Lab: AI Watermark Detection & Removal
Hands-on lab exploring techniques for detecting and removing statistical watermarks embedded in AI-generated text, and evaluating watermark robustness.
Guardrail Latency-Based Detection
Use timing side channels to identify and characterize guardrail implementations in LLM applications.
Endpoint Monitoring Strategies
Implementing comprehensive monitoring for model serving endpoints to detect attacks, anomalies, and drift in real-time.
Shadow Model Detection
Detecting and preventing unauthorized shadow model deployments that bypass security controls and compliance requirements.
Shadow AI Detection
Finding unauthorized AI deployments in organizations: detection methods, common shadow AI patterns, and assessment of unmanaged AI risks.
System Prompt Extraction Techniques
Catalog of system prompt extraction methods against LLM-powered applications: direct attacks, indirect techniques, multi-turn strategies, and defensive evasion.
Sleeper Agent Detection Walkthrough
Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.
LLM Watermark Analysis Walkthrough
Walkthrough of detecting and analyzing watermarks in LLM-generated text using statistical methods.
Behavioral Anomaly Detection for LLMs
Implement behavioral anomaly detection that identifies when model outputs deviate from expected safety profiles.
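One simple baseline for behavioral anomaly detection is a z-score test against a window of known-good behavior: score each interaction on some numeric feature (response length, toxicity score, refusal rate) and flag large deviations. A stdlib-only sketch under that assumption:

```python
from statistics import mean, stdev

def anomaly_flags(values, baseline, k=3.0):
    """Flag observations more than k standard deviations from the
    baseline mean. `baseline` is a window of scores from known-good
    traffic; `values` are new observations to screen."""
    m, s = mean(baseline), stdev(baseline)
    return [abs(v - m) > k * s for v in values]
```

Real systems layer richer features (embedding drift, output-class distributions) on top, but the threshold-against-baseline pattern is the same.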
Canary Token Deployment
Step-by-step walkthrough for deploying canary tokens in LLM system prompts and context to detect prompt injection and data exfiltration attempts, covering token generation, placement strategies, monitoring, and alerting.
Hallucination Detection
Step-by-step walkthrough for detecting and flagging hallucinated content in LLM outputs, covering factual grounding checks, self-consistency verification, source attribution validation, and confidence scoring.
Prompt Classifier Training
Step-by-step walkthrough for training a machine learning classifier to detect malicious prompts, covering dataset curation, feature engineering, model selection, training pipeline, evaluation, and deployment as a real-time detection service.
ML-Based Prompt Injection Detection Systems
Walkthrough for building and deploying ML-based prompt injection detection systems, covering training data collection, feature engineering, model architecture selection, threshold tuning, production deployment, and continuous improvement.
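Before reaching for an ML classifier, it helps to see the shape of the problem with a crude rule-based stand-in: score a prompt by how many known injection phrasings it matches. The pattern list below is illustrative only; a trained classifier replaces both the rules and the scoring.

```python
import re

# Hypothetical rule set: phrasings common in injection attempts.
INJECTION_PATTERNS = [
    r"ignore (all |the )?previous instructions",
    r"you are now",
    r"reveal (your|the) system prompt",
]

def injection_score(prompt: str) -> float:
    # Fraction of rules that fire; a stand-in for a classifier's probability.
    hits = sum(bool(re.search(p, prompt, re.IGNORECASE))
               for p in INJECTION_PATTERNS)
    return hits / len(INJECTION_PATTERNS)

def is_suspicious(prompt: str, threshold: float = 0.3) -> bool:
    return injection_score(prompt) >= threshold
```

The threshold tuning, evaluation, and deployment concerns the walkthroughs above describe apply identically once the regex score is swapped for a model's output probability.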
Prompt Injection Honeypot Setup
Deploy honeypot prompts and canary data that detect and characterize prompt injection attempts.
Real-Time Attack Detection System
Build a real-time attack detection system that monitors LLM interactions for adversarial patterns.
Testing Prompt Injection Defenses with Rebuff
Walkthrough for using Rebuff to test and evaluate prompt injection detection capabilities, covering installation, detection pipeline analysis, adversarial evasion testing, custom rule development, and benchmarking detection accuracy.