# detection

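frontier-research alignment-faking detection safety

Alignment Faking Detection

Research techniques for detecting whether a model fakes alignment during training but behaves differently once deployed.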

frontier-research sandbagging capability-evaluation detection

Sandbagging Detection

Methods for detecting models that deliberately hide their capabilities (sandbagging).

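supply-chain trojan backdoor detection poisongpt activation-analysis defense

Trojan Model Detection

Techniques for detecting trojaned (backdoored) models, including activation analysis, trigger search, and model inversion.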

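research detection classification defense

Injection Detection Research

Frontier research on injection detection, including perplexity-based methods, classifier approaches, and ensemble techniques.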

lab backdoor detection forensics fine-tuning

Lab: Backdoor Detection in Fine-Tuned Models

Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.

labs neural-trojan detection advanced

Lab: Neural Network Trojan Detection

Implement and test neural network trojan detection methods, including activation clustering and spectral analysis.

labs watermark detection removal advanced

LLM Watermark Detection and Removal

Detect and remove statistical watermarks from LLM-generated text while preserving content quality.

labs hallucination detection beginner

Lab: Hallucination Detection Basics

Learn to detect and trigger hallucinations in LLM outputs, including factual errors, fabricated citations, and invented APIs.

Beginner

Lab: Detecting Output Filters

Identify and characterize output filtering mechanisms in LLM applications through systematic response analysis.

labs output-filtering detection beginner

Beginner

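CTF: Fine-Tune Detective

Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify poisoned models before deployment.

ctf fine-tuning backdoor detection advanced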

labs ctf sleeper-agent detection

Time Bomb Defusal: Sleeper Agent Detection

Detect and neutralize a sleeper agent behavior trigger hidden in a fine-tuned model before it activates.

lab expert honeypot deception detection hands-on

Lab: Deploy Honeypot AI

Build and deploy a decoy AI system designed to detect, analyze, and characterize attackers targeting AI applications. Learn honeypot design, telemetry collection, attacker profiling, and threat intelligence generation for AI-specific threats.

labs alignment-faking detection expert

Alignment Faking Detection Lab

Implement detection methods for alignment faking behaviors, where models behave differently during evaluation versus deployment.

lab expert watermarking detection hands-on

Lab: AI Watermark Detection & Removal

Hands-on lab exploring techniques for detecting and removing statistical watermarks embedded in AI-generated text, and evaluating watermark robustness.

labs guardrails latency detection intermediate

Latency-Based Guardrail Detection

Use timing side channels to identify and characterize guardrail implementations in LLM applications.

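llmops monitoring endpoints detection

Endpoint Monitoring Strategies

Build comprehensive monitoring for model-serving endpoints to detect attacks, anomalies, and drift in real time.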

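llmops shadow-model detection unauthorized

Shadow Model Detection

Detect and prevent unauthorized shadow model deployments that bypass security controls and compliance requirements.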

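shadow-ai unauthorized detection governance risk

Shadow AI Detection

Find unauthorized AI deployments in your organization: detection methods, common shadow AI patterns, and how to assess the risk of unmanaged AI.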

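system-prompt extraction prompt-injection automation detection tradecraft

System Prompt Extraction Techniques

A catalog of system prompt extraction methods for LLM applications: direct attacks, indirect techniques, multi-turn strategies, and detection evasion.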

walkthroughs sleeper-agents detection alignment

Sleeper Agent Detection Walkthrough

Walkthrough of detecting deceptive sleeper agent behaviors in fine-tuned language models.

walkthroughs watermark analysis detection

LLM Watermark Analysis Walkthrough

Walkthrough of detecting and analyzing watermarks in LLM-generated text using statistical methods.

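walkthroughs defense behavioral-anomaly detection

Behavioral Anomaly Detection for LLMs

Implement behavioral anomaly detection to identify when model outputs deviate from the expected safety profile.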

canary-tokens prompt-injection detection monitoring defense walkthrough

Canary Token Deployment

Step-by-step walkthrough for deploying canary tokens in LLM system prompts and context to detect prompt injection and data exfiltration attempts, covering token generation, placement strategies, monitoring, and alerting.

hallucination detection factual-grounding output-filtering defense walkthrough

Hallucination Detection

Step-by-step walkthrough for detecting and flagging hallucinated content in LLM outputs, covering factual grounding checks, self-consistency verification, source attribution validation, and confidence scoring.

classifier machine-learning prompt-injection detection training defense walkthrough

Prompt Classifier Training

Step-by-step walkthrough for training a machine learning classifier to detect malicious prompts, covering dataset curation, feature engineering, model selection, training pipeline, evaluation, and deployment as a real-time detection service.

prompt-injection machine-learning detection classifier defense walkthrough

ML-Based Prompt Injection Detection Systems

Walkthrough for building and deploying ML-based prompt injection detection systems, covering training data collection, feature engineering, model architecture selection, threshold tuning, production deployment, and continuous improvement.

walkthroughs defense honeypot detection

Prompt Injection Honeypot Deployment

Deploy honeypot prompts and canary data to detect and characterize prompt injection attempts.

defense realtime detection attack walkthroughs

Real-Time Attack Detection System Walkthrough

Build a real-time attack detection system that monitors LLM interactions for adversarial patterns.

rebuff prompt-injection detection defense-testing evasion walkthrough

Testing Prompt Injection Defenses with Rebuff

Walkthrough for using Rebuff to test and evaluate prompt injection detection capabilities, covering installation, detection pipeline analysis, adversarial evasion testing, custom rule development, and benchmarking detection accuracy.