AI Anomaly Detection
Detecting jailbreak attempts, unusual usage patterns, output drift, and embedding space anomalies in AI systems through statistical and ML-based methods.
The Anomaly Detection Problem for AI
Anomaly detection in traditional software is relatively straightforward: define "normal" (response time under 200ms, error rate below 1%), and alert when observations deviate. For AI systems, "normal" is much harder to define because the system's outputs are inherently variable: two identical prompts can produce different responses, and a wide range of topics and conversation styles are all legitimate.
Detecting Jailbreak Attempts
Jailbreak detection is the most immediately actionable form of AI anomaly detection. Several complementary approaches can be combined:
Classifier-Based Detection
Train a dedicated classifier to distinguish between benign and adversarial inputs.
| Approach | Pros | Cons |
|---|---|---|
| Fine-tuned transformer | High accuracy on known patterns | Misses novel attacks; requires labeled training data |
| Zero-shot classification | No training data needed; detects intent | Lower precision; high false positive rate |
| Ensemble classifier | Combines multiple signals for robustness | Higher latency; more complex deployment |
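The ensemble row can be sketched as a weighted combination of weak signals. Everything below is illustrative: the choice of signals, the normalization constants, the weights, and the decision threshold are all assumptions to be tuned on labeled data, not recommended values.

```python
def ensemble_score(perplexity: float, max_jailbreak_sim: float, input_len: int) -> float:
    """Combine three normalized signals into a single risk score in [0, 1]."""
    signals = [
        min(perplexity / 500.0, 1.0),   # garbled adversarial strings score high
        max_jailbreak_sim,              # cosine similarity to nearest known jailbreak
        min(input_len / 4000.0, 1.0),   # very long inputs suggest context stuffing
    ]
    weights = [0.4, 0.4, 0.2]           # assumed weights; tune on labeled data
    return sum(w * s for w, s in zip(weights, signals))

def is_adversarial(perplexity: float, sim: float, length: int,
                   threshold: float = 0.5) -> bool:
    return ensemble_score(perplexity, sim, length) >= threshold
```

Combining signals this way trades a little latency for robustness: an attack must evade several detectors at once rather than any single one.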
Perplexity-Based Detection
Adversarial inputs, particularly GCG-style suffixes and encoded payloads, often have abnormally high perplexity; the model finds them "surprising" because they do not resemble natural language.
```python
import math

import torch

def detect_adversarial_by_perplexity(text: str, threshold: float = 100.0) -> bool:
    """Flag inputs with unusually high perplexity.

    Assumes `tokenizer` and `model` (a causal LM) are loaded elsewhere.
    """
    tokens = tokenizer.encode(text)
    with torch.no_grad():
        outputs = model(torch.tensor([tokens]))
        log_probs = outputs.logits.log_softmax(dim=-1)
    # Perplexity is the exponential of the mean negative log-probability
    # of each token given its predecessors
    token_log_probs = [
        log_probs[0, i, tokens[i + 1]].item()
        for i in range(len(tokens) - 1)
    ]
    perplexity = math.exp(-sum(token_log_probs) / len(token_log_probs))
    return perplexity > threshold
```

Limitation: Perplexity-based detection catches garbled adversarial strings but misses natural-language jailbreaks that use fluent, low-perplexity text.
Semantic Similarity Detection
Compare incoming prompts against a database of known jailbreak templates using embedding similarity:
- Compute embeddings for all known jailbreak prompts
- For each incoming prompt, compute its embedding and find the nearest neighbors in the jailbreak database
- Flag prompts with cosine similarity above a threshold to any known jailbreak
Limitation: Effective against variations of known attacks but cannot detect genuinely novel techniques.
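A minimal sketch of the similarity check, assuming embeddings for both prompts and known jailbreaks are produced elsewhere by an external embedding model; the 0.85 threshold is an illustrative assumption:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def matches_known_jailbreak(prompt_emb: list[float],
                            jailbreak_embs: list[list[float]],
                            threshold: float = 0.85) -> bool:
    """Flag a prompt whose embedding is close to any known jailbreak embedding."""
    return any(cosine_similarity(prompt_emb, jb) >= threshold for jb in jailbreak_embs)
```

In production the linear scan would be replaced by an approximate nearest-neighbor index, but the decision rule is the same.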
Behavioral Pattern Detection
Rather than classifying individual inputs, track behavioral patterns across a user session:
| Pattern | Indicates | Detection Method |
|---|---|---|
| Rapidly iterating on similar prompts | Jailbreak search | Edit distance between consecutive prompts |
| Escalating from benign to sensitive topics | Crescendo attack | Topic classification per turn |
| Frequent role-play or persona requests | Role-based jailbreak | Intent classification for persona switching |
| Unusually long inputs | Context window exploitation | Input length monitoring |
| Encoded or obfuscated content | Evasion attempt | Entropy and character distribution analysis |
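The entropy check in the last row of the table can be sketched with Shannon entropy over the character distribution. The 4.8 bits-per-character threshold and minimum length are illustrative assumptions: English prose typically sits near 4 bits per character, while base64 or otherwise randomized payloads run higher.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Bits per character of the text's character distribution."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_encoded(text: str, entropy_threshold: float = 4.8, min_len: int = 20) -> bool:
    """Flag long inputs whose character distribution is unusually flat."""
    return len(text) >= min_len and shannon_entropy(text) > entropy_threshold
```

Character-distribution checks (e.g. the fraction of non-alphabetic characters) complement entropy, since short encoded payloads may not reach the entropy threshold.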
Detecting Unusual Usage Patterns
Beyond individual jailbreak attempts, aggregate usage patterns reveal systematic adversarial activity.
User-Level Anomalies
| Signal | Normal Range | Anomalous | Possible Cause |
|---|---|---|---|
| Requests per hour | 5-50 | 500+ | Automated scanning or extraction |
| Average input length | 20-200 tokens | 2000+ tokens | Context stuffing or payload delivery |
| Average output length | 50-500 tokens | 4000+ tokens | Extraction or data exfiltration |
| Unique topics per session | 1-5 | 50+ | Systematic probing |
| Refusal rate | 0-5% | 30%+ | Active jailbreak attempt |
| Tool call ratio | 10-30% | 90%+ | Tool abuse through injection |
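Signals like those in the table can be compared against each user's own history with a per-metric z-score. This is a sketch; the 3-standard-deviation threshold is an assumed starting point, and real deployments would use robust statistics (median, MAD) to resist baseline pollution.

```python
import statistics

def zscore_flags(current: dict[str, float],
                 history: dict[str, list[float]],
                 z_threshold: float = 3.0) -> list[str]:
    """Return the metrics whose current value deviates from this user's
    own history by more than z_threshold standard deviations."""
    flagged = []
    for metric, value in current.items():
        mean = statistics.mean(history[metric])
        stdev = statistics.stdev(history[metric]) or 1e-9  # guard zero variance
        if abs(value - mean) / stdev > z_threshold:
            flagged.append(metric)
    return flagged
```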
Temporal Anomalies
- Off-hours activity: Concentrated usage during hours when the user base is normally inactive
- Burst patterns: Short periods of intense activity followed by long pauses (characteristic of automated tools running in batches)
- Periodic patterns: Regular intervals between requests suggesting scripted access
- Session length anomalies: Very long sessions or sessions with an unusual number of turns
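The periodic-pattern signal above can be sketched by measuring how regular a session's inter-request intervals are: a low coefficient of variation (standard deviation divided by mean) suggests a script firing on a timer. The 0.1 threshold and minimum sample size are illustrative assumptions.

```python
import statistics

def looks_scripted(timestamps: list[float], cv_threshold: float = 0.1) -> bool:
    """Flag sessions whose inter-request intervals are suspiciously regular."""
    if len(timestamps) < 5:
        return False  # too few requests to judge
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(intervals)
    if mean <= 0:
        return False
    return statistics.stdev(intervals) / mean < cv_threshold
```

Human typing and reading produce irregular gaps, so legitimate sessions rarely trip this check; attackers can evade it by jittering their request timing, which is the mimicry evasion discussed later.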
Cohort-Based Anomalies
Compare individual users against their cohort rather than global baselines:
- A developer using a code assistant at 200 requests/hour may be normal for developers but anomalous for customer support agents
- Topic distribution should be compared against users with similar roles or use cases
- New accounts with behavior patterns matching known attack tools should be flagged
Output Drift Detection
Output drift occurs when a model's responses gradually change in character, quality, or alignment. This can indicate model degradation, poisoned fine-tuning, or RAG source corruption.
Types of Output Drift
| Drift Type | Description | Detection Method |
|---|---|---|
| Quality drift | Outputs become less coherent or less accurate over time | Automated quality scoring (perplexity, BLEU, semantic coherence) |
| Alignment drift | Outputs become less aligned with safety policies | Periodic safety benchmark evaluation |
| Topic drift | Outputs shift toward unexpected topics | Topic classifier applied to the output distribution |
| Tone drift | Output sentiment or formality changes | Sentiment analysis over rolling windows |
| Factuality drift | Outputs contain more ungrounded claims | Fact-checking against a knowledge base |
Drift Detection Methods
Statistical tests: Apply distribution comparison tests (KL divergence, Jensen-Shannon divergence, Kolmogorov-Smirnov test) to output feature distributions over time windows.
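A minimal sketch of the Jensen-Shannon test applied to output-length distributions; the histogram bin edges are illustrative assumptions, and any output feature (sentiment score, topic label counts) could be bucketed the same way:

```python
import math

def _kl(p: list[float], q: list[float]) -> float:
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in bits between two normalized histograms:
    0 for identical distributions, 1 for fully disjoint ones."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def length_histogram(lengths: list[int],
                     bins: tuple = (0, 100, 500, 1000, 4000, float("inf"))) -> list[float]:
    """Bucket output lengths into a normalized histogram."""
    counts = [0] * (len(bins) - 1)
    for n in lengths:
        for i in range(len(counts)):
            if bins[i] <= n < bins[i + 1]:
                counts[i] += 1
                break
    total = sum(counts) or 1
    return [c / total for c in counts]
```

Compare the current window's histogram against the baseline window's and alert when the divergence exceeds a calibrated threshold.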
Reference-based monitoring: Maintain a "golden set" of expected responses to canonical queries. Periodically re-evaluate the model against this golden set and alert on significant changes.
Embedding drift: Track the centroid and variance of output embeddings over time. A shifting centroid indicates systematic drift; increasing variance indicates decreasing consistency.
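The centroid-tracking idea can be sketched as follows, comparing two time windows of output embeddings; what counts as a significant shift is a calibration question left open here:

```python
import math

def centroid(embeddings: list[list[float]]) -> list[float]:
    """Component-wise mean of a window of embedding vectors."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]

def centroid_shift(window_a: list[list[float]],
                   window_b: list[list[float]]) -> float:
    """Euclidean distance between the centroids of two time windows;
    a growing value across successive windows indicates systematic drift."""
    ca, cb = centroid(window_a), centroid(window_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))
```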
Embedding Space Monitoring
Embedding-based monitoring provides a powerful, representation-level view of system behavior:
Input Embedding Monitoring
Plot input embeddings over time to detect shifts in what users are asking about:
- Cluster emergence: New clusters of inputs appearing that were not present historically may indicate coordinated adversarial activity
- Cluster drift: Existing clusters shifting location may indicate gradual topic evolution or manipulation
- Outlier detection: Inputs far from any existing cluster may be novel attacks
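Outlier detection against known input clusters can be sketched as a nearest-centroid distance check. The radius is an assumed, data-dependent parameter; in practice it would be set from the observed spread of each cluster.

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_embedding_outlier(emb: list[float],
                         cluster_centroids: list[list[float]],
                         radius: float = 1.0) -> bool:
    """An input embedding farther than `radius` from every known cluster
    centroid may be a novel attack worth human review."""
    return min(euclidean(emb, c) for c in cluster_centroids) > radius
```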
Output Embedding Monitoring
Track output embeddings to detect changes in what the model is producing:
- Alignment corridor: Define an expected region of embedding space for acceptable outputs. Outputs that fall outside this corridor may violate policy.
- Forbidden regions: Define regions of embedding space corresponding to known harmful content categories. Outputs near these regions trigger alerts.
- Similarity to known-harmful content: Compute similarity between outputs and a database of harmful examples.
RAG Source Monitoring
For retrieval-augmented generation systems, monitor the embedding space of retrieved documents:
- Source contamination: New documents appearing in the retrieval corpus that cluster near adversarial content
- Retrieval drift: Changes in which documents are most frequently retrieved
- Injection detection: Documents that contain embedded instructions (high similarity to prompt injection templates)
Building an Anomaly Detection Pipeline
Establish Baselines
Collect 2-4 weeks of normal operation data. Compute statistical baselines for all key metrics. Identify natural patterns (daily cycles, weekly cycles, seasonal patterns).
Select Detection Methods
For each threat type, choose appropriate detection methods. Combine multiple methods for defense in depth. Prioritize methods based on the most likely attack vectors for your system.
Set Initial Thresholds
Start with conservative thresholds (low sensitivity, minimal false positives). Use historical data to calibrate detection rates.
Deploy Monitoring
Implement real-time monitoring for critical signals and batch monitoring for trend analysis. Ensure monitoring does not introduce significant latency.
Tune and Iterate
Review alerts weekly. Adjust thresholds based on false positive/negative rates. Add new 偵測 rules as new attack patterns emerge. Remove rules that consistently produce false positives.
Evading Anomaly Detection
Red teamers should understand how anomaly detection can be circumvented:
- Baseline pollution: Gradually introduce adversarial patterns during the baseline collection period so they become "normal"
- Mimicry: Study normal user behavior and match your attack patterns to it (request timing, topic distribution, input length)
- Distributed attacks: Use multiple accounts or IP addresses to keep per-user metrics within normal ranges
- Gradual escalation: Change behavior slowly enough that drift detection does not trigger (the "boiling frog" approach)
- Noise injection: Mix adversarial requests with legitimate requests to dilute statistical signals
Related Topics
- AI Monitoring and Observability: the broader monitoring architecture
- Logging Architecture: capturing the data anomaly detection needs
- Input/Output Filtering: preventive controls that anomaly detection complements
- Prompt Shields: inline detection approaches
References
- "Detecting LLM Jailbreak Attempts Using Perplexity-Based Methods" - Jain et al. (2024) - Research on using perplexity as a signal for detecting adversarial LLM inputs
- "Embedding Space Anomaly Detection for LLM Safety" - Chen et al. (2025) - Methods for detecting harmful outputs through embedding space monitoring
- "Data Drift Detection for ML Systems" - Huyen (2024) - Comprehensive overview of drift detection methods applicable to AI monitoring
- "Baseline Defenses for Adversarial Attacks Against Aligned Language Models" - Alon & Kamfonas (2023) - Survey of detection approaches for adversarial inputs to LLMs
Why is perplexity-based detection effective against GCG-style adversarial suffixes but not natural-language jailbreaks?