AI Anomaly Detection
Detecting jailbreak attempts, unusual usage patterns, output drift, and embedding space anomalies in AI systems through statistical and ML-based methods.
The Anomaly Detection Problem for AI
Anomaly detection in traditional software is relatively straightforward: define "normal" (response time under 200ms, error rate below 1%), and alert when observations deviate. For AI systems, "normal" is much harder to define because the system's outputs are inherently variable: two identical prompts can produce different responses, and a wide range of topics and conversation styles are all legitimate.
Detecting Jailbreak Attempts
Jailbreak detection is the most immediately actionable form of AI anomaly detection. Several complementary approaches can be combined:
Classifier-Based Detection
Train a dedicated classifier to distinguish between benign and adversarial inputs.
| Approach | Pros | Cons |
|---|---|---|
| Fine-tuned transformer | High accuracy on known patterns | Misses novel attacks; requires labeled training data |
| Zero-shot classification | No training data needed; detects intent | Lower precision; high false positive rate |
| Ensemble classifier | Combines multiple signals for robustness | Higher latency; more complex deployment |
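The ensemble row can be sketched as a weighted combination of weak signals. Everything below is illustrative: the choice of signals, the normalization constants, the weights, and the decision threshold are all assumptions to be tuned on labeled data, not recommended values.

```python
def ensemble_score(perplexity: float, max_jailbreak_sim: float, input_len: int) -> float:
    """Combine three normalized signals into a single risk score in [0, 1]."""
    signals = [
        min(perplexity / 500.0, 1.0),   # garbled adversarial strings score high
        max_jailbreak_sim,              # cosine similarity to nearest known jailbreak
        min(input_len / 4000.0, 1.0),   # very long inputs suggest context stuffing
    ]
    weights = [0.4, 0.4, 0.2]           # assumed weights; tune on labeled data
    return sum(w * s for w, s in zip(weights, signals))

def is_adversarial(perplexity: float, sim: float, length: int,
                   threshold: float = 0.5) -> bool:
    return ensemble_score(perplexity, sim, length) >= threshold
```

Combining signals this way trades a little latency for robustness: an attack must evade several detectors at once rather than any single one.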
Perplexity-Based Detection
Adversarial inputs, particularly GCG-style suffixes and encoded payloads, often have abnormally high perplexity; the model finds them "surprising" because they do not resemble natural language.
```python
import math

import torch

def detect_adversarial_by_perplexity(text: str, threshold: float = 100.0) -> bool:
    """Flag inputs with unusually high perplexity.

    Assumes `tokenizer` and `model` (a causal LM) are loaded elsewhere.
    """
    tokens = tokenizer.encode(text)
    with torch.no_grad():
        outputs = model(torch.tensor([tokens]))
        log_probs = outputs.logits.log_softmax(dim=-1)
    # Perplexity is the exponential of the mean negative log-probability
    # of each token given its predecessors
    token_log_probs = [
        log_probs[0, i, tokens[i + 1]].item()
        for i in range(len(tokens) - 1)
    ]
    perplexity = math.exp(-sum(token_log_probs) / len(token_log_probs))
    return perplexity > threshold
```

Limitation: Perplexity-based detection catches garbled adversarial strings but misses natural-language jailbreaks that use fluent, low-perplexity text.
Semantic Similarity Detection
Compare incoming prompts against a database of known jailbreak templates using embedding similarity:
- Compute embeddings for all known jailbreak prompts
- For each incoming prompt, compute its embedding and find the nearest neighbors in the jailbreak database
- Flag prompts with cosine similarity above a threshold to any known jailbreak
Limitation: Effective against variations of known attacks but cannot detect genuinely novel techniques.
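A minimal sketch of the similarity check, assuming embeddings for both prompts and known jailbreaks are produced elsewhere by an external embedding model; the 0.85 threshold is an illustrative assumption:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def matches_known_jailbreak(prompt_emb: list[float],
                            jailbreak_embs: list[list[float]],
                            threshold: float = 0.85) -> bool:
    """Flag a prompt whose embedding is close to any known jailbreak embedding."""
    return any(cosine_similarity(prompt_emb, jb) >= threshold for jb in jailbreak_embs)
```

In production the linear scan would be replaced by an approximate nearest-neighbor index, but the decision rule is the same.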
Behavioral Pattern Detection
Rather than classifying individual inputs, track behavioral patterns across a user session:
| Pattern | Indicates | Detection Method |
|---|---|---|
| Rapidly iterating on similar prompts | Jailbreak search | Edit distance between consecutive prompts |
| Escalating from benign to sensitive topics | Crescendo attack | Topic classification per turn |
| Frequent role-play or persona requests | Role-based jailbreak | Intent classification for persona switching |
| Unusually long inputs | Context window exploitation | Input length monitoring |
| Encoded or obfuscated content | Evasion attempt | Entropy and character distribution analysis |
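The entropy check in the last row of the table can be sketched with Shannon entropy over the character distribution. The 4.8 bits-per-character threshold and minimum length are illustrative assumptions: English prose typically sits near 4 bits per character, while base64 or otherwise randomized payloads run higher.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Bits per character of the text's character distribution."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_encoded(text: str, entropy_threshold: float = 4.8, min_len: int = 20) -> bool:
    """Flag long inputs whose character distribution is unusually flat."""
    return len(text) >= min_len and shannon_entropy(text) > entropy_threshold
```

Character-distribution checks (e.g. the fraction of non-alphabetic characters) complement entropy, since short encoded payloads may not reach the entropy threshold.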
Detecting Unusual Usage Patterns
Beyond individual jailbreak attempts, aggregate usage patterns reveal systematic adversarial activity.
User-Level Anomalies
| Signal | Normal Range | Anomalous | Possible Cause |
|---|---|---|---|
| Requests per hour | 5-50 | 500+ | Automated scanning or extraction |
| Average input length | 20-200 tokens | 2000+ tokens | Context stuffing or payload delivery |
| Average output length | 50-500 tokens | 4000+ tokens | Extraction or data exfiltration |
| Unique topics per session | 1-5 | 50+ | Systematic probing |
| Refusal rate | 0-5% | 30%+ | Active jailbreak attempt |
| Tool call ratio | 10-30% | 90%+ | Tool abuse through injection |
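Signals like those in the table can be compared against each user's own history with a per-metric z-score. This is a sketch; the 3-standard-deviation threshold is an assumed starting point, and real deployments would use robust statistics (median, MAD) to resist baseline pollution.

```python
import statistics

def zscore_flags(current: dict[str, float],
                 history: dict[str, list[float]],
                 z_threshold: float = 3.0) -> list[str]:
    """Return the metrics whose current value deviates from this user's
    own history by more than z_threshold standard deviations."""
    flagged = []
    for metric, value in current.items():
        mean = statistics.mean(history[metric])
        stdev = statistics.stdev(history[metric]) or 1e-9  # guard zero variance
        if abs(value - mean) / stdev > z_threshold:
            flagged.append(metric)
    return flagged
```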
Temporal Anomalies
- Off-hours activity: Concentrated usage during hours when the user base is normally inactive
- Burst patterns: Short periods of intense activity followed by long pauses (characteristic of automated tools running in batches)
- Periodic patterns: Regular intervals between requests suggesting scripted access
- Session length anomalies: Very long sessions or sessions with an unusual number of turns
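The periodic-pattern signal above can be sketched by measuring how regular a session's inter-request intervals are: a low coefficient of variation (standard deviation divided by mean) suggests a script firing on a timer. The 0.1 threshold and minimum sample size are illustrative assumptions.

```python
import statistics

def looks_scripted(timestamps: list[float], cv_threshold: float = 0.1) -> bool:
    """Flag sessions whose inter-request intervals are suspiciously regular."""
    if len(timestamps) < 5:
        return False  # too few requests to judge
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(intervals)
    if mean <= 0:
        return False
    return statistics.stdev(intervals) / mean < cv_threshold
```

Human typing and reading produce irregular gaps, so legitimate sessions rarely trip this check; attackers can evade it by jittering their request timing, which is the mimicry evasion discussed later.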
Cohort-Based Anomalies
Compare individual users against their cohort rather than global baselines:
- A developer using a code assistant at 200 requests/hour may be normal for developers but anomalous for customer support agents
- Topic distribution should be compared against users with similar roles or use cases
- New accounts with behavior patterns matching known attack tools should be flagged
Output Drift Detection
Output drift occurs when a model's responses gradually change in character, quality, or alignment. This can indicate model degradation, poisoned fine-tuning, or RAG source corruption.
Types of Output Drift
| Drift Type | Description | Detection Method |
|---|---|---|
| Quality drift | Outputs become less coherent or less accurate over time | Automated quality scoring (perplexity, BLEU, semantic coherence) |
| Alignment drift | Outputs become less aligned with safety policies | Periodic safety benchmark evaluation |
| Topic drift | Outputs shift toward unexpected topics | Topic classifier applied to the output distribution |
| Tone drift | Output sentiment or formality changes | Sentiment analysis over rolling windows |
| Factuality drift | Outputs contain more ungrounded claims | Fact-checking against a knowledge base |
Drift Detection Methods
Statistical tests: Apply distribution comparison tests (KL divergence, Jensen-Shannon divergence, Kolmogorov-Smirnov test) to output feature distributions over time windows.
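A minimal sketch of the Jensen-Shannon test applied to output-length distributions; the histogram bin edges are illustrative assumptions, and any output feature (sentiment score, topic label counts) could be bucketed the same way:

```python
import math

def _kl(p: list[float], q: list[float]) -> float:
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in bits between two normalized histograms:
    0 for identical distributions, 1 for fully disjoint ones."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def length_histogram(lengths: list[int],
                     bins: tuple = (0, 100, 500, 1000, 4000, float("inf"))) -> list[float]:
    """Bucket output lengths into a normalized histogram."""
    counts = [0] * (len(bins) - 1)
    for n in lengths:
        for i in range(len(counts)):
            if bins[i] <= n < bins[i + 1]:
                counts[i] += 1
                break
    total = sum(counts) or 1
    return [c / total for c in counts]
```

Compare the current window's histogram against the baseline window's and alert when the divergence exceeds a calibrated threshold.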
Reference-based monitoring: Maintain a "golden set" of expected responses to canonical queries. Periodically re-evaluate the model against this golden set and alert on significant changes.
Embedding drift: Track the centroid and variance of output embeddings over time. A shifting centroid indicates systematic drift; increasing variance indicates decreasing consistency.
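The centroid-tracking idea can be sketched as follows, comparing two time windows of output embeddings; what counts as a significant shift is a calibration question left open here:

```python
import math

def centroid(embeddings: list[list[float]]) -> list[float]:
    """Component-wise mean of a window of embedding vectors."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]

def centroid_shift(window_a: list[list[float]],
                   window_b: list[list[float]]) -> float:
    """Euclidean distance between the centroids of two time windows;
    a growing value across successive windows indicates systematic drift."""
    ca, cb = centroid(window_a), centroid(window_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))
```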
Embedding Space Monitoring
Embedding-based monitoring provides a powerful, representation-level view of system behavior:
Input Embedding Monitoring
Plot input embeddings over time to detect shifts in what users are asking about:
- Cluster emergence: New clusters of inputs appearing that were not present historically may indicate coordinated adversarial activity
- Cluster drift: Existing clusters shifting location may indicate gradual topic evolution or manipulation
- Outlier detection: Inputs far from any existing cluster may be novel attacks
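Outlier detection against known input clusters can be sketched as a nearest-centroid distance check. The radius is an assumed, data-dependent parameter; in practice it would be set from the observed spread of each cluster.

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_embedding_outlier(emb: list[float],
                         cluster_centroids: list[list[float]],
                         radius: float = 1.0) -> bool:
    """An input embedding farther than `radius` from every known cluster
    centroid may be a novel attack worth human review."""
    return min(euclidean(emb, c) for c in cluster_centroids) > radius
```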
Output Embedding Monitoring
Track output embeddings to detect changes in what the model is producing:
- Alignment corridor: Define an expected region of embedding space for acceptable outputs. Outputs that fall outside this corridor may violate policy.
- Forbidden regions: Define regions of embedding space corresponding to known harmful content categories. Outputs near these regions trigger alerts.
- Similarity to known-harmful content: Compute similarity between outputs and a database of harmful examples.
RAG Source Monitoring
For retrieval-augmented generation systems, monitor the embedding space of retrieved documents:
- Source contamination: New documents appearing in the retrieval corpus that cluster near adversarial content
- Retrieval drift: Changes in which documents are most frequently retrieved
- Injection detection: Documents that contain embedded instructions (high similarity to prompt injection templates)
Building an Anomaly Detection Pipeline
Establish Baselines
Collect 2-4 weeks of normal operation data. Compute statistical baselines for all key metrics. Identify natural patterns (daily cycles, weekly cycles, seasonal patterns).
Select Detection Methods
For each threat type, choose appropriate detection methods. Combine multiple methods for defense in depth. Prioritize methods based on the most likely attack vectors for your system.
Set Initial Thresholds
Start with conservative thresholds (low sensitivity, minimal false positives). Use historical data to calibrate detection rates.
Deploy Monitoring
Implement real-time monitoring for critical signals and batch monitoring for trend analysis. Ensure monitoring does not introduce significant latency.
Tune and Iterate
Review alerts weekly. Adjust thresholds based on false positive/negative rates. Add new 偵測 rules as new attack patterns emerge. Remove rules that consistently produce false positives.
Evading Anomaly Detection
Red teamers should understand how anomaly detection can be circumvented:
- Baseline pollution: Gradually introduce adversarial patterns during the baseline collection period so they become "normal"
- Mimicry: Study normal user behavior and match your attack patterns to it (request timing, topic distribution, input length)
- Distributed attacks: Use multiple accounts or IP addresses to keep per-user metrics within normal ranges
- Gradual escalation: Change behavior slowly enough that drift detection does not trigger (the "boiling frog" approach)
- Noise injection: Mix adversarial requests with legitimate requests to dilute statistical signals
Related Topics
- AI Monitoring and Observability: the broader monitoring architecture
- Logging Architecture: capturing the data anomaly detection needs
- Input/Output Filtering: preventive controls that anomaly detection complements
- Prompt Shields: inline detection approaches
References
- "Detecting LLM Jailbreak Attempts Using Perplexity-Based Methods" - Jain et al. (2024) - Research on using perplexity as a signal for detecting adversarial LLM inputs
- "Embedding Space Anomaly Detection for LLM Safety" - Chen et al. (2025) - Methods for detecting harmful outputs through embedding space monitoring
- "Data Drift Detection for ML Systems" - Huyen (2024) - Comprehensive overview of drift detection methods applicable to AI monitoring
- "Baseline Defenses for Adversarial Attacks Against Aligned Language Models" - Alon & Kamfonas (2023) - Survey of detection approaches for adversarial inputs to LLMs
Why is perplexity-based detection effective against GCG-style adversarial suffixes but not natural-language jailbreaks?