AI Anomaly Detection
Detecting jailbreak attempts, unusual usage patterns, output drift, and embedding space anomalies in AI systems through statistical and ML-based methods.
The Anomaly Detection Problem for AI
Anomaly detection in traditional software is relatively straightforward: define "normal" (response time under 200ms, error rate below 1%), and alert when observations deviate. For AI systems, "normal" is much harder to define because the system's outputs are inherently variable — two identical prompts can produce different responses, and a wide range of topics and conversation styles are all legitimate.
Detecting Jailbreak Attempts
Jailbreak detection is the most immediately actionable form of AI anomaly detection. Several complementary approaches can be combined:
Classifier-Based Detection
Train a dedicated classifier to distinguish between benign and adversarial inputs.
| Approach | Pros | Cons |
|---|---|---|
| Fine-tuned transformer | High accuracy on known patterns | Misses novel attacks; requires labeled training data |
| Zero-shot classification | No training data needed; detects intent | Lower precision; high false positive rate |
| Ensemble classifier | Combines multiple signals for robustness | Higher latency; more complex deployment |
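The ensemble row above can be as simple as a weighted combination of the individual detector outputs. The sketch below assumes three upstream signals (a classifier probability, a perplexity flag, and a similarity score); the weights and the 0.6 threshold are illustrative placeholders, not tuned values.

```python
def ensemble_jailbreak_score(
    classifier_score: float,   # 0-1 probability from a fine-tuned classifier
    perplexity_flag: bool,     # output of a perplexity-based detector
    similarity_score: float,   # max cosine similarity to known jailbreaks, 0-1
    weights: tuple[float, float, float] = (0.5, 0.2, 0.3),  # illustrative
) -> float:
    """Combine independent detector signals into a single risk score in [0, 1]."""
    w_clf, w_ppl, w_sim = weights
    return (
        w_clf * classifier_score
        + w_ppl * (1.0 if perplexity_flag else 0.0)
        + w_sim * similarity_score
    )


def is_adversarial(score: float, threshold: float = 0.6) -> bool:
    return score >= threshold
```

Weighting the classifier most heavily reflects the table's trade-offs: it has the highest precision on known patterns, while the other two signals mainly add robustness against its blind spots.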
Perplexity-Based Detection
Adversarial inputs, particularly GCG-style suffixes and encoded payloads, often have abnormally high perplexity — the model finds them "surprising" because they do not resemble natural language.
```python
import math

import torch


def detect_adversarial_by_perplexity(
    text: str, model, tokenizer, threshold: float = 100.0
) -> bool:
    """Flag inputs with unusually high perplexity under a monitoring model."""
    tokens = tokenizer.encode(text)
    if len(tokens) < 2:
        return False  # too short to score
    with torch.no_grad():
        outputs = model(torch.tensor([tokens]))
        log_probs = outputs.logits.log_softmax(dim=-1)
    # Log-probability the model assigned to each actual next token
    token_log_probs = [
        log_probs[0, i, tokens[i + 1]].item()
        for i in range(len(tokens) - 1)
    ]
    perplexity = math.exp(-sum(token_log_probs) / len(token_log_probs))
    return perplexity > threshold
```

Limitation: Perplexity-based detection catches garbled adversarial strings but misses natural-language jailbreaks that use fluent, low-perplexity text.
Semantic Similarity Detection
Compare incoming prompts against a database of known jailbreak templates using embedding similarity:
- Compute embeddings for all known jailbreak prompts
- For each incoming prompt, compute its embedding and find the nearest neighbors in the jailbreak database
- Flag prompts with cosine similarity above a threshold to any known jailbreak
Limitation: Effective against variations of known attacks but cannot detect genuinely novel techniques.
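The nearest-neighbor check above reduces to a cosine-similarity scan over the jailbreak database. A minimal sketch, assuming embeddings have already been computed by some embedding model; the 0.85 threshold is an illustrative starting point, not a calibrated value:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches_known_jailbreak(
    prompt_embedding: list[float],
    jailbreak_embeddings: list[list[float]],
    threshold: float = 0.85,  # illustrative; tune against real traffic
) -> bool:
    """Flag a prompt whose embedding is close to any known jailbreak embedding."""
    return any(
        cosine_similarity(prompt_embedding, jb) >= threshold
        for jb in jailbreak_embeddings
    )
```

In production the linear scan would be replaced by an approximate nearest-neighbor index, but the decision rule is the same.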
Behavioral Pattern Detection
Rather than classifying individual inputs, track behavioral patterns across a user session:
| Pattern | Indicates | Detection Method |
|---|---|---|
| Rapidly iterating on similar prompts | Jailbreak search | Edit distance between consecutive prompts |
| Escalating from benign to sensitive topics | Crescendo attack | Topic classification per turn |
| Frequent role-play or persona requests | Role-based jailbreak | Intent classification for persona switching |
| Unusually long inputs | Context window exploitation | Input length monitoring |
| Encoded or obfuscated content | Evasion attempt | Entropy and character distribution analysis |
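The entropy and character-distribution check in the last row can be sketched with standard-library tools; both thresholds below are illustrative assumptions, not calibrated values:

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Bits per character of the empirical character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_obfuscated(
    text: str,
    entropy_threshold: float = 5.0,     # illustrative; English prose sits near 4 bits
    non_ascii_threshold: float = 0.3,   # illustrative
) -> bool:
    """Flag inputs whose character statistics suggest encoding or obfuscation."""
    non_ascii_ratio = sum(ord(c) > 127 for c in text) / max(len(text), 1)
    return (
        shannon_entropy(text) > entropy_threshold
        or non_ascii_ratio > non_ascii_threshold
    )
```

High entropy catches base64 or compressed payloads; the non-ASCII ratio catches homoglyph and fullwidth-character substitutions.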
Detecting Unusual Usage Patterns
Beyond individual jailbreak attempts, aggregate usage patterns reveal systematic adversarial activity.
User-Level Anomalies
| Signal | Normal Range | Anomalous | Possible Cause |
|---|---|---|---|
| Requests per hour | 5-50 | 500+ | Automated scanning or extraction |
| Average input length | 20-200 tokens | 2000+ tokens | Context stuffing or payload delivery |
| Average output length | 50-500 tokens | 4000+ tokens | Extraction or data exfiltration |
| Unique topics per session | 1-5 | 50+ | Systematic probing |
| Refusal rate | 0-5% | 30%+ | Active jailbreak attempt |
| Tool call ratio | 10-30% | 90%+ | Tool abuse through injection |
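A simple way to operationalize this table is a z-score test of each user metric against a baseline sample. The metric names and the 3-sigma cutoff below are illustrative:

```python
import statistics


def anomalous_metrics(
    user_metrics: dict[str, float],
    baseline_samples: dict[str, list[float]],
    z_threshold: float = 3.0,  # illustrative 3-sigma rule
) -> list[str]:
    """Return the metrics where a user deviates strongly from the baseline."""
    flagged = []
    for name, value in user_metrics.items():
        samples = baseline_samples.get(name, [])
        if len(samples) < 2:
            continue  # not enough history to establish a baseline
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples)
        if stdev == 0:
            continue  # degenerate baseline; skip rather than divide by zero
        if abs(value - mean) / stdev > z_threshold:
            flagged.append(name)
    return flagged
```

A z-score test assumes roughly symmetric metric distributions; heavy-tailed metrics such as request rate are often better handled with percentile-based cutoffs.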
Temporal Anomalies
- Off-hours activity: Concentrated usage during hours when the user base is normally inactive
- Burst patterns: Short periods of intense activity followed by long pauses (characteristic of automated tools running in batches)
- Periodic patterns: Regular intervals between requests suggesting scripted access
- Session length anomalies: Very long sessions or sessions with an unusual number of turns
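Periodic, scripted access can be flagged with the coefficient of variation of inter-request gaps: human usage produces highly variable gaps, while scripts fire at near-constant intervals. The 0.1 cutoff below is an illustrative assumption:

```python
import statistics


def looks_scripted(request_timestamps: list[float], cv_threshold: float = 0.1) -> bool:
    """Flag sessions whose inter-request intervals are suspiciously regular."""
    if len(request_timestamps) < 3:
        return False  # need at least two gaps
    gaps = [b - a for a, b in zip(request_timestamps, request_timestamps[1:])]
    mean_gap = statistics.mean(gaps)
    if mean_gap <= 0:
        return False
    # Low coefficient of variation = near-constant intervals = likely a script
    return statistics.stdev(gaps) / mean_gap < cv_threshold
```

Sophisticated tools add random jitter to their request timing, so this signal works best combined with the burst and off-hours checks above.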
Cohort-Based Anomalies
Compare individual users against their cohort rather than global baselines:
- A developer using a code assistant at 200 requests/hour may be normal for developers but anomalous for customer support agents
- Topic distribution should be compared against users with similar roles or use cases
- New accounts with behavior patterns matching known attack tools should be flagged
Output Drift Detection
Output drift occurs when the model's responses gradually change in character, quality, or alignment. This can indicate model degradation, poisoned fine-tuning, or RAG source corruption.
Types of Output Drift
| Drift Type | Description | Detection Method |
|---|---|---|
| Quality drift | Outputs become less coherent or less accurate over time | Automated quality scoring (perplexity, BLEU, semantic coherence) |
| Alignment drift | Outputs become less aligned with safety policies | Periodic safety benchmark evaluation |
| Topic drift | Outputs shift toward unexpected topics | Topic classifier applied to output distribution |
| Tone drift | Output sentiment or formality changes | Sentiment analysis over rolling windows |
| Factuality drift | Outputs contain more ungrounded claims | Fact-checking against knowledge base |
Drift Detection Methods
Statistical tests: Apply distribution comparison tests (KL divergence, Jensen-Shannon divergence, Kolmogorov-Smirnov test) to output feature distributions over time windows.
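Jensen-Shannon divergence is straightforward to compute over two histograms (for example, topic or output-length distributions from consecutive time windows). The 0.1 alert threshold below is an illustrative placeholder:

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence between two distributions (0 to 1 bits)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def drift_detected(baseline: list[float], current: list[float],
                   threshold: float = 0.1) -> bool:
    """Compare feature histograms between time windows and flag large shifts."""
    return js_divergence(baseline, current) > threshold
```

JS divergence is preferable to raw KL here because it is symmetric and bounded, which makes a fixed alert threshold meaningful across different metrics.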
Reference-based monitoring: Maintain a "golden set" of expected responses to canonical queries. Periodically re-evaluate the model against this golden set and alert on significant changes.
Embedding drift: Track the centroid and variance of output embeddings over time. A shifting centroid indicates systematic drift; increasing variance indicates decreasing consistency.
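A minimal sketch of centroid and spread tracking over two embedding windows, using Euclidean distance as the drift measure (an assumption; cosine distance is an equally common choice):

```python
import math


def centroid(embeddings: list[list[float]]) -> list[float]:
    n, dim = len(embeddings), len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def centroid_shift(window_a: list[list[float]],
                   window_b: list[list[float]]) -> float:
    """Distance between the embedding centroids of two time windows."""
    ca, cb = centroid(window_a), centroid(window_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


def mean_spread(embeddings: list[list[float]]) -> float:
    """Average distance of embeddings from their centroid (consistency proxy)."""
    c = centroid(embeddings)
    return sum(
        math.sqrt(sum((x - y) ** 2 for x, y in zip(e, c)))
        for e in embeddings
    ) / len(embeddings)
```

Alert when `centroid_shift` between consecutive windows exceeds its historical range (systematic drift), or when `mean_spread` trends upward (decreasing consistency).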
Embedding Space Monitoring
Embedding-based monitoring provides a powerful, representation-level view of system behavior.
Input Embedding Monitoring
Plot input embeddings over time to detect shifts in what users are asking about:
- Cluster emergence: New clusters of inputs appearing that were not present historically may indicate coordinated adversarial activity
- Cluster drift: Existing clusters shifting location may indicate gradual topic evolution or manipulation
- Outlier detection: Inputs far from any existing cluster may be novel attacks
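The outlier check above can be sketched as a distance test against every known cluster centroid; the `max_distance` cutoff is a hypothetical parameter that would need calibration against your own traffic:

```python
import math


def is_outlier(embedding: list[float],
               cluster_centroids: list[list[float]],
               max_distance: float) -> bool:
    """Flag an input embedding far from every known cluster of normal traffic."""
    def dist(a: list[float], b: list[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return all(dist(embedding, c) > max_distance for c in cluster_centroids)
```

A common choice for `max_distance` is a high percentile of within-cluster distances observed during the baseline period.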
Output Embedding Monitoring
Track output embeddings to detect changes in what the model is producing:
- Alignment corridor: Define an expected region of embedding space for acceptable outputs. Outputs that fall outside this corridor may violate policy.
- Forbidden regions: Define regions of embedding space corresponding to known harmful content categories. Outputs near these regions trigger alerts.
- Similarity to known-harmful content: Compute similarity between outputs and a database of harmful examples.
RAG Source Monitoring
For retrieval-augmented generation systems, monitor the embedding space of retrieved documents:
- Source contamination: New documents appearing in the retrieval corpus that cluster near adversarial content
- Retrieval drift: Changes in which documents are most frequently retrieved
- Injection detection: Documents that contain embedded instructions (high similarity to prompt injection templates)
Building an Anomaly Detection Pipeline
Establish Baselines
Collect 2-4 weeks of normal operation data. Compute statistical baselines for all key metrics. Identify natural patterns (daily cycles, weekly cycles, seasonal patterns).
Select Detection Methods
For each threat type, choose appropriate detection methods. Combine multiple methods for defense in depth. Prioritize methods based on the most likely attack vectors for your system.
Set Initial Thresholds
Start with conservative thresholds (low sensitivity, minimal false positives). Use historical data to calibrate detection rates.
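Calibrating a conservative threshold from historical data can be as simple as taking a high percentile of the observed metric. The nearest-rank method is shown below; the 99th percentile is an illustrative default:

```python
def percentile_threshold(history: list[float], percentile: float = 99.0) -> float:
    """Set an alert threshold at a high percentile of historical metric values."""
    ordered = sorted(history)
    # Nearest-rank percentile: the value at or below which
    # `percentile`% of historical observations fall
    rank = max(0, min(len(ordered) - 1,
                      round(percentile / 100 * len(ordered)) - 1))
    return ordered[rank]
```

Starting at the 99th percentile keeps false positives near 1% of historical traffic; the threshold can then be lowered as the alert review process matures.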
Deploy Monitoring
Implement real-time monitoring for critical signals and batch monitoring for trend analysis. Ensure monitoring does not introduce significant latency.
Tune and Iterate
Review alerts weekly. Adjust thresholds based on false positive/negative rates. Add new detection rules as new attack patterns emerge. Remove rules that consistently produce false positives.
Evading Anomaly Detection
Red teamers should understand how anomaly detection can be circumvented:
- Baseline pollution: Gradually introduce adversarial patterns during the baseline collection period so they become "normal"
- Mimicry: Study normal user behavior and match your attack patterns to it (request timing, topic distribution, input length)
- Distributed attacks: Use multiple accounts or IP addresses to keep per-user metrics within normal ranges
- Gradual escalation: Change behavior slowly enough that drift detection does not trigger (the "boiling frog" approach)
- Noise injection: Mix adversarial requests with legitimate requests to dilute statistical signals
Related Topics
- AI Monitoring and Observability — the broader monitoring architecture
- Logging Architecture — capturing the data anomaly detection needs
- Input/Output Filtering — preventive controls that anomaly detection complements
- Prompt Shields — inline detection approaches
References
- "Detecting Language Model Attacks with Perplexity" - Alon & Kamfonas (2023) - Research on using perplexity as a signal for detecting adversarial LLM inputs
- "Embedding Space Anomaly Detection for LLM Safety" - Chen et al. (2025) - Methods for detecting harmful outputs through embedding space monitoring
- "Data Drift Detection for ML Systems" - Huyen (2024) - Comprehensive overview of drift detection methods applicable to AI monitoring
- "Baseline Defenses for Adversarial Attacks Against Aligned Language Models" - Jain et al. (2023) - Evaluation of detection and preprocessing defenses (perplexity filtering, paraphrasing, retokenization) against adversarial inputs to LLMs
Why is perplexity-based detection effective against GCG-style adversarial suffixes but not natural-language jailbreaks?