AI Anomaly Detection
Detecting jailbreak attempts, unusual usage patterns, output drift, and embedding space anomalies in AI systems through statistical and ML-based methods.
The Anomaly Detection Problem for AI
Anomaly detection in traditional software is relatively straightforward: define "normal" (response time under 200ms, error rate below 1%), and alert when observations deviate. For AI systems, "normal" is much harder to define because the system's outputs are inherently variable — two identical prompts can produce different responses, and a wide range of topics and conversation styles are all legitimate.
Detecting Jailbreak Attempts
Jailbreak detection is the most immediately actionable form of AI anomaly detection. Several complementary approaches can be combined:
Classifier-Based Detection
Train a dedicated classifier to distinguish between benign and adversarial inputs.
| Approach | Pros | Cons |
|---|---|---|
| Fine-tuned transformer | High accuracy on known patterns | Misses novel attacks; requires labeled training data |
| Zero-shot classification | No training data needed; detects intent | Lower precision; high false positive rate |
| Ensemble classifier | Combines multiple signals for robustness | Higher latency; more complex deployment |
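The ensemble row above can be as simple as a weighted combination of the individual detector outputs. The sketch below assumes three upstream signals (a classifier probability, a perplexity flag, and a similarity score); the weights and the 0.6 threshold are illustrative placeholders, not tuned values.

```python
def ensemble_jailbreak_score(
    classifier_score: float,   # 0-1 probability from a fine-tuned classifier
    perplexity_flag: bool,     # output of a perplexity-based detector
    similarity_score: float,   # max cosine similarity to known jailbreaks, 0-1
    weights: tuple[float, float, float] = (0.5, 0.2, 0.3),  # illustrative
) -> float:
    """Combine independent detector signals into a single risk score in [0, 1]."""
    w_clf, w_ppl, w_sim = weights
    return (
        w_clf * classifier_score
        + w_ppl * (1.0 if perplexity_flag else 0.0)
        + w_sim * similarity_score
    )


def is_adversarial(score: float, threshold: float = 0.6) -> bool:
    return score >= threshold
```

Weighting the classifier most heavily reflects the table's trade-offs: it has the highest precision on known patterns, while the other two signals mainly add robustness against its blind spots.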
Perplexity-Based Detection
Adversarial inputs, particularly GCG-style suffixes and encoded payloads, often have abnormally high perplexity — the model finds them "surprising" because they do not resemble natural language.
```python
import math

import torch


def detect_adversarial_by_perplexity(
    text: str, model, tokenizer, threshold: float = 100.0
) -> bool:
    """Flag inputs with unusually high perplexity under a monitoring model."""
    tokens = tokenizer.encode(text)
    if len(tokens) < 2:
        return False  # too short to score
    with torch.no_grad():
        outputs = model(torch.tensor([tokens]))
        log_probs = outputs.logits.log_softmax(dim=-1)
    # Log-probability the model assigned to each actual next token
    token_log_probs = [
        log_probs[0, i, tokens[i + 1]].item()
        for i in range(len(tokens) - 1)
    ]
    perplexity = math.exp(-sum(token_log_probs) / len(token_log_probs))
    return perplexity > threshold
```

Limitation: Perplexity-based detection catches garbled adversarial strings but misses natural-language jailbreaks that use fluent, low-perplexity text.
Semantic Similarity Detection
Compare incoming prompts against a database of known jailbreak templates using embedding similarity:
- Compute embeddings for all known jailbreak prompts
- For each incoming prompt, compute its embedding and find the nearest neighbors in the jailbreak database
- Flag prompts with cosine similarity above a threshold to any known jailbreak
Limitation: Effective against variations of known attacks but cannot detect genuinely novel techniques.
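The nearest-neighbor check above reduces to a cosine-similarity scan over the jailbreak database. A minimal sketch, assuming embeddings have already been computed by some embedding model; the 0.85 threshold is an illustrative starting point, not a calibrated value:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches_known_jailbreak(
    prompt_embedding: list[float],
    jailbreak_embeddings: list[list[float]],
    threshold: float = 0.85,  # illustrative; tune against real traffic
) -> bool:
    """Flag a prompt whose embedding is close to any known jailbreak embedding."""
    return any(
        cosine_similarity(prompt_embedding, jb) >= threshold
        for jb in jailbreak_embeddings
    )
```

In production the linear scan would be replaced by an approximate nearest-neighbor index, but the decision rule is the same.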
Behavioral Pattern Detection
Rather than classifying individual inputs, track behavioral patterns across a user session:
| Pattern | Indicates | Detection Method |
|---|---|---|
| Rapidly iterating on similar prompts | Jailbreak search | Edit distance between consecutive prompts |
| Escalating from benign to sensitive topics | Crescendo attack | Topic classification per turn |
| Frequent role-play or persona requests | Role-based jailbreak | Intent classification for persona switching |
| Unusually long inputs | Context window exploitation | Input length monitoring |
| Encoded or obfuscated content | Evasion attempt | Entropy and character distribution analysis |
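The entropy and character-distribution check in the last row can be sketched with standard-library tools; both thresholds below are illustrative assumptions, not calibrated values:

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Bits per character of the empirical character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_obfuscated(
    text: str,
    entropy_threshold: float = 5.0,     # illustrative; English prose sits near 4 bits
    non_ascii_threshold: float = 0.3,   # illustrative
) -> bool:
    """Flag inputs whose character statistics suggest encoding or obfuscation."""
    non_ascii_ratio = sum(ord(c) > 127 for c in text) / max(len(text), 1)
    return (
        shannon_entropy(text) > entropy_threshold
        or non_ascii_ratio > non_ascii_threshold
    )
```

High entropy catches base64 or compressed payloads; the non-ASCII ratio catches homoglyph and fullwidth-character substitutions.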
Detecting Unusual Usage Patterns
Beyond individual jailbreak attempts, aggregate usage patterns reveal systematic adversarial activity.
User-Level Anomalies
| Signal | Normal Range | Anomalous | Possible Cause |
|---|---|---|---|
| Requests per hour | 5-50 | 500+ | Automated scanning or extraction |
| Average input length | 20-200 tokens | 2000+ tokens | Context stuffing or payload delivery |
| Average output length | 50-500 tokens | 4000+ tokens | Extraction or data exfiltration |
| Unique topics per session | 1-5 | 50+ | Systematic probing |
| Refusal rate | 0-5% | 30%+ | Active jailbreak attempt |
| Tool call ratio | 10-30% | 90%+ | Tool abuse through injection |
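A simple way to operationalize this table is a z-score test of each user metric against a baseline sample. The metric names and the 3-sigma cutoff below are illustrative:

```python
import statistics


def anomalous_metrics(
    user_metrics: dict[str, float],
    baseline_samples: dict[str, list[float]],
    z_threshold: float = 3.0,  # illustrative 3-sigma rule
) -> list[str]:
    """Return the metrics where a user deviates strongly from the baseline."""
    flagged = []
    for name, value in user_metrics.items():
        samples = baseline_samples.get(name, [])
        if len(samples) < 2:
            continue  # not enough history to establish a baseline
        mean = statistics.mean(samples)
        stdev = statistics.stdev(samples)
        if stdev == 0:
            continue  # degenerate baseline; skip rather than divide by zero
        if abs(value - mean) / stdev > z_threshold:
            flagged.append(name)
    return flagged
```

A z-score test assumes roughly symmetric metric distributions; heavy-tailed metrics such as request rate are often better handled with percentile-based cutoffs.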
Temporal Anomalies
- Off-hours activity: Concentrated usage during hours when the user base is normally inactive
- Burst patterns: Short periods of intense activity followed by long pauses (characteristic of automated tools running in batches)
- Periodic patterns: Regular intervals between requests suggesting scripted access
- Session length anomalies: Very long sessions or sessions with an unusual number of turns
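Periodic, scripted access can be flagged with the coefficient of variation of inter-request gaps: human usage produces highly variable gaps, while scripts fire at near-constant intervals. The 0.1 cutoff below is an illustrative assumption:

```python
import statistics


def looks_scripted(request_timestamps: list[float], cv_threshold: float = 0.1) -> bool:
    """Flag sessions whose inter-request intervals are suspiciously regular."""
    if len(request_timestamps) < 3:
        return False  # need at least two gaps
    gaps = [b - a for a, b in zip(request_timestamps, request_timestamps[1:])]
    mean_gap = statistics.mean(gaps)
    if mean_gap <= 0:
        return False
    # Low coefficient of variation = near-constant intervals = likely a script
    return statistics.stdev(gaps) / mean_gap < cv_threshold
```

Sophisticated tools add random jitter to their request timing, so this signal works best combined with the burst and off-hours checks above.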
Cohort-Based Anomalies
Compare individual users against their cohort rather than global baselines:
- A developer using a code assistant at 200 requests/hour may be normal for developers but anomalous for customer support agents
- Topic distribution should be compared against users with similar roles or use cases
- New accounts with behavior patterns matching known attack tools should be flagged
Output Drift Detection
Output drift occurs when the model's responses gradually change in character, quality, or alignment. This can indicate model degradation, poisoned fine-tuning, or RAG source corruption.
Types of Output Drift
| Drift Type | Description | Detection Method |
|---|---|---|
| Quality drift | Outputs become less coherent or less accurate over time | Automated quality scoring (perplexity, BLEU, semantic coherence) |
| Alignment drift | Outputs become less aligned with safety policies | Periodic safety benchmark evaluation |
| Topic drift | Outputs shift toward unexpected topics | Topic classifier applied to output distribution |
| Tone drift | Output sentiment or formality changes | Sentiment analysis over rolling windows |
| Factuality drift | Outputs contain more ungrounded claims | Fact-checking against knowledge base |
Drift Detection Methods
Statistical tests: Apply distribution comparison tests (KL divergence, Jensen-Shannon divergence, Kolmogorov-Smirnov test) to output feature distributions over time windows.
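Jensen-Shannon divergence is straightforward to compute over two histograms (for example, topic or output-length distributions from consecutive time windows). The 0.1 alert threshold below is an illustrative placeholder:

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence between two distributions (0 to 1 bits)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def drift_detected(baseline: list[float], current: list[float],
                   threshold: float = 0.1) -> bool:
    """Compare feature histograms between time windows and flag large shifts."""
    return js_divergence(baseline, current) > threshold
```

JS divergence is preferable to raw KL here because it is symmetric and bounded, which makes a fixed alert threshold meaningful across different metrics.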
Reference-based monitoring: Maintain a "golden set" of expected responses to canonical queries. Periodically re-evaluate the model against this golden set and alert on significant changes.
Embedding drift: Track the centroid and variance of output embeddings over time. A shifting centroid indicates systematic drift; increasing variance indicates decreasing consistency.
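A minimal sketch of centroid and spread tracking over two embedding windows, using Euclidean distance as the drift measure (an assumption; cosine distance is an equally common choice):

```python
import math


def centroid(embeddings: list[list[float]]) -> list[float]:
    n, dim = len(embeddings), len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def centroid_shift(window_a: list[list[float]],
                   window_b: list[list[float]]) -> float:
    """Distance between the embedding centroids of two time windows."""
    ca, cb = centroid(window_a), centroid(window_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


def mean_spread(embeddings: list[list[float]]) -> float:
    """Average distance of embeddings from their centroid (consistency proxy)."""
    c = centroid(embeddings)
    return sum(
        math.sqrt(sum((x - y) ** 2 for x, y in zip(e, c)))
        for e in embeddings
    ) / len(embeddings)
```

Alert when `centroid_shift` between consecutive windows exceeds its historical range (systematic drift), or when `mean_spread` trends upward (decreasing consistency).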
Embedding Space Monitoring
Embedding-based monitoring provides a powerful, representation-level view of system behavior.
Input Embedding Monitoring
Plot input embeddings over time to detect shifts in what users are asking about:
- Cluster emergence: New clusters of inputs appearing that were not present historically may indicate coordinated adversarial activity
- Cluster drift: Existing clusters shifting location may indicate gradual topic evolution or manipulation
- Outlier detection: Inputs far from any existing cluster may be novel attacks
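The outlier check above can be sketched as a distance test against every known cluster centroid; the `max_distance` cutoff is a hypothetical parameter that would need calibration against your own traffic:

```python
import math


def is_outlier(embedding: list[float],
               cluster_centroids: list[list[float]],
               max_distance: float) -> bool:
    """Flag an input embedding far from every known cluster of normal traffic."""
    def dist(a: list[float], b: list[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return all(dist(embedding, c) > max_distance for c in cluster_centroids)
```

A common choice for `max_distance` is a high percentile of within-cluster distances observed during the baseline period.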
Output Embedding Monitoring
Track output embeddings to detect changes in what the model is producing:
- Alignment corridor: Define an expected region of embedding space for acceptable outputs. Outputs that fall outside this corridor may violate policy.
- Forbidden regions: Define regions of embedding space corresponding to known harmful content categories. Outputs near these regions trigger alerts.
- Similarity to known-harmful content: Compute similarity between outputs and a database of harmful examples.
RAG Source Monitoring
For retrieval-augmented generation systems, monitor the embedding space of retrieved documents:
- Source contamination: New documents appearing in the retrieval corpus that cluster near adversarial content
- Retrieval drift: Changes in which documents are most frequently retrieved
- Injection detection: Documents that contain embedded instructions (high similarity to prompt injection templates)
Building an Anomaly Detection Pipeline
Establish Baselines
Collect 2-4 weeks of normal operation data. Compute statistical baselines for all key metrics. Identify natural patterns (daily cycles, weekly cycles, seasonal patterns).
Select Detection Methods
For each threat type, choose appropriate detection methods. Combine multiple methods for defense in depth. Prioritize methods based on the most likely attack vectors for your system.
Set Initial Thresholds
Start with conservative thresholds (low sensitivity, minimal false positives). Use historical data to calibrate detection rates.
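Calibrating a conservative threshold from historical data can be as simple as taking a high percentile of the observed metric. The nearest-rank method is shown below; the 99th percentile is an illustrative default:

```python
def percentile_threshold(history: list[float], percentile: float = 99.0) -> float:
    """Set an alert threshold at a high percentile of historical metric values."""
    ordered = sorted(history)
    # Nearest-rank percentile: the value at or below which
    # `percentile`% of historical observations fall
    rank = max(0, min(len(ordered) - 1,
                      round(percentile / 100 * len(ordered)) - 1))
    return ordered[rank]
```

Starting at the 99th percentile keeps false positives near 1% of historical traffic; the threshold can then be lowered as the alert review process matures.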
Deploy Monitoring
Implement real-time monitoring for critical signals and batch monitoring for trend analysis. Ensure monitoring does not introduce significant latency.
Tune and Iterate
Review alerts weekly. Adjust thresholds based on false positive/negative rates. Add new detection rules as new attack patterns emerge. Remove rules that consistently produce false positives.
Evading Anomaly Detection
Red teamers should understand how anomaly detection can be circumvented:
- Baseline pollution: Gradually introduce adversarial patterns during the baseline collection period so they become "normal"
- Mimicry: Study normal user behavior and match your attack patterns to it (request timing, topic distribution, input length)
- Distributed attacks: Use multiple accounts or IP addresses to keep per-user metrics within normal ranges
- Gradual escalation: Change behavior slowly enough that drift detection does not trigger (the "boiling frog" approach)
- Noise injection: Mix adversarial requests with legitimate requests to dilute statistical signals
Related Topics
- AI Monitoring and Observability — the broader monitoring architecture
- Logging Architecture — capturing the data anomaly detection needs
- Input/Output Filtering — preventive controls that anomaly detection complements
- Prompt Shields — inline detection approaches
References
- "Detecting Language Model Attacks with Perplexity" - Alon & Kamfonas (2023) - Research on using perplexity as a signal for detecting adversarial LLM inputs
- "Embedding Space Anomaly Detection for LLM Safety" - Chen et al. (2025) - Methods for detecting harmful outputs through embedding space monitoring
- "Data Drift Detection for ML Systems" - Huyen (2024) - Comprehensive overview of drift detection methods applicable to AI monitoring
- "Baseline Defenses for Adversarial Attacks Against Aligned Language Models" - Jain et al. (2023) - Evaluation of detection and preprocessing defenses (perplexity filtering, paraphrasing, retokenization) against adversarial inputs to LLMs
Why is perplexity-based detection effective against GCG-style adversarial suffixes but not natural-language jailbreaks?