AI Monitoring and Observability
What to monitor in AI systems, key metrics for detecting abuse and drift, alerting strategies, and observability architecture for LLM applications.
Why AI Monitoring Is Different
Traditional application monitoring focuses on uptime, latency, and error rates. AI monitoring must go further — tracking the behavior of a probabilistic system whose outputs are unpredictable by design. A traditional application either works or it does not. An AI application can "work" (return a 200 status code with a valid response) while producing harmful, incorrect, or manipulated content.
What to Monitor
AI monitoring spans three domains: system metrics, behavioral metrics, and security metrics.
System Metrics
These are the traditional observability metrics adapted for AI workloads:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Latency (TTFT) | Time to first token | Abnormally high latency may indicate adversarial inputs that cause excessive computation |
| Latency (total) | Total response generation time | Sudden increases may signal prompt injection triggering complex tool chains |
| Token throughput | Tokens processed per second | Drops may indicate resource contention or attack |
| Error rate | Percentage of failed requests | Spikes may indicate automated attack attempts |
| Token usage | Input/output tokens per request | Unusually high usage may indicate extraction attempts or context stuffing |
| Cost | Dollar spend per request/user/period | Cost spikes from high-token requests or excessive tool usage |
| GPU utilization | Compute resource consumption | Sustained high utilization may indicate denial-of-service |
Behavioral Metrics
These metrics track the quality and appropriateness of model outputs:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Refusal rate | Percentage of requests the model refuses to answer | Sudden drops may indicate successful jailbreaking |
| Topic distribution | Distribution of conversation topics over time | Shifts may indicate systematic exploitation |
| Output similarity | Cosine similarity between outputs and known-harmful templates | Detects outputs that match harmful content patterns |
| Tool call frequency | How often the model invokes tools | Spikes may indicate tool abuse through prompt injection |
| Tool call patterns | Which tools are called and with what arguments | Unexpected tool calls may indicate adversarial manipulation |
| Sentiment drift | Changes in output sentiment over time | Gradual shifts may indicate subtle manipulation |
| Hallucination rate | Percentage of ungrounded claims in output | Increases may indicate poisoned retrieval sources |
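Refusal rate is one of the cheaper behavioral metrics to compute. A minimal rolling-window tracker might look like the following (the class name, window size, and tolerance are illustrative assumptions):

```python
from collections import deque

class RefusalRateTracker:
    """Tracks the fraction of refused responses over the last N requests."""

    def __init__(self, window: int = 1000):
        self.events: deque[bool] = deque(maxlen=window)  # True = refused

    def record(self, refused: bool) -> None:
        self.events.append(refused)

    @property
    def rate(self) -> float:
        if not self.events:
            return 0.0
        return sum(self.events) / len(self.events)

    def sudden_drop(self, baseline: float, tolerance: float = 0.5) -> bool:
        """Flag when the current rate falls below tolerance * baseline.
        A sudden drop in refusals may indicate successful jailbreaking."""
        window_full = len(self.events) == self.events.maxlen
        return window_full and self.rate < baseline * tolerance
```

The same pattern generalizes to other rate-style behavioral metrics such as tool call frequency.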
Security Metrics
These metrics specifically target adversarial activity:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Injection attempt rate | Requests flagged by prompt injection detectors | Tracks attack volume and trends |
| Guardrail trigger rate | How often each guardrail blocks a request | Changes indicate new attack patterns or guardrail degradation |
| System prompt leakage | Outputs containing system prompt fragments | Indicates successful extraction attempts |
| PII exposure rate | Outputs containing detected PII | Tracks data leakage |
| Compromised key detection rate | How often leaked or abused API keys are detected and rotated | Indicates key management health |
| User anomaly score | Per-user behavioral deviation from baseline | Identifies accounts being used for adversarial testing |
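A per-user anomaly score can be as simple as a z-score against the user's cohort. This sketch (function names are illustrative) computes how far a user's metric, such as requests per hour, deviates from cohort behavior:

```python
import statistics

def anomaly_score(user_value: float, cohort_values: list[float]) -> float:
    """Z-score of a user's metric (e.g. requests/hour) against their cohort.
    Larger absolute scores mean larger deviation from baseline behavior."""
    mean = statistics.fmean(cohort_values)
    stdev = statistics.pstdev(cohort_values)
    if stdev == 0:
        # Degenerate cohort: any deviation at all is maximally anomalous.
        return 0.0 if user_value == mean else float("inf")
    return (user_value - mean) / stdev
```

Production systems typically combine several such features (request rate, token usage, guardrail triggers) into a composite score, but the cohort-relative principle is the same.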
Alerting Strategy
Effective alerting balances detection sensitivity against alert fatigue. For AI systems, this balance is particularly challenging because the system's outputs are inherently variable.
Alert Tiers
Critical (Immediate Response)
Conditions that indicate active exploitation or data breach. Examples: system prompt fully extracted, PII appearing in outputs, tool calls to unauthorized services, cost spike exceeding threshold. Response: page on-call, automatic mitigation (rate limit or block user).
High (Within 1 Hour)
Conditions that suggest ongoing attack or significant drift. Examples: sustained increase in guardrail trigger rate, new jailbreak pattern appearing repeatedly, abnormal tool call patterns. Response: alert security team, investigate within the hour.
Medium (Within 1 Business Day)
Conditions that indicate potential issues requiring investigation. Examples: gradual change in topic distribution, increasing refusal rate (possible over-filtering), new user accounts with unusually high usage. Response: queue for investigation.
Low (Weekly Review)
Trends and patterns for ongoing security posture assessment. Examples: slow drift in output quality metrics, changes in user population behavior, emerging patterns in blocked requests. Response: include in weekly security review.
Dynamic Baselines
Static alert thresholds fail for AI systems because normal behavior varies with usage patterns, model updates, and seasonal changes. Use dynamic baselines:
- Rolling window baselines: Compare current metrics against the same metric over the past 7-30 days
- Percentile-based thresholds: Alert when a metric exceeds the 99th percentile of its historical distribution
- Rate-of-change alerts: Alert when a metric changes by more than N% within a time window
- Cohort comparison: Compare a user's behavior against their cohort rather than a global threshold
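Two of these baseline strategies are simple enough to sketch directly. Assuming a plain list of historical samples (real systems would query a time-series database), a percentile threshold and a rate-of-change check look like:

```python
def percentile_threshold(history: list[float], q: float = 0.99) -> float:
    """Alert threshold at the q-th quantile of the metric's history."""
    ordered = sorted(history)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def rate_of_change_alert(prev: float, current: float,
                         max_pct: float = 50.0) -> bool:
    """Alert when the metric moves more than max_pct percent
    between consecutive windows."""
    if prev == 0:
        return current > 0
    return abs(current - prev) / prev * 100 > max_pct
```

Both are deliberately naive: they ignore seasonality, which is why rolling windows of 7-30 days (capturing weekly cycles) are recommended above.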
Observability Architecture
A complete AI observability stack has four layers:
Layer 1: Collection
Capture all relevant data at the point of generation:
- Request/response logging: Every prompt and completion with metadata (user ID, timestamp, model, parameters)
- Guardrail decisions: Every guardrail evaluation with its score and decision
- Tool call traces: Every tool invocation with arguments, results, and timing
- Infrastructure metrics: GPU utilization, memory, latency, error rates
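A common way to implement the collection layer is one structured JSON log line per request, carrying the prompt, completion, guardrail decisions, and tool call traces together. The field names below are illustrative, not a fixed schema:

```python
import datetime
import json
import uuid

def make_log_record(user_id: str, model: str, prompt: str, completion: str,
                    guardrail_decisions: list[dict],
                    tool_calls: list[dict]) -> str:
    """Serialize one request/response event as a JSON log line."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user_id": user_id,
        "model": model,
        "prompt": prompt,          # consider redacting PII before storage
        "completion": completion,
        # e.g. [{"name": "injection", "score": 0.1, "action": "allow"}]
        "guardrails": guardrail_decisions,
        # e.g. [{"tool": "search", "args": {"q": "..."}, "latency_ms": 120}]
        "tool_calls": tool_calls,
    }
    return json.dumps(record)
```

Keeping all four data types keyed by a shared `request_id` is what makes later incident investigation possible: a suspicious tool call can be traced back to the exact prompt and guardrail scores that allowed it.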
Layer 2: Storage
Store collected data in systems optimized for the access patterns needed:
| Data Type | Storage | Retention | Access Pattern |
|---|---|---|---|
| Metrics | Time-series DB (Prometheus, InfluxDB) | 90 days at full resolution | Dashboard queries, alerting |
| Logs | Log aggregator (Elasticsearch, Loki) | 30-90 days | Full-text search, investigation |
| Traces | Trace storage (Jaeger, Tempo) | 14-30 days | Request flow analysis |
| Conversations | Object store (S3) with metadata index | Per policy (30 days to 7 years) | Incident investigation, compliance |
Layer 3: Analysis
Process stored data to generate insights and detect anomalies:
- Real-time stream processing: Kafka/Flink for immediate pattern detection
- Batch analytics: Periodic analysis of accumulated data for trend detection
- ML-based anomaly detection: Models trained on normal behavior to detect deviations
- Embedding-based similarity: Compare outputs against known-harmful content databases
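The embedding-based similarity check reduces to a cosine comparison against a database of known-harmful template embeddings. A minimal pure-Python sketch (the threshold is an assumption and should be tuned on labeled data):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def matches_harmful_template(output_emb: list[float],
                             harmful_embs: list[list[float]],
                             threshold: float = 0.85) -> bool:
    """Flag an output whose embedding is close to any known-harmful template."""
    return any(cosine_similarity(output_emb, h) >= threshold
               for h in harmful_embs)
```

At scale this linear scan would be replaced by an approximate nearest-neighbor index, but the detection logic is the same.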
Layer 4: Visualization and Response
Present analysis results and enable action:
- Dashboards: Real-time visibility into system health and security posture
- Alert management: Routing, escalation, and tracking of security alerts
- Incident response tools: Ability to block users, revoke keys, and adjust guardrails in response to detected threats
- Reporting: Compliance reports, security posture reports, and trend analysis
AI-Specific Observability Tools
Several tools have emerged specifically for AI observability:
| Tool | Focus | Key Features |
|---|---|---|
| LangSmith | LangChain applications | Trace visualization, prompt versioning, evaluation |
| Langfuse | Open-source LLM observability | Tracing, scoring, prompt management |
| Weights & Biases (W&B) | ML experiment tracking | Training monitoring, model evaluation |
| Arize Phoenix | LLM and ML observability | Embedding drift detection, LLM tracing |
| Helicone | LLM usage analytics | Cost tracking, caching, rate limiting |
| OpenLLMetry | OpenTelemetry for LLMs | Standard instrumentation for LLM calls |
Monitoring as a Red Team Target
From the red team perspective, monitoring is both a constraint and a target:
Evading Monitoring
- Low and slow: Spread attacks across time to avoid rate-based detection
- Mimicking normal behavior: Match your request patterns to legitimate users
- Multiple accounts: Distribute attack attempts across accounts to avoid per-user anomaly detection
- Gradual escalation: Avoid sudden behavioral changes that trigger change-rate alerts
Attacking Monitoring
- Alert flooding: Generate high volumes of low-severity alerts to cause fatigue and mask real attacks
- Log injection: Inject misleading content into logs to complicate incident investigation
- Monitoring blind spots: Identify data that is not captured (e.g., streaming responses not logged, tool call arguments not recorded)
- Retention exploitation: Execute attacks and then wait for log retention to expire before launching the real attack
Related Topics
- Anomaly Detection — detecting jailbreak attempts and unusual patterns
- Logging Architecture — what to capture and how to store it
- Guardrails Architecture — the preventive controls that monitoring complements
- Runtime Monitoring — monitoring as a remediation strategy