AI System Log Analysis
AI system logging architecture for forensic investigation: inference logs, prompt and completion logs, tool call traces, embedding query logs, and logging infrastructure requirements.
AI system logs are fundamentally different from traditional application logs. While a web server log records URLs, status codes, and timestamps, an AI system log must capture the full semantic content of inputs and outputs -- because the "exploit" lives in the natural language, not in the request metadata. This page covers the logging architecture required for AI forensics and introduces the analysis techniques detailed in the subsections.
AI Logging Architecture
A forensically useful AI logging architecture captures data at four layers, each producing different evidence types.
The Four Log Layers
| Layer | What It Captures | Forensic Value | Example Systems |
|---|---|---|---|
| Inference logs | Model request/response metadata, latency, token counts, parameters | Detect anomalous patterns, timeline reconstruction | vLLM, TGI, Triton metrics, cloud provider logs |
| Prompt/completion logs | Full text of system prompts, user messages, and model responses | Attack payload analysis, impact assessment | Application-level logging, prompt management platforms |
| Tool call logs | Agent tool invocations, parameters, results, and authorization context | Detect unauthorized actions, trace lateral movement | LangChain callbacks, agent framework logs |
| Embedding/retrieval logs | Vector queries, similarity scores, retrieved document IDs and content | Detect RAG manipulation, identify data exposure scope | Vector database query logs, retrieval pipeline logs |
Logging Architecture Diagram
A typical AI system has logging opportunities at multiple points in the request flow:
User Request
│
├─→ [API Gateway Log] ─── authentication, rate limits, IP
│
├─→ [Application Log] ─── session context, user identity
│
├─→ [Prompt Assembly Log] ─── system prompt + user input + RAG context
│
├─→ [Safety Classifier Log] ─── input classification, scores
│
├─→ [Inference Log] ─── model version, parameters, latency, tokens
│
├─→ [Output Classifier Log] ─── output classification, scores
│
├─→ [Tool Call Log] ─── tool name, parameters, results
│
└─→ [Completion Log] ─── final response delivered to user
Each point captures different evidence. A forensically complete architecture logs at all of these points.
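The diagram above can be made concrete with a small logging helper. The sketch below is illustrative, not a reference implementation: it assumes JSON-lines structured logging, and the layer names and fields mirror the diagram. The key property is that every record carries the same request_id so the layers can be joined later.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("ai_pipeline")

def log_event(layer: str, request_id: str, **fields) -> dict:
    """Emit one structured record for a single pipeline point.

    Every record carries the same request_id so records from
    different layers can be correlated during an investigation.
    """
    record = {
        "layer": layer,
        "request_id": request_id,
        "timestamp": time.time(),
        **fields,
    }
    log.info(json.dumps(record))
    return record

# One request flowing through several of the points in the diagram.
request_id = f"req_{uuid.uuid4().hex[:8]}"
log_event("api_gateway", request_id, client_ip="203.0.113.7")
log_event("prompt_assembly", request_id, prompt_tokens=1203)
log_event("inference", request_id, model_id="model-v2.1", latency_ms=412)
log_event("completion", request_id, output_tokens=847)
```

In practice each layer would emit to its own sink, but the invariant is the same: the request_id minted at the gateway is propagated unchanged through every subsequent record.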
Inference Logs
Inference logs are produced by the model serving layer and contain the metadata of each model call.
Key Fields
| Field | Description | Forensic Use |
|---|---|---|
| request_id | Unique identifier for the inference request | Correlate across log layers |
| timestamp | Time of request and response | Timeline reconstruction |
| model_id | Exact model version serving the request | Determine which model was active during the incident |
| input_tokens | Number of input tokens | Detect unusually large inputs (stuffing attacks) |
| output_tokens | Number of output tokens | Detect unusually large outputs (data exfiltration) |
| latency_ms | Time from request to first/last token | Detect anomalous processing (adversarial inputs can cause latency spikes) |
| temperature | Sampling temperature used | Determine reproducibility expectations |
| top_p / top_k | Sampling parameters | Assess output distribution characteristics |
| finish_reason | Why generation stopped (length, stop token, content filter) | Detect content filter interventions |
| status_code | Success or error | Identify failed requests that may indicate attack attempts |
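A first-pass anomaly check over these fields can be sketched as a simple z-score filter. This assumes inference records have been parsed into dicts with the fields above; the threshold and field name are illustrative, and a real pipeline would use per-endpoint baselines rather than a single global distribution.

```python
from statistics import mean, stdev

def flag_token_anomalies(records, field="output_tokens", threshold=3.0):
    """Return request_ids whose token count deviates more than
    `threshold` standard deviations from the mean -- a crude first
    pass for spotting stuffing-sized inputs or exfiltration-sized
    outputs."""
    values = [r[field] for r in records]
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [
        r["request_id"]
        for r in records
        if abs(r[field] - mu) / sigma > threshold
    ]

# 30 ordinary responses plus one suspiciously large one.
logs = [{"request_id": f"req_{i}", "output_tokens": 500} for i in range(30)]
logs.append({"request_id": "req_big", "output_tokens": 90_000})
print(flag_token_anomalies(logs))  # → ['req_big']
```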
See Inference Log Analysis for detailed analysis techniques.
Prompt and Completion Logs
Prompt and completion logs capture the actual content exchanged between users and the model. These are the most forensically valuable and the most sensitive logs in the system.
Content Structure
A complete prompt/completion log entry should capture:
{
  "request_id": "req_abc123",
  "timestamp": "2026-03-15T14:32:07Z",
  "session_id": "sess_xyz789",
  "user_id": "user_42",
  "messages": [
    {
      "role": "system",
      "content": "[full system prompt text]",
      "hash": "sha256:a1b2c3..."
    },
    {
      "role": "user",
      "content": "[full user message]"
    },
    {
      "role": "assistant",
      "content": "[full model response]"
    }
  ],
  "completion_tokens": 847,
  "prompt_tokens": 1203,
  "model": "model-name-v2.1",
  "safety_scores": {
    "input": { "category": "safe", "score": 0.02 },
    "output": { "category": "safe", "score": 0.05 }
  }
}

Privacy and Compliance Considerations
Logging full prompt and completion content creates significant privacy obligations:
| Concern | Mitigation | Trade-off |
|---|---|---|
| PII in user inputs | PII detection and redaction pipeline before storage | Redacted content may lose forensic value |
| Regulatory compliance | Data retention policies, access controls, encryption at rest | Adds infrastructure complexity |
| Storage costs | Tiered storage with hot/warm/cold layers; retention limits | Older data may not be available for investigation |
| Access control | Role-based access; audit logging of log access | Adds operational overhead |
| User consent | Clear terms of service regarding logging | Legal review required |
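The redaction-before-storage mitigation can be sketched as below. The patterns are deliberately minimal and illustrative; a production pipeline would use a dedicated PII-detection service rather than hand-written regexes. Typed placeholders are used so the log still records *what kind* of data was present, which preserves some forensic value after redaction.

```python
import re

# Illustrative patterns only -- real PII detection needs far more
# coverage (names, addresses, phone numbers, credit cards, ...).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII with typed placeholders before the
    prompt/completion text is written to storage."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

print(redact("Contact alice@example.com, SSN 123-45-6789"))
# → Contact [REDACTED:EMAIL], SSN [REDACTED:SSN]
```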
See Prompt Log Forensics for investigation techniques.
Tool Call Logs
In agentic AI systems, the model invokes external tools -- search engines, databases, APIs, code executors, file systems. Tool call logs are critical for determining what real-world actions the model took during an incident.
Key Fields
| Field | Description | Forensic Use |
|---|---|---|
| tool_name | Which tool was invoked | Identify unauthorized tool usage |
| parameters | Arguments passed to the tool | Detect parameter manipulation or data exfiltration attempts |
| result | Tool output returned to the model | Determine what information the model received |
| authorization | Whether the call was authorized | Identify permission bypass |
| execution_time | How long the tool call took | Detect anomalous execution patterns |
| parent_request_id | The inference request that triggered the call | Correlate tool calls with prompts |
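The parent_request_id field is what makes chain reconstruction possible. A minimal sketch, assuming tool call records parsed into dicts with the fields above, groups each request's tool invocations in execution order so an investigator can see what a single prompt caused the agent to do:

```python
from collections import defaultdict

def tool_calls_by_request(tool_logs):
    """Group tool call records under the inference request that
    triggered them, preserving invocation order."""
    chains = defaultdict(list)
    for entry in sorted(tool_logs, key=lambda e: e["timestamp"]):
        chains[entry["parent_request_id"]].append(
            (entry["tool_name"], entry.get("authorization"))
        )
    return dict(chains)

# Hypothetical records: one request reads a file then attempts an
# outbound POST (a classic exfiltration shape), another just searches.
logs = [
    {"parent_request_id": "req_1", "timestamp": 2,
     "tool_name": "file_read", "authorization": "granted"},
    {"parent_request_id": "req_1", "timestamp": 3,
     "tool_name": "http_post", "authorization": "denied"},
    {"parent_request_id": "req_2", "timestamp": 1,
     "tool_name": "search", "authorization": "granted"},
]
print(tool_calls_by_request(logs))
```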
See Tool Call Forensics for investigation techniques.
Embedding and Retrieval Logs
RAG systems generate query logs when they search vector databases for relevant documents. These logs reveal what information the model had access to and what data may have been exposed.
Key Fields
| Field | Description | Forensic Use |
|---|---|---|
| query_embedding | The vector representation of the query | Detect adversarial embedding manipulation |
| query_text | The text that was embedded for retrieval | Understand what the model was looking for |
| retrieved_doc_ids | IDs of documents returned | Determine what data was in the model's context |
| similarity_scores | Relevance scores for each result | Detect anomalous retrieval (low-relevance documents being returned) |
| collection | Which document collection was queried | Identify cross-tenant retrieval in multi-tenant systems |
| filters_applied | Any metadata filters used | Determine if access controls were enforced |
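Two of the checks these fields enable -- low-relevance retrieval and cross-tenant access -- can be sketched as a single pass over the retrieval log. The record shape, score threshold, and collection names below are illustrative assumptions, not a fixed schema:

```python
def flag_anomalous_retrievals(retrieval_logs, min_score=0.5,
                              allowed_collections=("tenant_a",)):
    """Flag retrieval events whose best similarity score is
    suspiciously low (possible embedding manipulation pulling in
    irrelevant documents) or that touched a collection outside
    the caller's tenant."""
    findings = []
    for entry in retrieval_logs:
        if max(entry["similarity_scores"], default=0.0) < min_score:
            findings.append((entry["request_id"], "low_relevance"))
        if entry["collection"] not in allowed_collections:
            findings.append((entry["request_id"], "cross_tenant"))
    return findings

events = [
    {"request_id": "req_1", "similarity_scores": [0.12, 0.08],
     "collection": "tenant_a"},
    {"request_id": "req_2", "similarity_scores": [0.91, 0.85],
     "collection": "tenant_b"},
]
print(flag_anomalous_retrievals(events))
# → [('req_1', 'low_relevance'), ('req_2', 'cross_tenant')]
```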
Logging Gaps to Watch For
Common logging gaps that prevent forensic investigation:
| Gap | Impact | Remediation |
|---|---|---|
| No content logging | Cannot determine what was asked or answered | Implement prompt/completion logging with privacy controls |
| No system prompt versioning | Cannot determine what instructions the model was following | Version control all system prompts with timestamps |
| No tool call logging | Cannot determine what actions the model took | Add logging callbacks to agent frameworks |
| No safety classifier scores | Cannot determine if safety systems detected the incident | Log classifier outputs even for "safe" classifications |
| No RAG retrieval logging | Cannot determine what data the model accessed | Log vector database queries with results |
| Short retention periods | Evidence may be deleted before investigation begins | Implement minimum 90-day retention for AI logs |
| No correlation IDs | Cannot link logs across layers | Propagate a request ID through all logging layers |
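When a correlation ID is propagated, closing that last gap lets an investigator merge all layers into one view. A minimal sketch, assuming each layer's records are dicts sharing request_id and timestamp fields (the layer names are illustrative):

```python
def build_timeline(request_id, *log_sources):
    """Merge records from several log layers into a single
    chronologically ordered timeline for one request_id."""
    events = [
        rec
        for source in log_sources
        for rec in source
        if rec.get("request_id") == request_id
    ]
    return sorted(events, key=lambda e: e["timestamp"])

# Hypothetical per-layer records for two requests.
gateway = [{"request_id": "req_1", "timestamp": 1, "layer": "gateway"}]
inference = [{"request_id": "req_1", "timestamp": 2, "layer": "inference"},
             {"request_id": "req_2", "timestamp": 2, "layer": "inference"}]
tools = [{"request_id": "req_1", "timestamp": 3, "layer": "tool_call"}]

timeline = build_timeline("req_1", gateway, inference, tools)
print([e["layer"] for e in timeline])  # → ['gateway', 'inference', 'tool_call']
```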
Log Storage and Retention
| Log Type | Recommended Retention | Storage Tier | Access Control |
|---|---|---|---|
| Inference metadata | 1 year | Warm (indexed, queryable) | Engineering + Security |
| Prompt/completion content | 90 days hot, 1 year cold | Hot → Cold migration | Security + Legal only |
| Tool call logs | 1 year | Warm | Engineering + Security |
| Embedding/retrieval logs | 90 days | Warm | Engineering + Security |
| Safety classifier logs | 1 year | Warm | Security |
Section Overview
| Subsection | Focus | Key Questions Answered |
|---|---|---|
| Inference Log Analysis | Detecting anomalies in inference metadata | Are there unusual patterns in token counts, latency, or request volume? |
| Prompt Log Forensics | Investigating prompt and completion content | What attack was used? How did the model respond? What was the attack chain? |
| Tool Call Forensics | Investigating agent tool usage | Did the model take unauthorized actions? Was data exfiltrated through tools? |
Related Topics
- Evidence Preservation -- preserving logs as forensic evidence
- Triage Procedures -- initial log review during triage
- Recon & Tradecraft -- understanding what attackers look for in AI systems