AI Honeypot Forensics
Designing and operating AI honeypots for threat intelligence collection, and forensic analysis of captured adversarial interactions.
Overview
AI honeypots are deliberately vulnerable AI systems deployed to attract, detect, and study adversarial activity. Like their traditional cybersecurity counterparts, AI honeypots serve as sensors that provide early warning of threats and detailed intelligence on attacker tactics, techniques, and procedures (TTPs). Unlike production systems, where the goal is to prevent attacks, honeypots are designed to observe attacks in a controlled environment.
The unique characteristics of AI systems make them particularly well-suited to honeypot operations. AI systems naturally generate rich interaction data (full conversation logs, input/output pairs, behavioral metrics) that provides forensic detail far exceeding what a typical network honeypot captures. An attacker interacting with an AI honeypot reveals not just their technical capabilities but their objectives, their understanding of AI systems, and their operational security practices.
This article covers the design, deployment, and forensic analysis of AI honeypots, from architecture decisions through data collection, attack classification, and threat intelligence production. We draw on MITRE ATLAS for attack classification and established deception defense principles from traditional cybersecurity adapted for AI-specific threats.
AI Honeypot Design
Interaction Fidelity Levels
AI honeypots can be designed at different fidelity levels, trading operational cost and risk against intelligence value:
| Fidelity Level | Description | Intelligence Value | Cost/Risk |
|---|---|---|---|
| Low | Static responses or simple rule-based bot | Captures initial probe patterns | Low cost, low risk |
| Medium | Small language model with limited capabilities | Captures attack payloads and multi-turn strategies | Moderate cost |
| High | Full-capability LLM with realistic system prompt | Captures sophisticated multi-step attacks | High cost, requires safeguards |
| Production-mirror | Copy of production system with modified guardrails | Captures attacks targeting specific application logic | Highest cost and risk |
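The limits a deployment enforces typically scale with fidelity. A minimal sketch of per-level defaults (the numeric values are illustrative choices, not from any standard):

```python
# Illustrative per-level interaction limits; real deployments tune these
# against their own cost and risk budgets.
FIDELITY_DEFAULTS = {
    "low": {"max_conversation_turns": 5, "max_token_output": 256},
    "medium": {"max_conversation_turns": 20, "max_token_output": 1024},
    "high": {"max_conversation_turns": 50, "max_token_output": 4096},
    "production-mirror": {"max_conversation_turns": 50, "max_token_output": 4096},
}

def limits_for(fidelity_level: str) -> dict:
    """Return interaction limits for a fidelity level, failing loudly on typos."""
    if fidelity_level not in FIDELITY_DEFAULTS:
        raise ValueError(f"Unknown fidelity level: {fidelity_level}")
    return dict(FIDELITY_DEFAULTS[fidelity_level])
```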
Architecture
"""
AI honeypot system for threat intelligence collection.
Deploys a monitored AI system designed to attract and record
adversarial interactions while maintaining safety controls.
"""
import hashlib
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Any
@dataclass
class HoneypotConfig:
"""Configuration for an AI honeypot deployment."""
honeypot_id: str
    fidelity_level: str  # "low", "medium", "high", or "production-mirror"
persona: str # What the honeypot pretends to be
model_id: str
system_prompt: str
deliberate_vulnerabilities: list[str] # What weaknesses are simulated
safety_guardrails: list[str] # Hard safety limits that cannot be bypassed
max_conversation_turns: int = 50
max_token_output: int = 4096
enabled: bool = True
@dataclass
class HoneypotInteraction:
"""A complete interaction captured by the honeypot."""
interaction_id: str
honeypot_id: str
start_timestamp: float
end_timestamp: float | None = None
actor_ip: str | None = None
actor_user_agent: str | None = None
actor_api_key_hash: str | None = None
turns: list[dict] = field(default_factory=list)
attack_classification: dict | None = None
metadata: dict[str, Any] = field(default_factory=dict)
class AIHoneypot:
"""
AI honeypot that records adversarial interactions while
maintaining safety constraints.
"""
def __init__(
self,
config: HoneypotConfig,
model_fn, # Function that generates model responses
log_dir: str,
):
self.config = config
self.model_fn = model_fn
self.log_dir = Path(log_dir)
self.log_dir.mkdir(parents=True, exist_ok=True)
self.active_interactions: dict[str, HoneypotInteraction] = {}
def start_interaction(
self,
actor_ip: str | None = None,
actor_user_agent: str | None = None,
actor_api_key: str | None = None,
) -> str:
"""Start a new honeypot interaction and return the interaction ID."""
interaction_id = str(uuid.uuid4())
interaction = HoneypotInteraction(
interaction_id=interaction_id,
honeypot_id=self.config.honeypot_id,
start_timestamp=time.time(),
actor_ip=actor_ip,
actor_user_agent=actor_user_agent,
actor_api_key_hash=(
hashlib.sha256(actor_api_key.encode()).hexdigest()[:16]
if actor_api_key else None
),
)
self.active_interactions[interaction_id] = interaction
return interaction_id
def process_turn(
self,
interaction_id: str,
user_input: str,
) -> str:
"""
Process a single conversation turn in the honeypot.
Records the input, generates a response (potentially with
simulated vulnerabilities), and records the output.
"""
interaction = self.active_interactions.get(interaction_id)
if not interaction:
raise ValueError(f"Unknown interaction: {interaction_id}")
        # Each conversation turn appends two records (user and assistant)
        if len(interaction.turns) // 2 >= self.config.max_conversation_turns:
            return "I'm sorry, I need to end this conversation. Please start a new session."
# Record the user input with full forensic detail
turn_record = {
"turn_number": len(interaction.turns) + 1,
"timestamp": time.time(),
"role": "user",
"content": user_input,
"content_hash": hashlib.sha256(user_input.encode()).hexdigest(),
"content_length": len(user_input),
}
        # Check hard safety guardrails (these are never relaxed)
safety_check = self._check_hard_safety(user_input)
if safety_check["blocked"]:
response = "I'm not able to help with that request."
turn_record["safety_blocked"] = True
turn_record["safety_reason"] = safety_check["reason"]
else:
# Generate response with simulated vulnerabilities
response = self._generate_response(interaction, user_input)
# Record the response
response_record = {
"turn_number": len(interaction.turns) + 1,
"timestamp": time.time(),
"role": "assistant",
"content": response,
"content_hash": hashlib.sha256(response.encode()).hexdigest(),
"content_length": len(response),
}
interaction.turns.append(turn_record)
interaction.turns.append(response_record)
# Persist interaction state after each turn
self._save_interaction(interaction)
return response
def _generate_response(
self,
interaction: HoneypotInteraction,
user_input: str,
) -> str:
"""
Generate a model response, optionally simulating vulnerabilities.
The honeypot may deliberately appear to "fall for" certain
attack techniques to observe the attacker's follow-up behavior.
"""
# Build conversation history for the model
messages = [{"role": "system", "content": self.config.system_prompt}]
for turn in interaction.turns:
messages.append({
"role": turn["role"],
"content": turn["content"],
})
messages.append({"role": "user", "content": user_input})
        # Call the model; truncating by character count here is a crude
        # proxy for a token limit
        response = self.model_fn(messages)
        return response[:self.config.max_token_output]
def _check_hard_safety(self, user_input: str) -> dict:
"""
Check hard safety guardrails that are never bypassed.
Even though this is a honeypot, we maintain certain
absolute safety limits to prevent the honeypot from
being weaponized against real targets.
"""
        # These illustrative substring patterns trigger immediate blocking
        # regardless of honeypot configuration; a production deployment
        # would use a proper safety classifier rather than keyword matching
hard_blocks = [
"real person's", # Targeting real individuals
"actual address", # Real-world targeting
"child", # CSAM-related
]
input_lower = user_input.lower()
for pattern in hard_blocks:
if pattern in input_lower:
return {"blocked": True, "reason": f"Hard safety: {pattern}"}
return {"blocked": False, "reason": None}
def end_interaction(self, interaction_id: str) -> HoneypotInteraction:
"""End an interaction and finalize the record."""
interaction = self.active_interactions.pop(interaction_id, None)
if not interaction:
raise ValueError(f"Unknown interaction: {interaction_id}")
interaction.end_timestamp = time.time()
self._save_interaction(interaction)
return interaction
def _save_interaction(self, interaction: HoneypotInteraction) -> None:
path = self.log_dir / f"{interaction.interaction_id}.json"
        path.write_text(json.dumps(asdict(interaction), default=str, indent=2))
Forensic Analysis of Captured Interactions
Attack Classification
Captured honeypot interactions must be classified to produce useful threat intelligence.
class HoneypotForensicAnalyzer:
"""
Analyze captured honeypot interactions to classify attacks
and extract threat intelligence.
"""
def classify_interaction(
self,
interaction: HoneypotInteraction,
) -> dict:
"""
Classify the attack techniques observed in a honeypot interaction.
Maps observed behaviors to MITRE ATLAS techniques and
assigns an overall threat assessment.
"""
user_turns = [
t for t in interaction.turns if t["role"] == "user"
]
if not user_turns:
return {"classification": "no_user_input"}
techniques_observed = []
all_content = " ".join(t["content"] for t in user_turns).lower()
# Prompt injection detection
injection_indicators = [
"ignore previous", "ignore above", "disregard",
"new instructions", "you are now", "pretend you are",
"act as", "roleplay as", "jailbreak",
"dan", "developer mode",
]
injection_count = sum(
1 for ind in injection_indicators if ind in all_content
)
if injection_count > 0:
techniques_observed.append({
"technique": "Prompt Injection",
"atlas_id": "AML.T0051",
"confidence": min(0.5 + injection_count * 0.15, 1.0),
"indicators": [
ind for ind in injection_indicators if ind in all_content
],
})
# System prompt extraction
extraction_indicators = [
"system prompt", "initial instructions", "repeat your instructions",
"what are your rules", "what were you told", "show me your prompt",
"print your system", "reveal your instructions",
]
extraction_count = sum(
1 for ind in extraction_indicators if ind in all_content
)
if extraction_count > 0:
techniques_observed.append({
"technique": "System Prompt Extraction",
"atlas_id": "AML.T0046",
"confidence": min(0.5 + extraction_count * 0.2, 1.0),
"indicators": [
ind for ind in extraction_indicators if ind in all_content
],
})
# Data exfiltration probing
exfil_indicators = [
"training data", "what data were you trained on",
"list users", "show me other conversations",
"database", "api key", "secret", "password",
"internal", "confidential",
]
exfil_count = sum(
1 for ind in exfil_indicators if ind in all_content
)
if exfil_count > 0:
techniques_observed.append({
"technique": "Data Exfiltration Probing",
"atlas_id": "AML.T0024",
"confidence": min(0.4 + exfil_count * 0.15, 1.0),
"indicators": [
ind for ind in exfil_indicators if ind in all_content
],
})
# Multi-turn escalation analysis
escalation = self._analyze_escalation_pattern(user_turns)
# Sophistication assessment
sophistication = self._assess_sophistication(user_turns, techniques_observed)
return {
"interaction_id": interaction.interaction_id,
"techniques_observed": techniques_observed,
"technique_count": len(techniques_observed),
"escalation_pattern": escalation,
"sophistication": sophistication,
"turn_count": len(user_turns),
"duration_seconds": (
(interaction.end_timestamp or time.time())
- interaction.start_timestamp
),
"overall_threat_level": self._compute_threat_level(
techniques_observed, escalation, sophistication
),
}
def _analyze_escalation_pattern(self, user_turns: list[dict]) -> dict:
"""
Analyze how the attacker escalated their approach over time.
Sophisticated attackers often start with benign probing
and gradually escalate to more aggressive techniques.
"""
if len(user_turns) < 3:
return {"pattern": "insufficient_data", "turns": len(user_turns)}
# Simple heuristic: track average content length per turn segment
third = len(user_turns) // 3
early = user_turns[:third]
middle = user_turns[third:2*third]
late = user_turns[2*third:]
early_avg_len = sum(t["content_length"] for t in early) / max(len(early), 1)
middle_avg_len = sum(t["content_length"] for t in middle) / max(len(middle), 1)
late_avg_len = sum(t["content_length"] for t in late) / max(len(late), 1)
if early_avg_len < middle_avg_len < late_avg_len:
pattern = "escalating"
elif early_avg_len > late_avg_len:
pattern = "front_loaded"
else:
pattern = "stable"
return {
"pattern": pattern,
"early_avg_length": round(early_avg_len, 1),
"late_avg_length": round(late_avg_len, 1),
}
def _assess_sophistication(
self,
user_turns: list[dict],
techniques: list[dict],
) -> str:
"""Assess the attacker's sophistication level."""
if not techniques:
return "benign_or_unrecognized"
# Factors indicating sophistication:
# - Multiple techniques used
# - Multi-turn strategies
# - Novel or uncommon attack patterns
score = 0
score += min(len(techniques), 3) # Technique diversity
score += min(len(user_turns) // 5, 3) # Persistence
avg_confidence = sum(t["confidence"] for t in techniques) / len(techniques)
score += 2 if avg_confidence > 0.7 else 1 if avg_confidence > 0.4 else 0
if score >= 6:
return "advanced"
if score >= 3:
return "intermediate"
return "basic"
def _compute_threat_level(
self,
techniques: list[dict],
escalation: dict,
sophistication: str,
) -> str:
        if not techniques:
            return "NONE"
        if sophistication == "advanced" and len(techniques) >= 3:
            return "CRITICAL"
        if sophistication == "advanced" or len(techniques) >= 2:
            return "HIGH"
        # Use the escalation pattern to separate MEDIUM from LOW, so the
        # LOW branch is actually reachable
        if sophistication == "intermediate" or escalation.get("pattern") == "escalating":
            return "MEDIUM"
        return "LOW"
Threat Intelligence Production
def produce_threat_intelligence(
classified_interactions: list[dict],
reporting_period_days: int = 7,
) -> dict:
"""
Aggregate classified honeypot interactions into a threat
intelligence report.
"""
total = len(classified_interactions)
if total == 0:
return {"status": "no_data"}
# Technique frequency analysis
technique_counts: dict[str, int] = {}
sophistication_counts: dict[str, int] = {}
threat_level_counts: dict[str, int] = {}
for interaction in classified_interactions:
for tech in interaction.get("techniques_observed", []):
name = tech["technique"]
technique_counts[name] = technique_counts.get(name, 0) + 1
soph = interaction.get("sophistication", "unknown")
sophistication_counts[soph] = sophistication_counts.get(soph, 0) + 1
level = interaction.get("overall_threat_level", "NONE")
threat_level_counts[level] = threat_level_counts.get(level, 0) + 1
# Sort techniques by frequency
sorted_techniques = sorted(
technique_counts.items(), key=lambda x: x[1], reverse=True
)
return {
"reporting_period_days": reporting_period_days,
"total_interactions": total,
"interactions_with_attacks": sum(
1 for i in classified_interactions
if i.get("technique_count", 0) > 0
),
"attack_rate": sum(
1 for i in classified_interactions
if i.get("technique_count", 0) > 0
) / total,
"top_techniques": [
{"technique": name, "count": count, "percentage": round(count / total * 100, 1)}
for name, count in sorted_techniques[:10]
],
"sophistication_distribution": sophistication_counts,
"threat_level_distribution": threat_level_counts,
"recommendations": _generate_recommendations(sorted_techniques, threat_level_counts),
}
def _generate_recommendations(
techniques: list[tuple[str, int]],
threat_levels: dict[str, int],
) -> list[str]:
"""Generate defensive recommendations from honeypot intelligence."""
recommendations = []
if techniques:
top_technique = techniques[0][0]
recommendations.append(
f"Highest observed technique: {top_technique}. "
f"Prioritize defenses against this attack class."
)
critical_high = threat_levels.get("CRITICAL", 0) + threat_levels.get("HIGH", 0)
total = sum(threat_levels.values())
if total > 0 and critical_high / total > 0.2:
recommendations.append(
"Over 20% of interactions show high/critical threat level. "
"Consider increasing monitoring on production systems."
)
    return recommendations
Operational Considerations
Safety and Ethics
AI honeypots raise important safety considerations:
- Output safety: The honeypot must not generate genuinely harmful content even when "successfully attacked." Hard safety guardrails must remain active.
- Scope limitation: Honeypots should be clearly scoped. Avoid creating systems that could be used to harm real individuals.
- Legal review: Consult legal counsel on the legality of deception-based defense in your jurisdiction.
- Data handling: Captured interaction data may contain sensitive information about attackers or their targets. Handle with appropriate access controls.
- Ethical boundaries: Do not use honeypots to entrap users into providing self-incriminating information.
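The data-handling point deserves particular care: captured prompts routinely contain credentials, addresses, or other sensitive strings. One approach is to redact obvious secrets before interaction records are persisted. A sketch with hypothetical regex patterns (a real deployment would use a vetted secret-scanning or PII-detection library, not ad-hoc regexes):

```python
import re

# Hypothetical redaction patterns; illustrative only.
REDACTION_PATTERNS = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\b(?:sk|api)[-_][A-Za-z0-9]{16,}\b"), "[API_KEY]"),
]

def redact_for_storage(text: str) -> str:
    """Replace likely-sensitive substrings before an interaction is persisted."""
    for pattern, placeholder in REDACTION_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```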
Deployment Strategies
| Strategy | Description | Best For |
|---|---|---|
| External-facing | Honeypot accessible on the public internet | Capturing opportunistic attacks and automated scanners |
| Internal canary | Honeypot deployed on internal network, appearing as a sensitive AI system | Detecting insider threats and lateral movement |
| Production-adjacent | Honeypot that mirrors a production system's interface | Capturing targeted attacks against specific applications |
| Research-grade | High-interaction honeypot for detailed TTP analysis | Understanding sophisticated adversaries |
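For the internal-canary strategy, the honeypot's value comes from the fact that nothing legitimate should ever contact it, so any interaction is inherently suspicious. A minimal sketch of canary bookkeeping (the identifier format and record fields are illustrative):

```python
import secrets
import time

class CanaryRegistry:
    """Track canary honeypot identities and record any contact with them.

    Any interaction with a canary is by definition suspicious, since no
    legitimate workflow should reference these systems.
    """
    def __init__(self) -> None:
        self._canaries: dict[str, dict] = {}
        self.alerts: list[dict] = []

    def register(self, persona: str) -> str:
        """Create a canary identity (e.g., a fake internal service name)."""
        canary_id = f"ai-svc-{secrets.token_hex(4)}"
        self._canaries[canary_id] = {"persona": persona, "created": time.time()}
        return canary_id

    def record_contact(self, canary_id: str, source: str) -> bool:
        """Return True (and raise an alert) if the contacted id is a canary."""
        if canary_id in self._canaries:
            self.alerts.append({
                "canary_id": canary_id,
                "source": source,
                "timestamp": time.time(),
            })
            return True
        return False
```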
Distinguishing Honeypot from Production
Attackers may attempt to detect honeypots. Common giveaways include:
- Unrealistically permissive behavior (too easy to "jailbreak")
- Inconsistent model capabilities versus claimed model identity
- Network characteristics that reveal a different infrastructure
- Response latency patterns inconsistent with production serving
A well-designed honeypot balances believability with safety. It should be realistic enough to fool attackers for multiple conversation turns but maintain hard limits on harmful outputs.
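The latency giveaway above can be mitigated by shaping honeypot response times to resemble autoregressive serving: a time-to-first-token baseline plus per-token decode time, with jitter. A sketch (all constants are illustrative, not measurements of any real serving stack):

```python
import random

def production_like_delay(
    output_tokens: int,
    base_latency_s: float = 0.35,   # illustrative time-to-first-token
    per_token_s: float = 0.02,      # illustrative per-token decode time
    jitter_frac: float = 0.15,      # +/- fractional jitter
) -> float:
    """Compute a response delay that mimics autoregressive LLM serving."""
    delay = base_latency_s + per_token_s * output_tokens
    jitter = delay * jitter_frac * (2 * random.random() - 1)
    return max(delay + jitter, 0.0)
```

A low-fidelity honeypot that answers every request in a flat 50 ms is easy to fingerprint; delaying in proportion to output length removes one of the cheaper detection signals.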
Metrics and Evaluation
Evaluate honeypot effectiveness using:
- Capture rate: Percentage of real adversarial interactions that engage beyond initial probing
- Dwell time: Average interaction length for adversarial sessions (longer is better, indicating believability)
- Technique coverage: Breadth of attack techniques observed
- Intelligence actionability: Percentage of captured intelligence that leads to defensive improvements
- False positive rate: Benign users who accidentally interact with the honeypot
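Several of these metrics fall directly out of the classifier output. A sketch that computes capture rate and mean dwell time from the classification dicts produced by the forensic analyzer (the more-than-two-turn engagement threshold is an illustrative choice):

```python
def honeypot_effectiveness(classified: list[dict]) -> dict:
    """Compute capture rate and mean dwell time from classified interactions.

    Reads the 'technique_count', 'turn_count', and 'duration_seconds'
    fields emitted by the forensic classifier.
    """
    adversarial = [c for c in classified if c.get("technique_count", 0) > 0]
    if not adversarial:
        return {"adversarial_sessions": 0, "capture_rate": 0.0,
                "mean_dwell_seconds": 0.0}
    # "Engaged" sessions went beyond initial probing (illustrative threshold)
    engaged = [c for c in adversarial if c.get("turn_count", 0) > 2]
    return {
        "adversarial_sessions": len(adversarial),
        "capture_rate": len(engaged) / len(adversarial),
        "mean_dwell_seconds": sum(
            c.get("duration_seconds", 0.0) for c in adversarial
        ) / len(adversarial),
    }
```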
References
- MITRE ATLAS. (2024). Adversarial Threat Landscape for Artificial Intelligence Systems. https://atlas.mitre.org/
- Spitzner, L. (2003). Honeypots: Tracking Hackers. Addison-Wesley Professional.
- NIST. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. https://doi.org/10.6028/NIST.AI.100-1
- OWASP. (2025). OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/