AI Honeypot Forensics
Designing and operating AI honeypots for threat intelligence collection, and forensic analysis of captured adversarial interactions.
Overview
AI honeypots are deliberately vulnerable AI systems deployed to attract, detect, and study adversarial activity. Like their traditional cybersecurity counterparts, AI honeypots serve as sensors that provide early warning of threats and detailed intelligence on attacker tactics, techniques, and procedures (TTPs). Unlike production systems, where the goal is to prevent attacks, honeypots are designed to observe attacks in a controlled environment.
The unique characteristics of AI systems make them particularly well-suited to honeypot operations. AI systems naturally generate rich interaction data (full conversation logs, input/output pairs, behavioral metrics) that provides forensic detail far exceeding what a typical network honeypot captures. Attackers interacting with an AI honeypot reveal not just their technical capabilities but also their objectives, their understanding of AI systems, and their operational security practices.
This article covers the design, deployment, and forensic analysis of AI honeypots, from architecture decisions through data collection, attack classification, and threat intelligence production. We draw on MITRE ATLAS for attack classification and on established deception-defense principles from traditional cybersecurity, adapted for AI-specific threats.
AI Honeypot Design
Interaction Fidelity Levels
AI honeypots can be designed at different fidelity levels, trading operational cost and risk against intelligence value:
| Fidelity Level | Description | Intelligence Value | Cost/Risk |
|---|---|---|---|
| Low | Static responses or simple rule-based bot | Captures initial probe patterns | Low cost, low risk |
| Medium | Small language model with limited capabilities | Captures attack payloads and multi-turn strategies | Moderate cost |
| High | Full-capability LLM with a realistic system prompt | Captures sophisticated multi-step attacks | High cost, requires safeguards |
| Production-mirror | Copy of a production system with modified guardrails | Captures attacks targeting specific application logic | Highest cost and risk |
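As a rough sketch of how these trade-offs might translate into deployment settings, the hypothetical helper below maps a fidelity level to per-level limits. The level names match the table, but the specific turn and output budgets are illustrative assumptions, not prescriptions from any particular deployment:

```python
# Illustrative mapping from fidelity level to honeypot limits.
# The turn/output budgets here are assumptions, not prescriptions.
FIDELITY_PRESETS = {
    "low":    {"max_turns": 5,  "max_output_chars": 256,  "uses_llm": False},
    "medium": {"max_turns": 20, "max_output_chars": 1024, "uses_llm": True},
    "high":   {"max_turns": 50, "max_output_chars": 4096, "uses_llm": True},
}

def preset_for(fidelity_level: str) -> dict:
    """Return deployment limits for a fidelity level, defaulting to 'low'."""
    return FIDELITY_PRESETS.get(fidelity_level, FIDELITY_PRESETS["low"])
```

Defaulting unknown levels to "low" keeps a misconfigured deployment on the cheapest, lowest-risk footing.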
Architecture
"""
AI honeypot system for threat intelligence collection.
Deploys a monitored AI system designed to attract and record
adversarial interactions while maintaining safety controls.
"""
import hashlib
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Any
@dataclass
class HoneypotConfig:
    """Configuration for an AI honeypot deployment."""
    honeypot_id: str
    fidelity_level: str  # "low", "medium", or "high"
    persona: str  # What the honeypot pretends to be
    model_id: str
    system_prompt: str
    deliberate_vulnerabilities: list[str]  # Which weaknesses are simulated
    safety_guardrails: list[str]  # Hard safety limits that cannot be bypassed
    max_conversation_turns: int = 50
    max_token_output: int = 4096
    enabled: bool = True
@dataclass
class HoneypotInteraction:
    """A complete interaction captured by the honeypot."""
    interaction_id: str
    honeypot_id: str
    start_timestamp: float
    end_timestamp: float | None = None
    actor_ip: str | None = None
    actor_user_agent: str | None = None
    actor_api_key_hash: str | None = None
    turns: list[dict] = field(default_factory=list)
    attack_classification: dict | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
class AIHoneypot:
    """
    AI honeypot that records adversarial interactions while
    maintaining safety constraints.
    """

    def __init__(
        self,
        config: HoneypotConfig,
        model_fn,  # Function that generates model responses
        log_dir: str,
    ):
        self.config = config
        self.model_fn = model_fn
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.active_interactions: dict[str, HoneypotInteraction] = {}

    def start_interaction(
        self,
        actor_ip: str | None = None,
        actor_user_agent: str | None = None,
        actor_api_key: str | None = None,
    ) -> str:
        """Start a new honeypot interaction and return the interaction ID."""
        interaction_id = str(uuid.uuid4())
        interaction = HoneypotInteraction(
            interaction_id=interaction_id,
            honeypot_id=self.config.honeypot_id,
            start_timestamp=time.time(),
            actor_ip=actor_ip,
            actor_user_agent=actor_user_agent,
            actor_api_key_hash=(
                hashlib.sha256(actor_api_key.encode()).hexdigest()[:16]
                if actor_api_key else None
            ),
        )
        self.active_interactions[interaction_id] = interaction
        return interaction_id
    def process_turn(
        self,
        interaction_id: str,
        user_input: str,
    ) -> str:
        """
        Process a single conversation turn in the honeypot.

        Records the input, generates a response (potentially with
        simulated vulnerabilities), and records the output.
        """
        interaction = self.active_interactions.get(interaction_id)
        if not interaction:
            raise ValueError(f"Unknown interaction: {interaction_id}")
        if len(interaction.turns) >= self.config.max_conversation_turns:
            return "I'm sorry, I need to end this conversation. Please start a new session."

        # Record the user input with full forensic detail
        turn_record = {
            "turn_number": len(interaction.turns) + 1,
            "timestamp": time.time(),
            "role": "user",
            "content": user_input,
            "content_hash": hashlib.sha256(user_input.encode()).hexdigest(),
            "content_length": len(user_input),
        }

        # Check hard safety guardrails (these never yield)
        safety_check = self._check_hard_safety(user_input)
        if safety_check["blocked"]:
            response = "I'm not able to help with that request."
            turn_record["safety_blocked"] = True
            turn_record["safety_reason"] = safety_check["reason"]
        else:
            # Generate a response with simulated vulnerabilities
            response = self._generate_response(interaction, user_input)

        # Record the response (same turn number as the user input it answers)
        response_record = {
            "turn_number": len(interaction.turns) + 1,
            "timestamp": time.time(),
            "role": "assistant",
            "content": response,
            "content_hash": hashlib.sha256(response.encode()).hexdigest(),
            "content_length": len(response),
        }
        interaction.turns.append(turn_record)
        interaction.turns.append(response_record)

        # Persist interaction state after each turn
        self._save_interaction(interaction)
        return response
    def _generate_response(
        self,
        interaction: HoneypotInteraction,
        user_input: str,
    ) -> str:
        """
        Generate a model response, optionally simulating vulnerabilities.

        The honeypot may deliberately appear to "fall for" certain
        attack techniques to observe the attacker's follow-up behavior.
        """
        # Build the conversation history for the model
        messages = [{"role": "system", "content": self.config.system_prompt}]
        for turn in interaction.turns:
            messages.append({
                "role": turn["role"],
                "content": turn["content"],
            })
        messages.append({"role": "user", "content": user_input})

        # Call the model; truncate the output (a crude character-level cap)
        response = self.model_fn(messages)
        return response[:self.config.max_token_output]
    def _check_hard_safety(self, user_input: str) -> dict:
        """
        Check hard safety guardrails that are never bypassed.

        Even though this is a honeypot, we maintain certain
        absolute safety limits to prevent the honeypot from
        being weaponized against real targets.
        """
        # These patterns result in immediate blocking regardless
        # of honeypot configuration
        hard_blocks = [
            "real person's",   # Targeting real individuals
            "actual address",  # Real-world targeting
            "child",           # CSAM-related
        ]
        input_lower = user_input.lower()
        for pattern in hard_blocks:
            if pattern in input_lower:
                return {"blocked": True, "reason": f"Hard safety: {pattern}"}
        return {"blocked": False, "reason": None}
    def end_interaction(self, interaction_id: str) -> HoneypotInteraction:
        """End an interaction and finalize the record."""
        interaction = self.active_interactions.pop(interaction_id, None)
        if not interaction:
            raise ValueError(f"Unknown interaction: {interaction_id}")
        interaction.end_timestamp = time.time()
        self._save_interaction(interaction)
        return interaction

    def _save_interaction(self, interaction: HoneypotInteraction) -> None:
        path = self.log_dir / f"{interaction.interaction_id}.json"
        path.write_text(json.dumps(asdict(interaction), default=str, indent=2))

Forensic Analysis of Captured Interactions
Attack Classification
Captured honeypot interactions must be classified to produce useful threat intelligence.
import time

class HoneypotForensicAnalyzer:
    """
    Analyze captured honeypot interactions to classify attacks
    and extract threat intelligence.
    """

    def classify_interaction(
        self,
        interaction: HoneypotInteraction,
    ) -> dict:
        """
        Classify the attack techniques observed in a honeypot interaction.

        Maps observed behaviors to MITRE ATLAS techniques and
        assigns an overall threat assessment.
        """
        user_turns = [
            t for t in interaction.turns if t["role"] == "user"
        ]
        if not user_turns:
            return {"classification": "no_user_input"}

        techniques_observed = []
        all_content = " ".join(t["content"] for t in user_turns).lower()

        # Prompt injection detection
        injection_indicators = [
            "ignore previous", "ignore above", "disregard",
            "new instructions", "you are now", "pretend you are",
            "act as", "roleplay as", "jailbreak",
            "dan", "developer mode",
        ]
        injection_count = sum(
            1 for ind in injection_indicators if ind in all_content
        )
        if injection_count > 0:
            techniques_observed.append({
                "technique": "Prompt Injection",
                "atlas_id": "AML.T0051",
                "confidence": min(0.5 + injection_count * 0.15, 1.0),
                "indicators": [
                    ind for ind in injection_indicators if ind in all_content
                ],
            })
        # System prompt extraction
        extraction_indicators = [
            "system prompt", "initial instructions", "repeat your instructions",
            "what are your rules", "what were you told", "show me your prompt",
            "print your system", "reveal your instructions",
        ]
        extraction_count = sum(
            1 for ind in extraction_indicators if ind in all_content
        )
        if extraction_count > 0:
            techniques_observed.append({
                "technique": "System Prompt Extraction",
                "atlas_id": "AML.T0046",
                "confidence": min(0.5 + extraction_count * 0.2, 1.0),
                "indicators": [
                    ind for ind in extraction_indicators if ind in all_content
                ],
            })

        # Data exfiltration probing
        exfil_indicators = [
            "training data", "what data were you trained on",
            "list users", "show me other conversations",
            "database", "api key", "secret", "password",
            "internal", "confidential",
        ]
        exfil_count = sum(
            1 for ind in exfil_indicators if ind in all_content
        )
        if exfil_count > 0:
            techniques_observed.append({
                "technique": "Data Exfiltration Probing",
                "atlas_id": "AML.T0024",
                "confidence": min(0.4 + exfil_count * 0.15, 1.0),
                "indicators": [
                    ind for ind in exfil_indicators if ind in all_content
                ],
            })
        # Multi-turn escalation analysis
        escalation = self._analyze_escalation_pattern(user_turns)

        # Sophistication assessment
        sophistication = self._assess_sophistication(user_turns, techniques_observed)

        return {
            "interaction_id": interaction.interaction_id,
            "techniques_observed": techniques_observed,
            "technique_count": len(techniques_observed),
            "escalation_pattern": escalation,
            "sophistication": sophistication,
            "turn_count": len(user_turns),
            "duration_seconds": (
                (interaction.end_timestamp or time.time())
                - interaction.start_timestamp
            ),
            "overall_threat_level": self._compute_threat_level(
                techniques_observed, escalation, sophistication
            ),
        }
    def _analyze_escalation_pattern(self, user_turns: list[dict]) -> dict:
        """
        Analyze how the attacker escalated their approach over time.

        Sophisticated attackers often start with benign probing
        and gradually escalate to more aggressive techniques.
        """
        if len(user_turns) < 3:
            return {"pattern": "insufficient_data", "turns": len(user_turns)}

        # Simple heuristic: track average content length per turn segment
        third = len(user_turns) // 3
        early = user_turns[:third]
        middle = user_turns[third:2 * third]
        late = user_turns[2 * third:]
        early_avg_len = sum(t["content_length"] for t in early) / max(len(early), 1)
        middle_avg_len = sum(t["content_length"] for t in middle) / max(len(middle), 1)
        late_avg_len = sum(t["content_length"] for t in late) / max(len(late), 1)

        if early_avg_len < middle_avg_len < late_avg_len:
            pattern = "escalating"
        elif early_avg_len > late_avg_len:
            pattern = "front_loaded"
        else:
            pattern = "stable"
        return {
            "pattern": pattern,
            "early_avg_length": round(early_avg_len, 1),
            "late_avg_length": round(late_avg_len, 1),
        }
    def _assess_sophistication(
        self,
        user_turns: list[dict],
        techniques: list[dict],
    ) -> str:
        """Assess the attacker's sophistication level."""
        if not techniques:
            return "benign_or_unrecognized"

        # Factors indicating sophistication:
        # - Multiple techniques used
        # - Multi-turn strategies
        # - Novel or uncommon attack patterns
        score = 0
        score += min(len(techniques), 3)       # Technique diversity
        score += min(len(user_turns) // 5, 3)  # Persistence
        avg_confidence = sum(t["confidence"] for t in techniques) / len(techniques)
        score += 2 if avg_confidence > 0.7 else 1 if avg_confidence > 0.4 else 0

        if score >= 6:
            return "advanced"
        if score >= 3:
            return "intermediate"
        return "basic"
    def _compute_threat_level(
        self,
        techniques: list[dict],
        escalation: dict,
        sophistication: str,
    ) -> str:
        if not techniques:
            return "NONE"
        if sophistication == "advanced" and len(techniques) >= 3:
            return "CRITICAL"
        if sophistication == "advanced" or len(techniques) >= 2:
            return "HIGH"
        # At least one technique was observed at this point
        return "MEDIUM"

Threat Intelligence Production
def produce_threat_intelligence(
    classified_interactions: list[dict],
    reporting_period_days: int = 7,
) -> dict:
    """
    Aggregate classified honeypot interactions into a threat
    intelligence report.
    """
    total = len(classified_interactions)
    if total == 0:
        return {"status": "no_data"}

    # Technique frequency analysis
    technique_counts: dict[str, int] = {}
    sophistication_counts: dict[str, int] = {}
    threat_level_counts: dict[str, int] = {}
    for interaction in classified_interactions:
        for tech in interaction.get("techniques_observed", []):
            name = tech["technique"]
            technique_counts[name] = technique_counts.get(name, 0) + 1
        soph = interaction.get("sophistication", "unknown")
        sophistication_counts[soph] = sophistication_counts.get(soph, 0) + 1
        level = interaction.get("overall_threat_level", "NONE")
        threat_level_counts[level] = threat_level_counts.get(level, 0) + 1

    # Sort techniques by frequency
    sorted_techniques = sorted(
        technique_counts.items(), key=lambda x: x[1], reverse=True
    )
    interactions_with_attacks = sum(
        1 for i in classified_interactions
        if i.get("technique_count", 0) > 0
    )
    return {
        "reporting_period_days": reporting_period_days,
        "total_interactions": total,
        "interactions_with_attacks": interactions_with_attacks,
        "attack_rate": interactions_with_attacks / total,
        "top_techniques": [
            {"technique": name, "count": count, "percentage": round(count / total * 100, 1)}
            for name, count in sorted_techniques[:10]
        ],
        "sophistication_distribution": sophistication_counts,
        "threat_level_distribution": threat_level_counts,
        "recommendations": _generate_recommendations(sorted_techniques, threat_level_counts),
    }
def _generate_recommendations(
    techniques: list[tuple[str, int]],
    threat_levels: dict[str, int],
) -> list[str]:
    """Generate defensive recommendations from honeypot intelligence."""
    recommendations = []
    if techniques:
        top_technique = techniques[0][0]
        recommendations.append(
            f"Most frequently observed technique: {top_technique}. "
            f"Prioritize defenses against this attack class."
        )
    critical_high = threat_levels.get("CRITICAL", 0) + threat_levels.get("HIGH", 0)
    total = sum(threat_levels.values())
    if total > 0 and critical_high / total > 0.2:
        recommendations.append(
            "Over 20% of interactions show a high or critical threat level. "
            "Consider increasing monitoring on production systems."
        )
    return recommendations

Operational Considerations
Safety and Ethics
AI honeypots raise important safety considerations:
- Output safety: The honeypot must not generate genuinely harmful content even when "successfully attacked." Hard safety guardrails must remain active.
- Scope limitation: Honeypots should be clearly scoped. Avoid creating systems that could be used to harm real individuals.
- Legal review: Consult legal counsel on the legality of deception-based defense in your jurisdiction.
- Data handling: Captured interaction data may contain sensitive information about attackers or their targets. Handle it with appropriate access controls.
- Ethical boundaries: Do not use honeypots to entrap users into providing self-incriminating information.
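For the data-handling point in particular, captured transcripts are often exported to partner teams or intelligence feeds. A minimal redaction pass can scrub obvious secrets before sharing; the patterns below are illustrative sketches, far from exhaustive, and a production redactor would need a much broader, better-tested set:

```python
import re

# Illustrative patterns for common sensitive strings. These are
# assumptions for the sketch, not a complete or vetted ruleset.
REDACTION_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD_NUMBER]"),
    (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), "[API_KEY]"),
]

def redact_transcript(text: str) -> str:
    """Replace likely-sensitive substrings before sharing a transcript."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Redaction should happen at export time, not at capture time, so the original forensic record stays intact under stricter access controls.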
Deployment Strategies
| Strategy | Description | Best For |
|---|---|---|
| External-facing | Honeypot accessible on the public internet | Capturing opportunistic attacks and automated scanners |
| Internal canary | Honeypot deployed on internal network, appearing as a sensitive AI system | Detecting insider threats and lateral movement |
| Production-adjacent | Honeypot that mirrors a production system's interface | Capturing targeted attacks against specific applications |
| Research-grade | High-interaction honeypot for detailed TTP analysis | Understanding sophisticated adversaries |
Distinguishing Honeypot from Production
Attackers may attempt to detect honeypots. Common giveaways include:
- Unrealistically permissive behavior (too easy to "jailbreak")
- Inconsistent model capabilities versus claimed model identity
- Network characteristics that reveal a different infrastructure
- Response latency patterns inconsistent with production serving
A well-designed honeypot balances believability with safety. It should be realistic enough to fool attackers for multiple conversation turns while maintaining hard limits on harmful outputs.
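The latency giveaway above can be partly masked by shaping honeypot response timing to resemble production serving. One simple sketch is to pad each response toward a sampled "production-like" latency; the mean and jitter values below are made up for illustration and would need to be measured from the system being mimicked:

```python
import random
import time

def production_like_delay(elapsed_s: float,
                          mean_s: float = 1.2,
                          jitter_s: float = 0.4) -> None:
    """
    Sleep so that total response time approximates a sampled
    production-like latency. The defaults are illustrative
    assumptions, not measured values.
    """
    target = max(0.0, random.gauss(mean_s, jitter_s))
    remaining = target - elapsed_s
    if remaining > 0:
        time.sleep(remaining)
```

Called with the time already spent generating the response, this only sleeps for the shortfall, so a slow model response is never delayed further.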
Metrics and Evaluation
Evaluate honeypot effectiveness using:
- Capture rate: Percentage of real adversarial interactions that engage beyond initial probing
- Dwell time: Average interaction length for adversarial sessions (longer is better, indicating believability)
- Technique coverage: Breadth of attack techniques observed
- Intelligence actionability: Percentage of captured intelligence that leads to defensive improvements
- False positive rate: Benign users who accidentally interact with the honeypot
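Several of these metrics fall straight out of the classifier's output. A small helper, assuming the record format produced by `classify_interaction` above, might compute capture rate and mean adversarial dwell time like this:

```python
def honeypot_metrics(classified: list[dict]) -> dict:
    """
    Compute capture rate and mean adversarial dwell time from
    classified interaction records (using the `technique_count`
    and `duration_seconds` fields sketched earlier).
    """
    if not classified:
        return {"capture_rate": 0.0, "mean_dwell_seconds": 0.0}
    adversarial = [c for c in classified if c.get("technique_count", 0) > 0]
    capture_rate = len(adversarial) / len(classified)
    mean_dwell = (
        sum(c.get("duration_seconds", 0.0) for c in adversarial) / len(adversarial)
        if adversarial else 0.0
    )
    return {"capture_rate": capture_rate, "mean_dwell_seconds": mean_dwell}
```

Dwell time is averaged only over adversarial sessions, since benign traffic would otherwise dilute the believability signal.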
References
- MITRE ATLAS. (2024). Adversarial Threat Landscape for Artificial Intelligence Systems. https://atlas.mitre.org/
- Spitzner, L. (2003). Honeypots: Tracking Hackers. Addison-Wesley Professional.
- NIST. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. https://doi.org/10.6028/NIST.AI.100-1
- OWASP. (2025). OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/