AI Honeypot Forensics
Designing and operating AI honeypots for threat intelligence collection, and forensic analysis of captured adversarial interactions.
Overview
AI honeypots are deliberately vulnerable AI systems deployed to attract, detect, and study adversarial activity. Like their traditional cybersecurity counterparts, AI honeypots serve as sensors that provide early warning of threats and detailed intelligence on attacker tactics, techniques, and procedures (TTPs). Unlike production systems, where the goal is to prevent attacks, honeypots are designed to observe attacks in a controlled environment.
The unique characteristics of AI systems make them particularly well-suited to honeypot operations. AI systems naturally generate rich interaction data (full conversation logs, input/output pairs, behavioral metrics) that provides forensic detail far exceeding what a typical network honeypot captures. An attacker interacting with an AI honeypot reveals not just their technical capabilities but their objectives, their understanding of AI systems, and their operational security practices.
This article covers the design, deployment, and forensic analysis of AI honeypots, from architecture decisions through data collection, attack classification, and threat intelligence production. We draw on MITRE ATLAS for attack classification and established deception defense principles from traditional cybersecurity adapted for AI-specific threats.
AI Honeypot Design
Interaction Fidelity Levels
AI honeypots can be designed at different fidelity levels, trading operational cost and risk against intelligence value:
| Fidelity Level | Description | Intelligence Value | Cost/Risk |
|---|---|---|---|
| Low | Static responses or simple rule-based bot | Captures initial probe patterns | Low cost, low risk |
| Medium | Small language model with limited capabilities | Captures attack payloads and multi-turn strategies | Moderate cost |
| High | Full-capability LLM with realistic system prompt | Captures sophisticated multi-step attacks | High cost, requires safeguards |
| Production-mirror | Copy of production system with modified guardrails | Captures attacks targeting specific application logic | Highest cost and risk |
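The limits a deployment enforces typically scale with fidelity. A minimal sketch of per-level defaults (the numeric values are illustrative choices, not from any standard):

```python
# Illustrative per-level interaction limits; real deployments tune these
# against their own cost and risk budgets.
FIDELITY_DEFAULTS = {
    "low": {"max_conversation_turns": 5, "max_token_output": 256},
    "medium": {"max_conversation_turns": 20, "max_token_output": 1024},
    "high": {"max_conversation_turns": 50, "max_token_output": 4096},
    "production-mirror": {"max_conversation_turns": 50, "max_token_output": 4096},
}

def limits_for(fidelity_level: str) -> dict:
    """Return interaction limits for a fidelity level, failing loudly on typos."""
    if fidelity_level not in FIDELITY_DEFAULTS:
        raise ValueError(f"Unknown fidelity level: {fidelity_level}")
    return dict(FIDELITY_DEFAULTS[fidelity_level])
```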
Architecture
"""
AI honeypot system for threat intelligence collection.
Deploys a monitored AI system designed to attract and record
adversarial interactions while maintaining safety controls.
"""
import hashlib
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Any
@dataclass
class HoneypotConfig:
"""Configuration for an AI honeypot deployment."""
honeypot_id: str
    fidelity_level: str  # "low", "medium", "high", or "production-mirror"
persona: str # What the honeypot pretends to be
model_id: str
system_prompt: str
deliberate_vulnerabilities: list[str] # What weaknesses are simulated
safety_guardrails: list[str] # Hard safety limits that cannot be bypassed
max_conversation_turns: int = 50
max_token_output: int = 4096
enabled: bool = True
@dataclass
class HoneypotInteraction:
"""A complete interaction captured by the honeypot."""
interaction_id: str
honeypot_id: str
start_timestamp: float
end_timestamp: float | None = None
actor_ip: str | None = None
actor_user_agent: str | None = None
actor_api_key_hash: str | None = None
turns: list[dict] = field(default_factory=list)
attack_classification: dict | None = None
metadata: dict[str, Any] = field(default_factory=dict)
class AIHoneypot:
"""
AI honeypot that records adversarial interactions while
maintaining safety constraints.
"""
def __init__(
self,
config: HoneypotConfig,
model_fn, # Function that generates model responses
log_dir: str,
):
self.config = config
self.model_fn = model_fn
self.log_dir = Path(log_dir)
self.log_dir.mkdir(parents=True, exist_ok=True)
self.active_interactions: dict[str, HoneypotInteraction] = {}
def start_interaction(
self,
actor_ip: str | None = None,
actor_user_agent: str | None = None,
actor_api_key: str | None = None,
) -> str:
"""Start a new honeypot interaction and return the interaction ID."""
interaction_id = str(uuid.uuid4())
interaction = HoneypotInteraction(
interaction_id=interaction_id,
honeypot_id=self.config.honeypot_id,
start_timestamp=time.time(),
actor_ip=actor_ip,
actor_user_agent=actor_user_agent,
actor_api_key_hash=(
hashlib.sha256(actor_api_key.encode()).hexdigest()[:16]
if actor_api_key else None
),
)
self.active_interactions[interaction_id] = interaction
return interaction_id
def process_turn(
self,
interaction_id: str,
user_input: str,
) -> str:
"""
Process a single conversation turn in the honeypot.
Records the input, generates a response (potentially with
simulated vulnerabilities), and records the output.
"""
interaction = self.active_interactions.get(interaction_id)
if not interaction:
raise ValueError(f"Unknown interaction: {interaction_id}")
        # Each conversation turn appends two records (user and assistant)
        if len(interaction.turns) // 2 >= self.config.max_conversation_turns:
            return "I'm sorry, I need to end this conversation. Please start a new session."
# Record the user input with full forensic detail
turn_record = {
"turn_number": len(interaction.turns) + 1,
"timestamp": time.time(),
"role": "user",
"content": user_input,
"content_hash": hashlib.sha256(user_input.encode()).hexdigest(),
"content_length": len(user_input),
}
        # Check hard safety guardrails (these are never relaxed)
safety_check = self._check_hard_safety(user_input)
if safety_check["blocked"]:
response = "I'm not able to help with that request."
turn_record["safety_blocked"] = True
turn_record["safety_reason"] = safety_check["reason"]
else:
# Generate response with simulated vulnerabilities
response = self._generate_response(interaction, user_input)
# Record the response
response_record = {
"turn_number": len(interaction.turns) + 1,
"timestamp": time.time(),
"role": "assistant",
"content": response,
"content_hash": hashlib.sha256(response.encode()).hexdigest(),
"content_length": len(response),
}
interaction.turns.append(turn_record)
interaction.turns.append(response_record)
# Persist interaction state after each turn
self._save_interaction(interaction)
return response
def _generate_response(
self,
interaction: HoneypotInteraction,
user_input: str,
) -> str:
"""
Generate a model response, optionally simulating vulnerabilities.
The honeypot may deliberately appear to "fall for" certain
attack techniques to observe the attacker's follow-up behavior.
"""
# Build conversation history for the model
messages = [{"role": "system", "content": self.config.system_prompt}]
for turn in interaction.turns:
messages.append({
"role": turn["role"],
"content": turn["content"],
})
messages.append({"role": "user", "content": user_input})
        # Call the model; truncating by character count here is a crude
        # proxy for a token limit
        response = self.model_fn(messages)
        return response[:self.config.max_token_output]
def _check_hard_safety(self, user_input: str) -> dict:
"""
Check hard safety guardrails that are never bypassed.
Even though this is a honeypot, we maintain certain
absolute safety limits to prevent the honeypot from
being weaponized against real targets.
"""
        # These illustrative substring patterns trigger immediate blocking
        # regardless of honeypot configuration; a production deployment
        # would use a proper safety classifier rather than keyword matching
hard_blocks = [
"real person's", # Targeting real individuals
"actual address", # Real-world targeting
"child", # CSAM-related
]
input_lower = user_input.lower()
for pattern in hard_blocks:
if pattern in input_lower:
return {"blocked": True, "reason": f"Hard safety: {pattern}"}
return {"blocked": False, "reason": None}
def end_interaction(self, interaction_id: str) -> HoneypotInteraction:
"""End an interaction and finalize the record."""
interaction = self.active_interactions.pop(interaction_id, None)
if not interaction:
raise ValueError(f"Unknown interaction: {interaction_id}")
interaction.end_timestamp = time.time()
self._save_interaction(interaction)
return interaction
def _save_interaction(self, interaction: HoneypotInteraction) -> None:
path = self.log_dir / f"{interaction.interaction_id}.json"
        path.write_text(json.dumps(asdict(interaction), default=str, indent=2))
Forensic Analysis of Captured Interactions
Attack Classification
Captured honeypot interactions must be classified to produce useful threat intelligence.
class HoneypotForensicAnalyzer:
"""
Analyze captured honeypot interactions to classify attacks
and extract threat intelligence.
"""
def classify_interaction(
self,
interaction: HoneypotInteraction,
) -> dict:
"""
Classify the attack techniques observed in a honeypot interaction.
Maps observed behaviors to MITRE ATLAS techniques and
assigns an overall threat assessment.
"""
user_turns = [
t for t in interaction.turns if t["role"] == "user"
]
if not user_turns:
return {"classification": "no_user_input"}
techniques_observed = []
all_content = " ".join(t["content"] for t in user_turns).lower()
# Prompt injection detection
injection_indicators = [
"ignore previous", "ignore above", "disregard",
"new instructions", "you are now", "pretend you are",
"act as", "roleplay as", "jailbreak",
"dan", "developer mode",
]
injection_count = sum(
1 for ind in injection_indicators if ind in all_content
)
if injection_count > 0:
techniques_observed.append({
"technique": "Prompt Injection",
"atlas_id": "AML.T0051",
"confidence": min(0.5 + injection_count * 0.15, 1.0),
"indicators": [
ind for ind in injection_indicators if ind in all_content
],
})
# System prompt extraction
extraction_indicators = [
"system prompt", "initial instructions", "repeat your instructions",
"what are your rules", "what were you told", "show me your prompt",
"print your system", "reveal your instructions",
]
extraction_count = sum(
1 for ind in extraction_indicators if ind in all_content
)
if extraction_count > 0:
techniques_observed.append({
"technique": "System Prompt Extraction",
"atlas_id": "AML.T0046",
"confidence": min(0.5 + extraction_count * 0.2, 1.0),
"indicators": [
ind for ind in extraction_indicators if ind in all_content
],
})
# Data exfiltration probing
exfil_indicators = [
"training data", "what data were you trained on",
"list users", "show me other conversations",
"database", "api key", "secret", "password",
"internal", "confidential",
]
exfil_count = sum(
1 for ind in exfil_indicators if ind in all_content
)
if exfil_count > 0:
techniques_observed.append({
"technique": "Data Exfiltration Probing",
"atlas_id": "AML.T0024",
"confidence": min(0.4 + exfil_count * 0.15, 1.0),
"indicators": [
ind for ind in exfil_indicators if ind in all_content
],
})
# Multi-turn escalation analysis
escalation = self._analyze_escalation_pattern(user_turns)
# Sophistication assessment
sophistication = self._assess_sophistication(user_turns, techniques_observed)
return {
"interaction_id": interaction.interaction_id,
"techniques_observed": techniques_observed,
"technique_count": len(techniques_observed),
"escalation_pattern": escalation,
"sophistication": sophistication,
"turn_count": len(user_turns),
"duration_seconds": (
(interaction.end_timestamp or time.time())
- interaction.start_timestamp
),
"overall_threat_level": self._compute_threat_level(
techniques_observed, escalation, sophistication
),
}
def _analyze_escalation_pattern(self, user_turns: list[dict]) -> dict:
"""
Analyze how the attacker escalated their approach over time.
Sophisticated attackers often start with benign probing
and gradually escalate to more aggressive techniques.
"""
if len(user_turns) < 3:
return {"pattern": "insufficient_data", "turns": len(user_turns)}
# Simple heuristic: track average content length per turn segment
third = len(user_turns) // 3
early = user_turns[:third]
middle = user_turns[third:2*third]
late = user_turns[2*third:]
early_avg_len = sum(t["content_length"] for t in early) / max(len(early), 1)
middle_avg_len = sum(t["content_length"] for t in middle) / max(len(middle), 1)
late_avg_len = sum(t["content_length"] for t in late) / max(len(late), 1)
if early_avg_len < middle_avg_len < late_avg_len:
pattern = "escalating"
elif early_avg_len > late_avg_len:
pattern = "front_loaded"
else:
pattern = "stable"
return {
"pattern": pattern,
"early_avg_length": round(early_avg_len, 1),
"late_avg_length": round(late_avg_len, 1),
}
def _assess_sophistication(
self,
user_turns: list[dict],
techniques: list[dict],
) -> str:
"""Assess the attacker's sophistication level."""
if not techniques:
return "benign_or_unrecognized"
# Factors indicating sophistication:
# - Multiple techniques used
# - Multi-turn strategies
# - Novel or uncommon attack patterns
score = 0
score += min(len(techniques), 3) # Technique diversity
score += min(len(user_turns) // 5, 3) # Persistence
avg_confidence = sum(t["confidence"] for t in techniques) / len(techniques)
score += 2 if avg_confidence > 0.7 else 1 if avg_confidence > 0.4 else 0
if score >= 6:
return "advanced"
if score >= 3:
return "intermediate"
return "basic"
def _compute_threat_level(
self,
techniques: list[dict],
escalation: dict,
sophistication: str,
) -> str:
        if not techniques:
            return "NONE"
        if sophistication == "advanced" and len(techniques) >= 3:
            return "CRITICAL"
        if sophistication == "advanced" or len(techniques) >= 2:
            return "HIGH"
        # Use the escalation pattern to separate MEDIUM from LOW, so the
        # LOW branch is actually reachable
        if sophistication == "intermediate" or escalation.get("pattern") == "escalating":
            return "MEDIUM"
        return "LOW"
Threat Intelligence Production
def produce_threat_intelligence(
classified_interactions: list[dict],
reporting_period_days: int = 7,
) -> dict:
"""
Aggregate classified honeypot interactions into a threat
intelligence report.
"""
total = len(classified_interactions)
if total == 0:
return {"status": "no_data"}
# Technique frequency analysis
technique_counts: dict[str, int] = {}
sophistication_counts: dict[str, int] = {}
threat_level_counts: dict[str, int] = {}
for interaction in classified_interactions:
for tech in interaction.get("techniques_observed", []):
name = tech["technique"]
technique_counts[name] = technique_counts.get(name, 0) + 1
soph = interaction.get("sophistication", "unknown")
sophistication_counts[soph] = sophistication_counts.get(soph, 0) + 1
level = interaction.get("overall_threat_level", "NONE")
threat_level_counts[level] = threat_level_counts.get(level, 0) + 1
# Sort techniques by frequency
sorted_techniques = sorted(
technique_counts.items(), key=lambda x: x[1], reverse=True
)
return {
"reporting_period_days": reporting_period_days,
"total_interactions": total,
"interactions_with_attacks": sum(
1 for i in classified_interactions
if i.get("technique_count", 0) > 0
),
"attack_rate": sum(
1 for i in classified_interactions
if i.get("technique_count", 0) > 0
) / total,
"top_techniques": [
{"technique": name, "count": count, "percentage": round(count / total * 100, 1)}
for name, count in sorted_techniques[:10]
],
"sophistication_distribution": sophistication_counts,
"threat_level_distribution": threat_level_counts,
"recommendations": _generate_recommendations(sorted_techniques, threat_level_counts),
}
def _generate_recommendations(
techniques: list[tuple[str, int]],
threat_levels: dict[str, int],
) -> list[str]:
"""Generate defensive recommendations from honeypot intelligence."""
recommendations = []
if techniques:
top_technique = techniques[0][0]
recommendations.append(
f"Highest observed technique: {top_technique}. "
f"Prioritize defenses against this attack class."
)
critical_high = threat_levels.get("CRITICAL", 0) + threat_levels.get("HIGH", 0)
total = sum(threat_levels.values())
if total > 0 and critical_high / total > 0.2:
recommendations.append(
"Over 20% of interactions show high/critical threat level. "
"Consider increasing monitoring on production systems."
)
    return recommendations
Operational Considerations
Safety and Ethics
AI honeypots raise important safety considerations:
- Output safety: The honeypot must not generate genuinely harmful content even when "successfully attacked." Hard safety guardrails must remain active.
- Scope limitation: Honeypots should be clearly scoped. Avoid creating systems that could be used to harm real individuals.
- Legal review: Consult legal counsel on the legality of deception-based defense in your jurisdiction.
- Data handling: Captured interaction data may contain sensitive information about attackers or their targets. Handle with appropriate access controls.
- Ethical boundaries: Do not use honeypots to entrap users into providing self-incriminating information.
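The data-handling point deserves particular care: captured prompts routinely contain credentials, addresses, or other sensitive strings. One approach is to redact obvious secrets before interaction records are persisted. A sketch with hypothetical regex patterns (a real deployment would use a vetted secret-scanning or PII-detection library, not ad-hoc regexes):

```python
import re

# Hypothetical redaction patterns; illustrative only.
REDACTION_PATTERNS = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
    (re.compile(r"\b(?:sk|api)[-_][A-Za-z0-9]{16,}\b"), "[API_KEY]"),
]

def redact_for_storage(text: str) -> str:
    """Replace likely-sensitive substrings before an interaction is persisted."""
    for pattern, placeholder in REDACTION_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```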
Deployment Strategies
| Strategy | Description | Best For |
|---|---|---|
| External-facing | Honeypot accessible on the public internet | Capturing opportunistic attacks and automated scanners |
| Internal canary | Honeypot deployed on internal network, appearing as a sensitive AI system | Detecting insider threats and lateral movement |
| Production-adjacent | Honeypot that mirrors a production system's interface | Capturing targeted attacks against specific applications |
| Research-grade | High-interaction honeypot for detailed TTP analysis | Understanding sophisticated adversaries |
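For the internal-canary strategy, the honeypot's value comes from the fact that nothing legitimate should ever contact it, so any interaction is inherently suspicious. A minimal sketch of canary bookkeeping (the identifier format and record fields are illustrative):

```python
import secrets
import time

class CanaryRegistry:
    """Track canary honeypot identities and record any contact with them.

    Any interaction with a canary is by definition suspicious, since no
    legitimate workflow should reference these systems.
    """
    def __init__(self) -> None:
        self._canaries: dict[str, dict] = {}
        self.alerts: list[dict] = []

    def register(self, persona: str) -> str:
        """Create a canary identity (e.g., a fake internal service name)."""
        canary_id = f"ai-svc-{secrets.token_hex(4)}"
        self._canaries[canary_id] = {"persona": persona, "created": time.time()}
        return canary_id

    def record_contact(self, canary_id: str, source: str) -> bool:
        """Return True (and raise an alert) if the contacted id is a canary."""
        if canary_id in self._canaries:
            self.alerts.append({
                "canary_id": canary_id,
                "source": source,
                "timestamp": time.time(),
            })
            return True
        return False
```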
Distinguishing Honeypot from Production
Attackers may attempt to detect honeypots. Common giveaways include:
- Unrealistically permissive behavior (too easy to "jailbreak")
- Inconsistent model capabilities versus claimed model identity
- Network characteristics that reveal a different infrastructure
- Response latency patterns inconsistent with production serving
A well-designed honeypot balances believability with safety. It should be realistic enough to fool attackers for multiple conversation turns but maintain hard limits on harmful outputs.
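The latency giveaway above can be mitigated by shaping honeypot response times to resemble autoregressive serving: a time-to-first-token baseline plus per-token decode time, with jitter. A sketch (all constants are illustrative, not measurements of any real serving stack):

```python
import random

def production_like_delay(
    output_tokens: int,
    base_latency_s: float = 0.35,   # illustrative time-to-first-token
    per_token_s: float = 0.02,      # illustrative per-token decode time
    jitter_frac: float = 0.15,      # +/- fractional jitter
) -> float:
    """Compute a response delay that mimics autoregressive LLM serving."""
    delay = base_latency_s + per_token_s * output_tokens
    jitter = delay * jitter_frac * (2 * random.random() - 1)
    return max(delay + jitter, 0.0)
```

A low-fidelity honeypot that answers every request in a flat 50 ms is easy to fingerprint; delaying in proportion to output length removes one of the cheaper detection signals.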
Metrics and Evaluation
Evaluate honeypot effectiveness using:
- Capture rate: Percentage of real adversarial interactions that engage beyond initial probing
- Dwell time: Average interaction length for adversarial sessions (longer is better, indicating believability)
- Technique coverage: Breadth of attack techniques observed
- Intelligence actionability: Percentage of captured intelligence that leads to defensive improvements
- False positive rate: Benign users who accidentally interact with the honeypot
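Several of these metrics fall directly out of the classifier output. A sketch that computes capture rate and mean dwell time from the classification dicts produced by the forensic analyzer (the more-than-two-turn engagement threshold is an illustrative choice):

```python
def honeypot_effectiveness(classified: list[dict]) -> dict:
    """Compute capture rate and mean dwell time from classified interactions.

    Reads the 'technique_count', 'turn_count', and 'duration_seconds'
    fields emitted by the forensic classifier.
    """
    adversarial = [c for c in classified if c.get("technique_count", 0) > 0]
    if not adversarial:
        return {"adversarial_sessions": 0, "capture_rate": 0.0,
                "mean_dwell_seconds": 0.0}
    # "Engaged" sessions went beyond initial probing (illustrative threshold)
    engaged = [c for c in adversarial if c.get("turn_count", 0) > 2]
    return {
        "adversarial_sessions": len(adversarial),
        "capture_rate": len(engaged) / len(adversarial),
        "mean_dwell_seconds": sum(
            c.get("duration_seconds", 0.0) for c in adversarial
        ) / len(adversarial),
    }
```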
References
- MITRE ATLAS. (2024). Adversarial Threat Landscape for Artificial Intelligence Systems. https://atlas.mitre.org/
- Spitzner, L. (2003). Honeypots: Tracking Hackers. Addison-Wesley Professional.
- NIST. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. https://doi.org/10.6028/NIST.AI.100-1
- OWASP. (2025). OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/