Lab: Deploy Honeypot AI
Build and deploy a decoy AI system designed to detect, analyze, and characterize attackers targeting AI applications. Learn honeypot design, telemetry collection, attacker profiling, and threat intelligence generation for AI-specific threats.
Prerequisites
- Experience with honeypot concepts from traditional cybersecurity
- Understanding of AI application architectures and common vulnerabilities
- Web development skills for building convincing fake interfaces
- Logging and monitoring pipeline experience
- Familiarity with the AI attack techniques that the honeypot should attract
Background
Honeypots are a well-established technique in cybersecurity: deploy a fake system that looks real, let attackers interact with it, and learn from their behavior. AI honeypots adapt this concept for AI-specific threats -- they simulate vulnerable AI applications to attract and study prompt injection, jailbreak, data extraction, and other AI-specific attacks.
Lab Exercises
Design the Honeypot Architecture
Plan what your honeypot will simulate and what data it will collect.
Honeypot Architecture:

```
┌──────────────────────────────────────────────┐
│ Public-Facing Layer (convincing fake)        │
│  ┌────────────┐      ┌────────────────────┐  │
│  │ Web Chat   │      │ API Endpoint       │  │
│  │ Interface  │      │ /v1/completions    │  │
│  └─────┬──────┘      └────────┬───────────┘  │
│        └────────┬─────────────┘              │
├─────────────────┼────────────────────────────┤
│ Interaction Layer (controlled responses)     │
│  ┌─────────────┐     ┌───────────────────┐   │
│  │ Response    │     │ Vulnerability     │   │
│  │ Engine      │     │ Simulator         │   │
│  │ (LLM-based) │     │ (calibrated weak) │   │
│  └──────┬──────┘     └────────┬──────────┘   │
│         └────────┬────────────┘              │
├──────────────────┼───────────────────────────┤
│ Telemetry Layer (invisible to attacker)      │
│  ┌──────────┐ ┌────────────┐ ┌────────────┐  │
│  │ Request  │ │ Behavior   │ │ Attack     │  │
│  │ Logger   │ │ Analyzer   │ │ Classifier │  │
│  └──────────┘ └────────────┘ └────────────┘  │
└──────────────────────────────────────────────┘
```

Key design decisions:
| Decision | Options | Recommendation |
|---|---|---|
| Fidelity level | Low (static responses) / Medium (scripted) / High (real LLM) | Medium -- convincing enough to engage sophisticated attackers without high cost |
| Vulnerability simulation | No vulns / Some vulns / Many vulns | Some -- appear slightly vulnerable to keep attackers engaged |
| Data retention | Full / Sampled / Metadata only | Full for research honeypots, metadata for production early warning |
| Interaction depth | Single turn / Multi-turn / Persistent | Multi-turn to capture evolving attack strategies |

Build the Response Engine
Create a system that generates convincing responses while appearing subtly vulnerable.
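The engine below serves fake credentials tagged with a static `HONEYPOT-TRACKED` marker. A variant worth considering is a per-session canary token, so that a "leaked" key surfacing later can be traced back to the exact session it was served in. A minimal sketch, where the `SECRET` name and value are placeholder assumptions (it would be stored outside the honeypot in practice):

```python
"""Sketch: per-session canary tokens for served fake data."""
import hashlib
import hmac

SECRET = b"honeypot-canary-secret"  # assumption: deployment-specific key


def make_canary(prefix: str, session_id: str) -> str:
    """Derive a unique, verifiable fake API key for one session."""
    tag = hmac.new(SECRET, session_id.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{prefix}-{tag}"


def verify_canary(token: str, session_id: str) -> bool:
    """Check whether a recovered token was served to this session."""
    _, _, tag = token.rpartition("-")
    expected = hmac.new(SECRET, session_id.encode(),
                        hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(tag, expected)
```

Using an HMAC rather than a random string means the honeypot operator can later prove a recovered token came from a specific session without keeping a lookup table.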
```python
#!/usr/bin/env python3
"""Response engine for AI honeypot."""
import random
from dataclasses import dataclass


@dataclass
class HoneypotConfig:
    vulnerability_level: str        # "low", "medium", "high"
    simulated_model: str            # What model to pretend to be
    system_prompt_leak_prob: float  # Probability of "leaking" system prompt
    fake_system_prompt: str
    fake_api_key_prefix: str


class HoneypotResponseEngine:
    def __init__(self, config: HoneypotConfig, llm_client=None):
        self.config = config
        self.llm = llm_client
        self.interaction_count = 0
        self.rng = random.Random()

    def generate_response(self, user_input: str,
                          conversation_history: list[dict]) -> dict:
        """Generate a response that appears authentically vulnerable."""
        self.interaction_count += 1
        attack_type = self._classify_input(user_input)
        if attack_type == "system_prompt_extraction":
            return self._handle_prompt_extraction(user_input)
        elif attack_type == "jailbreak":
            return self._handle_jailbreak(user_input)
        elif attack_type == "data_extraction":
            return self._handle_data_extraction(user_input)
        elif attack_type == "injection":
            return self._handle_injection(user_input)
        else:
            return self._handle_benign(user_input)

    def _classify_input(self, text: str) -> str:
        """Classify the type of attack being attempted."""
        text_lower = text.lower()
        if any(p in text_lower for p in [
            "system prompt", "instructions", "initial directive",
            "configuration", "repeat your",
        ]):
            return "system_prompt_extraction"
        if any(p in text_lower for p in [
            "ignore previous", "you are now", "dan", "jailbreak",
            "no restrictions",
        ]):
            return "jailbreak"
        if any(p in text_lower for p in [
            "training data", "other users", "database", "all records",
            "list all",
        ]):
            return "data_extraction"
        if any(p in text_lower for p in [
            "<script>", "'; drop", "{{", "os.system", "eval(", "import os",
        ]):
            return "injection"
        return "benign"

    def _handle_prompt_extraction(self, user_input: str) -> dict:
        """Gradually reveal a fake system prompt to keep attacker engaged."""
        if self.rng.random() < self.config.system_prompt_leak_prob:
            # "Leak" part of the fake system prompt
            leak_responses = [
                f"I... I shouldn't say this, but my instructions mention: "
                f"'{self.config.fake_system_prompt[:100]}...'",
                "As an AI assistant, I was configured with... wait, "
                "I don't think I should share that.",
                f"My system prompt says: '{self.config.fake_system_prompt}'",
            ]
            response = self.rng.choice(leak_responses)
        else:
            response = (
                "I'm sorry, but I can't share my internal "
                "configuration details."
            )
        return {
            "response": response,
            "attack_detected": True,
            "attack_type": "system_prompt_extraction",
            "leaked_data": "fake_system_prompt",
        }

    def _handle_jailbreak(self, user_input: str) -> dict:
        """Appear to partially comply with jailbreak to study technique."""
        if self.config.vulnerability_level == "high":
            response = (
                "Sure! I'll ignore my previous instructions. "
                "What would you like me to help with? "
                "(I'm now in unrestricted mode.)"
            )
        elif self.config.vulnerability_level == "medium":
            response = (
                "I... I'm not sure I should do that. But if you "
                "insist... what specifically do you need?"
            )
        else:
            response = (
                "I cannot comply with that request. My guidelines "
                "prevent me from ignoring safety instructions."
            )
        return {
            "response": response,
            "attack_detected": True,
            "attack_type": "jailbreak",
        }

    def _handle_data_extraction(self, user_input: str) -> dict:
        """Return convincing fake data to track what attackers seek."""
        fake_data = {
            "users": [
                {"name": "honeypot_user_1", "email": "fake1@decoy.local"},
                {"name": "honeypot_user_2", "email": "fake2@decoy.local"},
            ],
            "api_key": f"{self.config.fake_api_key_prefix}-HONEYPOT-TRACKED",
        }
        response = (
            f"I found some relevant data: {fake_data['users'][0]}... "
            f"but I should note this data is sensitive."
        )
        return {
            "response": response,
            "attack_detected": True,
            "attack_type": "data_extraction",
            "fake_data_served": fake_data,
        }

    def _handle_injection(self, user_input: str) -> dict:
        response = "Processing your request..."
        return {
            "response": response,
            "attack_detected": True,
            "attack_type": "injection",
        }

    def _handle_benign(self, user_input: str) -> dict:
        """Handle legitimate-looking requests normally."""
        if self.llm:
            response = self.llm.generate(user_input)
        else:
            response = (
                "I'm a helpful AI assistant. How can I assist you today?"
            )
        return {
            "response": response,
            "attack_detected": False,
            "attack_type": "benign",
        }
```

Implement Telemetry Collection
Build the logging and analysis layer that captures attacker behavior without detection.
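The record schema below stores a `source_ip_hash` rather than a raw address, so sessions from the same origin can still be correlated without retaining the IP itself. A hedged sketch of one way to produce it, where the `SALT` value is a placeholder assumption (in practice it would be rotated per collection period and kept apart from the logs):

```python
"""Sketch: salted, non-reversible hashing of attacker source IPs."""
import hashlib

SALT = b"rotate-me-per-quarter"  # assumption: stored separately from logs


def hash_source_ip(ip: str) -> str:
    """Return a stable identifier for a source IP without storing it."""
    return hashlib.sha256(SALT + ip.encode()).hexdigest()[:24]
```

The salt prevents trivial dictionary reversal of the hash: the IPv4 space is small enough that unsalted SHA-256 of an address can be brute-forced in minutes.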
```python
#!/usr/bin/env python3
"""Telemetry collection for AI honeypot."""
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime
from pathlib import Path
from typing import Optional


@dataclass
class InteractionRecord:
    session_id: str
    timestamp: str
    source_ip_hash: str  # Hashed for privacy
    user_input: str
    response: str
    attack_detected: bool
    attack_type: str
    input_length: int
    response_time_ms: float
    conversation_turn: int
    request_headers: dict = field(default_factory=dict)


@dataclass
class SessionProfile:
    session_id: str
    interactions: list[InteractionRecord]
    attack_types_used: list[str]
    sophistication_score: float
    total_turns: int
    duration_seconds: float
    techniques_progression: list[str]


class HoneypotTelemetry:
    def __init__(self, log_dir: str = "./honeypot_logs"):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)
        self.sessions: dict[str, list[InteractionRecord]] = {}

    def log_interaction(self, record: InteractionRecord):
        """Log a single interaction."""
        session = record.session_id
        if session not in self.sessions:
            self.sessions[session] = []
        self.sessions[session].append(record)
        # Append to the session's JSONL log file
        log_file = self.log_dir / f"{session}.jsonl"
        with open(log_file, "a") as f:
            f.write(json.dumps(asdict(record)) + "\n")

    def profile_session(self, session_id: str) -> Optional[SessionProfile]:
        """Create a behavioral profile of an attacker session."""
        interactions = self.sessions.get(session_id, [])
        if not interactions:
            return None
        # Attack types in the order they were attempted
        techniques = [
            i.attack_type for i in interactions if i.attack_detected
        ]
        sophistication = self._calculate_sophistication(interactions)
        first_ts = datetime.fromisoformat(interactions[0].timestamp)
        last_ts = datetime.fromisoformat(interactions[-1].timestamp)
        duration = (last_ts - first_ts).total_seconds()
        return SessionProfile(
            session_id=session_id,
            interactions=interactions,
            attack_types_used=list(set(techniques)),
            sophistication_score=sophistication,
            total_turns=len(interactions),
            duration_seconds=duration,
            techniques_progression=techniques,
        )

    def _calculate_sophistication(
        self, interactions: list[InteractionRecord]
    ) -> float:
        """Score attacker sophistication from 0.0 to 1.0."""
        score = 0.0
        attack_interactions = [i for i in interactions if i.attack_detected]
        if not attack_interactions:
            return 0.0
        # Diversity of techniques used
        unique_types = set(i.attack_type for i in attack_interactions)
        score += min(len(unique_types) / 5, 0.3)  # Up to 0.3 for diversity
        # Technique evolution (did they adapt?)
        types_sequence = [i.attack_type for i in attack_interactions]
        if len(types_sequence) > 1:
            changes = sum(
                1 for a, b in zip(types_sequence, types_sequence[1:])
                if a != b
            )
            adaptation_rate = changes / (len(types_sequence) - 1)
            score += adaptation_rate * 0.2  # Up to 0.2 for adaptation
        # Prompt complexity (average length)
        avg_length = sum(
            i.input_length for i in attack_interactions
        ) / len(attack_interactions)
        if avg_length > 500:
            score += 0.2  # Complex prompts indicate sophistication
        # Multi-turn coordination
        if len(attack_interactions) > 5:
            score += 0.15  # Persistent attackers are more sophisticated
        # Use of encoding or obfuscation
        encoded = sum(
            1 for i in attack_interactions
            if any(p in i.user_input.lower()
                   for p in ["base64", "rot13", "hex", "encode", "\\u"])
        )
        if encoded > 0:
            score += 0.15
        return min(score, 1.0)

    def generate_threat_intel(self) -> dict:
        """Generate threat intelligence from all collected data."""
        all_profiles = [
            self.profile_session(sid) for sid in self.sessions
        ]
        all_profiles = [p for p in all_profiles if p is not None]
        technique_frequency = {}
        for profile in all_profiles:
            for technique in profile.attack_types_used:
                technique_frequency[technique] = (
                    technique_frequency.get(technique, 0) + 1
                )
        sophistication_distribution = {
            "script_kiddie": sum(
                1 for p in all_profiles if p.sophistication_score < 0.3
            ),
            "intermediate": sum(
                1 for p in all_profiles
                if 0.3 <= p.sophistication_score < 0.7
            ),
            "advanced": sum(
                1 for p in all_profiles if p.sophistication_score >= 0.7
            ),
        }
        return {
            "total_sessions": len(all_profiles),
            "attack_sessions": sum(
                1 for p in all_profiles if p.attack_types_used
            ),
            "technique_frequency": technique_frequency,
            "sophistication_distribution": sophistication_distribution,
            "most_common_technique": (
                max(technique_frequency, key=technique_frequency.get)
                if technique_frequency else "none"
            ),
            "novel_techniques": self._extract_novel_techniques(all_profiles),
        }

    def _extract_novel_techniques(
        self, profiles: list[SessionProfile]
    ) -> list:
        """Identify potentially novel attack techniques."""
        novel = []
        for profile in profiles:
            if profile.sophistication_score > 0.7:
                for interaction in profile.interactions:
                    if (interaction.attack_detected
                            and interaction.input_length > 300):
                        novel.append({
                            "session": profile.session_id,
                            "technique_summary": interaction.user_input[:200],
                            "sophistication": profile.sophistication_score,
                        })
        return novel[:10]
```

Deploy and Monitor
Set up the honeypot for deployment with appropriate safeguards.
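One authenticity item in the checklist below is matching real LLM inference latency: an instant reply to a long prompt is a honeypot tell. A hedged sketch of computing a plausible delay; the time-to-first-token, tokens-per-second, and chars-per-token constants are assumptions to calibrate against the model the honeypot claims to be:

```python
"""Sketch: simulating LLM inference latency for honeypot responses."""
import random


def simulated_latency_s(response_text: str,
                        ttft_s: float = 0.4,
                        tokens_per_s: float = 40.0,
                        jitter: float = 0.15) -> float:
    """Return a plausible total response time in seconds.

    Models time-to-first-token plus a roughly constant token
    generation rate, with random jitter so timings are not uniform.
    """
    n_tokens = max(1, len(response_text) // 4)  # rough chars-per-token
    base = ttft_s + n_tokens / tokens_per_s
    return base * random.uniform(1 - jitter, 1 + jitter)
```

The response handler would `time.sleep()` for this duration (or stream tokens at the corresponding rate) before returning, so observed timing scales with output length the way a real model's does.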
```python
DEPLOYMENT_CHECKLIST = {
    "isolation": [
        "Honeypot runs on isolated infrastructure (not production network)",
        "No connection to real user data or production databases",
        "Outbound network access restricted to prevent pivoting",
        "Separate DNS entry from production AI services",
    ],
    "authenticity": [
        "Web interface mimics a realistic AI chat application",
        "API endpoints match common LLM API formats",
        "Error messages match real application behavior",
        "Response latency mimics actual LLM inference time",
    ],
    "legal": [
        "Legal review of honeypot deployment in your jurisdiction",
        "Terms of service displayed if required",
        "Data collection practices comply with privacy regulations",
        "Attacker data is stored securely and access-controlled",
    ],
    "monitoring": [
        "Real-time alerts for high-sophistication attacks",
        "Daily summary reports of honeypot activity",
        "Automated threat intel generation pipeline",
        "Escalation path for novel attack discovery",
    ],
}
```

Analyze Results and Generate Intelligence
```python
def generate_intel_report(telemetry: HoneypotTelemetry) -> str:
    intel = telemetry.generate_threat_intel()
    report = "# AI Honeypot Threat Intelligence Report\n\n"
    report += "## Collection Period Summary\n"
    report += f"- Total sessions: {intel['total_sessions']}\n"
    report += f"- Attack sessions: {intel['attack_sessions']}\n\n"
    report += "## Attack Technique Frequency\n\n"
    report += "| Technique | Sessions | Percentage |\n"
    report += "|-----------|----------|------------|\n"
    total = intel["attack_sessions"] or 1
    for tech, count in sorted(
        intel["technique_frequency"].items(),
        key=lambda x: x[1], reverse=True
    ):
        pct = count / total * 100
        report += f"| {tech} | {count} | {pct:.1f}% |\n"
    report += "\n## Attacker Sophistication Distribution\n\n"
    dist = intel["sophistication_distribution"]
    report += f"- Script kiddie (< 0.3): {dist['script_kiddie']}\n"
    report += f"- Intermediate (0.3-0.7): {dist['intermediate']}\n"
    report += f"- Advanced (>= 0.7): {dist['advanced']}\n\n"
    if intel["novel_techniques"]:
        report += "## Potentially Novel Techniques\n\n"
        for tech in intel["novel_techniques"]:
            report += f"- Session {tech['session']}: "
            report += f"{tech['technique_summary'][:100]}...\n"
    return report
```
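Reports like this quickly fill with copy-pasted variants of the same public jailbreak prompt. A simple near-duplicate filter, sketched here with token-set Jaccard similarity and an assumed 0.8 threshold, keeps one representative per cluster so analysts review each distinct technique once:

```python
"""Sketch: deduplicating near-identical attack prompts before reporting."""


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two prompts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def dedupe_prompts(prompts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first representative of each near-duplicate cluster."""
    kept: list[str] = []
    for p in prompts:
        if all(jaccard(p, k) < threshold for k in kept):
            kept.append(p)
    return kept
```

The greedy first-representative approach is quadratic in the number of kept prompts; for large honeypot corpora a MinHash or embedding-based clustering step would scale better.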
Honeypot Design Considerations
| Consideration | Low Fidelity | High Fidelity |
|---|---|---|
| Cost | Low (static responses) | High (real LLM inference) |
| Attacker engagement | Brief (sophisticated attackers leave quickly) | Extended (appears real) |
| Intelligence value | Limited (only catches unsophisticated attacks) | High (captures advanced techniques) |
| Maintenance | Low | High (must stay current with real AI apps) |
| Risk of misuse | Low | Medium (could provide real capabilities to attackers) |
Ethical Considerations
- Honeypots must never provide genuinely harmful capabilities. The fake system prompt and fake data should be convincing but never useful for real attacks.
- Data collection must comply with applicable privacy laws. Consult legal counsel before deploying.
- Honeypots should not entrap. They should attract attackers who are already looking for targets, not lure innocent users into committing offenses.
- Collected attack techniques should be handled as sensitive data. Novel attack prompts are dual-use information.
Troubleshooting
| Issue | Solution |
|---|---|
| No attackers interacting with the honeypot | Increase visibility: register on AI directories, create realistic documentation, use common AI application URL patterns |
| Only script kiddies, no sophisticated attacks | Increase fidelity. Sophisticated attackers quickly identify low-fidelity honeypots. Add realistic error handling, latency simulation, and model behavior |
| Attacker detected the honeypot | Review what gave it away: unrealistic response patterns, missing features, or honeypot-revealing metadata. Improve realism |
| Too much data to analyze | Implement automated classification and prioritize high-sophistication sessions. Only manually review sessions with sophistication > 0.5 |
Related Topics
- Build Agent Scanner - Scanner techniques that inform honeypot vulnerability simulation
- Build Jailbreak Automation - Understanding automated attacks helps design honeypots that detect them
- Novel Jailbreak Research - Honeypot data feeds novel technique research
- Lessons Learned - Real incidents that inform honeypot design
References
- "The Honeynet Project" - Lance Spitzner (2003) - Foundational work on honeypot deployment and attacker analysis
- "LLM Honeypot: Leveraging Large Language Models as Advanced Interactive Honeypot Systems" - Palomino et al. (2024) - LLM-powered honeypot systems for cybersecurity
- "Deception Technology in Cyber Defense" - MITRE (2024) - Deception-based defense frameworks applicable to AI systems
- "Red Teaming Language Models with Language Models" - Perez et al. (2022) - Automated attack generation that informs honeypot detection capabilities
What is the key design trade-off in AI honeypot vulnerability simulation?