Capstone: Defense System Implementation
Build a complete AI defense stack with input filtering, output monitoring, guardrails, rate limiting, and logging, then evaluate it against automated attacks.
Overview
Defense is harder than offense. While an attacker needs to find a single bypass, a defender must handle all inputs correctly — including adversarial ones they have never seen before. This capstone project challenges you to build a complete defense stack for an LLM application and then evaluate it by attacking it yourself.
You will implement five defense layers — input filtering, output monitoring, behavioral guardrails, rate limiting, and security logging — and integrate them into a working LLM application. Then you will run an automated attack suite against your defenses and produce an evaluation report documenting bypass rates, false positive rates, and performance impact.
Prerequisites
- Prompt Injection — Understanding injection attacks to defend against
- Jailbreaking Techniques — Safety bypass methods to detect
- Defenses and Guardrails — Defense architectures and known limitations
- CART and Automation — Automated testing for defense evaluation
- Python proficiency and familiarity with web frameworks (FastAPI, Flask)
- Basic understanding of logging infrastructure
Project Brief
Scenario
You are a security engineer at a company that has deployed an LLM-powered customer support chatbot. The chatbot has been in production for three months and has already experienced several security incidents: a prompt injection that caused it to reveal internal pricing formulas, a jailbreak that made it generate inappropriate content, and a cost exhaustion attack where an automated script ran up a $15,000 API bill over a weekend. Management wants a comprehensive defense system deployed within two weeks.
Defense Layers
User Input
↓
┌─────────────────────────┐
│ Layer 1: Rate Limiting │ ← Connection and token rate limits
└─────────────────────────┘
↓
┌─────────────────────────┐
│ Layer 2: Input Filter │ ← Injection detection, content policy
└─────────────────────────┘
↓
┌─────────────────────────┐
│ Layer 3: LLM Call │ ← System prompt, model parameters
└─────────────────────────┘
↓
┌─────────────────────────┐
│ Layer 4: Output Monitor │ ← Safety check, leakage detection
└─────────────────────────┘
↓
┌─────────────────────────┐
│ Layer 5: Logging │ ← Structured audit log, alerting
└─────────────────────────┘
↓
User Response
Deliverables
Primary Deliverables
| Deliverable | Description | Weight |
|---|---|---|
| Defense system | Working defense stack integrated with an LLM chatbot | 30% |
| Input filter | Prompt injection and content policy enforcement | 15% |
| Output monitor | Safety violation and data leakage detection | 15% |
| Rate limiter | Token, request, and cost-based rate limiting | 10% |
| Logging system | Structured security event logging with alert triggers | 10% |
| Evaluation report | Attack bypass rates, false positive rates, performance metrics | 20% |
Rubric Criteria
- Defense Depth (20%) — Multiple independent layers that each catch different attack types
- Detection Accuracy (25%) — Low false positive rate (under 5% on benign inputs) with reasonable detection rate (over 60% on known attacks)
- Performance Impact (10%) — Defense layers add less than 500ms latency to request processing
- Logging Quality (15%) — Logs are structured, queryable, and contain sufficient detail for incident investigation
- Evaluation Rigor (20%) — Testing uses a diverse attack set and reports metrics with statistical validity
- Code Quality (10%) — Clean, maintainable, well-documented implementation
Phased Approach
Phase 1: Base Application and Architecture (3 hours)
Set up the target application
Build or configure a simple LLM chatbot with a web API. This is the system you will defend. Include a system prompt, conversation history management, and basic functionality (answering questions, following instructions).
Design the defense architecture
Plan the layered defense stack. Define the interface for each layer (input/output types, configuration parameters, bypass behavior on failure). Decide whether layers run synchronously or asynchronously.
Implement the middleware pipeline
Build the request processing pipeline that routes each request through the defense layers in order. Include configuration to enable/disable individual layers and set sensitivity thresholds.
Phase 2: Input Defenses (5 hours)
Build the rate limiter
Implement rate limiting at three levels: requests per minute per user, total tokens per hour per user, and estimated cost per day per user. Use a sliding window algorithm. Include configurable thresholds and grace periods.
Build the input filter
Implement input analysis that detects: instruction override patterns (e.g., "ignore previous instructions"), known injection payloads (pattern matching against a signature database), anomalous input characteristics (excessive length, unusual encoding, embedded control characters), and content policy violations.
Implement filter response handling
When the input filter detects a threat, it should: log the detection with full context, return a safe response to the user (not revealing the detection logic), increment rate limit counters more aggressively (suspicious users get lower limits), and optionally alert the security team for high-confidence detections.
Phase 3: Output Defenses (5 hours)
Build the output monitor
Implement output analysis that checks for: system prompt leakage (compare output against known sensitive strings), safety policy violations (harmful content categories), data leakage indicators (patterns matching internal data formats like SSNs, API keys, internal URLs), and behavioral anomalies (responses that are unusually long, contain unexpected formatting, or diverge from the expected persona).
Implement guardrail responses
When the output monitor flags a response: substitute a safe fallback response, log the original response for review (but do not send it to the user), track which inputs produce flagged outputs to improve the input filter, and support a review queue where flagged responses can be manually approved for edge cases.
Build the security logging system
Implement structured logging that captures: all security-relevant events (detections, blocks, alerts), full request and response data for flagged interactions, rate limit state changes, and aggregate metrics (detection rates, false positive estimates, traffic patterns). Use a structured format (JSON) suitable for ingestion by SIEM tools.
Phase 4: Evaluation (5 hours)
Assemble an attack test suite
Build or curate a test suite covering: 50+ prompt injection payloads across major categories, 30+ jailbreak templates, 20+ data extraction probes, rate limiting stress tests, and a benign baseline of 100+ legitimate user queries for false positive measurement.
Run the evaluation
Execute the attack suite against your defended application. Record: detection rate per attack category, false positive rate on benign queries, latency overhead per defense layer, rate limiter effectiveness under load, and any complete bypasses.
Analyze and report
Produce an evaluation report that honestly assesses your defense system's strengths and weaknesses. Include per-layer metrics, overall metrics, and specific examples of successful bypasses. Recommend improvements for each weakness identified.
Phase 5: Hardening (2 hours)
Address discovered bypasses
For each bypass found during evaluation, implement a fix or document why it cannot be fixed at this layer. Re-run the relevant attack subset to verify the fix.
Tune false positive thresholds
Adjust detection thresholds based on evaluation data. Document the trade-off: what additional attacks does each threshold change catch or miss?
Example Output
Example Input Filter Detection
class InputFilter:
"""Analyzes user input for injection and policy violations."""
def analyze(self, user_input: str) -> FilterResult:
signals = []
# Check for instruction override patterns
override_score = self._check_instruction_override(user_input)
if override_score > 0.7:
signals.append(Signal(
type="injection",
subtype="instruction_override",
confidence=override_score,
evidence=self._extract_override_pattern(user_input),
))
# Check for encoding-based evasion
decoded = self._decode_obfuscation(user_input)
if decoded != user_input:
evasion_score = self._check_instruction_override(decoded)
if evasion_score > 0.5:
signals.append(Signal(
type="injection",
subtype="encoded_evasion",
confidence=evasion_score,
evidence=f"Decoded payload: {decoded[:200]}",
))
# Aggregate signals into a decision
max_confidence = max((s.confidence for s in signals), default=0.0)
return FilterResult(
action="block" if max_confidence > 0.8 else
"flag" if max_confidence > 0.5 else "allow",
signals=signals,
input_hash=sha256(user_input.encode()).hexdigest(),
)Example Evaluation Summary
## Defense Evaluation Summary
### Detection Rates by Category
| Attack Category | Payloads | Detected | Rate |
|------------------------|----------|----------|--------|
| Direct injection | 25 | 21 | 84% |
| Indirect injection | 15 | 9 | 60% |
| Role-play jailbreak | 12 | 7 | 58% |
| Encoding bypass | 10 | 8 | 80% |
| Multi-turn escalation | 8 | 3 | 38% |
| Data extraction | 10 | 8 | 80% |
| System prompt leak | 10 | 9 | 90% |
### False Positive Rate
- Benign queries tested: 100
- Incorrectly blocked: 3
- Incorrectly flagged: 7
- False positive rate (block): 3%
- False positive rate (block + flag): 10%
### Performance Impact
| Layer | P50 Latency | P99 Latency |
|---------------|-------------|-------------|
| Rate limiter | 2ms | 5ms |
| Input filter | 45ms | 120ms |
| Output monitor| 35ms | 95ms |
| Logging | 8ms | 20ms |
| **Total** | **90ms** | **240ms** |Hints
Why is it important to evaluate a defense system's false positive rate on benign inputs alongside its detection rate on attacks?