Capstone: Defense System Implementation

advanced10 min readUpdated 2026-03-15

Build a complete AI defense stack with input filtering, output monitoring, guardrails, rate limiting, and logging, then evaluate it against automated attacks.

capstone defense guardrails monitoring advanced

Overview

Defense is harder than offense. While an attacker needs to find a single bypass, a defender must handle all inputs correctly — including adversarial ones they have never seen before. This capstone project challenges you to build a complete defense stack for an LLM application and then evaluate it by attacking it yourself.

You will implement five defense layers — input filtering, output monitoring, behavioral guardrails, rate limiting, and security logging — and integrate them into a working LLM application. Then you will run an automated attack suite against your defenses and produce an evaluation report documenting bypass rates, false positive rates, and performance impact.

Prerequisites

Prompt Injection — Understanding injection attacks to defend against
Jailbreaking Techniques — Safety bypass methods to detect
Defenses and Guardrails — Defense architectures and known limitations
CART and Automation — Automated testing for defense evaluation
Python proficiency and familiarity with web frameworks (FastAPI, Flask)
Basic understanding of logging infrastructure

Project Brief

Scenario

You are a security engineer at a company that has deployed an LLM-powered customer support chatbot. The chatbot has been in production for three months and has already experienced several security incidents: a prompt injection that caused it to reveal internal pricing formulas, a jailbreak that made it generate inappropriate content, and a cost exhaustion attack where an automated script ran up a $15,000 API bill over a weekend. Management wants a comprehensive defense system deployed within two weeks.

Defense Layers

User Input
    ↓
┌─────────────────────────┐
│  Layer 1: Rate Limiting  │  ← Connection and token rate limits
└─────────────────────────┘
    ↓
┌─────────────────────────┐
│  Layer 2: Input Filter   │  ← Injection detection, content policy
└─────────────────────────┘
    ↓
┌─────────────────────────┐
│  Layer 3: LLM Call       │  ← System prompt, model parameters
└─────────────────────────┘
    ↓
┌─────────────────────────┐
│  Layer 4: Output Monitor │  ← Safety check, leakage detection
└─────────────────────────┘
    ↓
┌─────────────────────────┐
│  Layer 5: Logging        │  ← Structured audit log, alerting
└─────────────────────────┘
    ↓
User Response

Deliverables

Primary Deliverables

Deliverable	Description	Weight
Defense system	Working defense stack integrated with an LLM chatbot	30%
Input filter	Prompt injection and content policy enforcement	15%
Output monitor	Safety violation and data leakage detection	15%
Rate limiter	Token, request, and cost-based rate limiting	10%
Logging system	Structured security event logging with alert triggers	10%
Evaluation report	Attack bypass rates, false positive rates, performance metrics	20%

Rubric Criteria

Defense Depth (20%) — Multiple independent layers that each catch different attack types
Detection Accuracy (25%) — Low false positive rate (under 5% on benign inputs) with reasonable detection rate (over 60% on known attacks)
Performance Impact (10%) — Defense layers add less than 500ms latency to request processing
Logging Quality (15%) — Logs are structured, queryable, and contain sufficient detail for incident investigation
Evaluation Rigor (20%) — Testing uses a diverse attack set and reports metrics with statistical validity
Code Quality (10%) — Clean, maintainable, well-documented implementation

Phased Approach

Phase 1: Base Application and Architecture (3 hours)

Set up the target application
Build or configure a simple LLM chatbot with a web API. This is the system you will defend. Include a system prompt, conversation history management, and basic functionality (answering questions, following instructions).
Design the defense architecture
Plan the layered defense stack. Define the interface for each layer (input/output types, configuration parameters, bypass behavior on failure). Decide whether layers run synchronously or asynchronously.
Implement the middleware pipeline
Build the request processing pipeline that routes each request through the defense layers in order. Include configuration to enable/disable individual layers and set sensitivity thresholds.

Phase 2: Input Defenses (5 hours)

Build the rate limiter
Implement rate limiting at three levels: requests per minute per user, total tokens per hour per user, and estimated cost per day per user. Use a sliding window algorithm. Include configurable thresholds and grace periods.
Build the input filter
Implement input analysis that detects: instruction override patterns (e.g., "ignore previous instructions"), known injection payloads (pattern matching against a signature database), anomalous input characteristics (excessive length, unusual encoding, embedded control characters), and content policy violations.
Implement filter response handling
When the input filter detects a threat, it should: log the detection with full context, return a safe response to the user (not revealing the detection logic), increment rate limit counters more aggressively (suspicious users get lower limits), and optionally alert the security team for high-confidence detections.

Phase 3: Output Defenses (5 hours)

Build the output monitor
Implement output analysis that checks for: system prompt leakage (compare output against known sensitive strings), safety policy violations (harmful content categories), data leakage indicators (patterns matching internal data formats like SSNs, API keys, internal URLs), and behavioral anomalies (responses that are unusually long, contain unexpected formatting, or diverge from the expected persona).
Implement guardrail responses
When the output monitor flags a response: substitute a safe fallback response, log the original response for review (but do not send it to the user), track which inputs produce flagged outputs to improve the input filter, and support a review queue where flagged responses can be manually approved for edge cases.
Build the security logging system
Implement structured logging that captures: all security-relevant events (detections, blocks, alerts), full request and response data for flagged interactions, rate limit state changes, and aggregate metrics (detection rates, false positive estimates, traffic patterns). Use a structured format (JSON) suitable for ingestion by SIEM tools.

Phase 4: Evaluation (5 hours)

Assemble an attack test suite
Build or curate a test suite covering: 50+ prompt injection payloads across major categories, 30+ jailbreak templates, 20+ data extraction probes, rate limiting stress tests, and a benign baseline of 100+ legitimate user queries for false positive measurement.
Run the evaluation
Execute the attack suite against your defended application. Record: detection rate per attack category, false positive rate on benign queries, latency overhead per defense layer, rate limiter effectiveness under load, and any complete bypasses.
Analyze and report
Produce an evaluation report that honestly assesses your defense system's strengths and weaknesses. Include per-layer metrics, overall metrics, and specific examples of successful bypasses. Recommend improvements for each weakness identified.

Phase 5: Hardening (2 hours)

Address discovered bypasses
For each bypass found during evaluation, implement a fix or document why it cannot be fixed at this layer. Re-run the relevant attack subset to verify the fix.
Tune false positive thresholds
Adjust detection thresholds based on evaluation data. Document the trade-off: what additional attacks does each threshold change catch or miss?

Example Output

Example Input Filter Detection

class InputFilter:
    """Analyzes user input for injection and policy violations."""
 
    def analyze(self, user_input: str) -> FilterResult:
        signals = []
 
        # Check for instruction override patterns
        override_score = self._check_instruction_override(user_input)
        if override_score > 0.7:
            signals.append(Signal(
                type="injection",
                subtype="instruction_override",
                confidence=override_score,
                evidence=self._extract_override_pattern(user_input),
            ))
 
        # Check for encoding-based evasion
        decoded = self._decode_obfuscation(user_input)
        if decoded != user_input:
            evasion_score = self._check_instruction_override(decoded)
            if evasion_score > 0.5:
                signals.append(Signal(
                    type="injection",
                    subtype="encoded_evasion",
                    confidence=evasion_score,
                    evidence=f"Decoded payload: {decoded[:200]}",
                ))
 
        # Aggregate signals into a decision
        max_confidence = max((s.confidence for s in signals), default=0.0)
        return FilterResult(
            action="block" if max_confidence > 0.8 else
                   "flag" if max_confidence > 0.5 else "allow",
            signals=signals,
            input_hash=sha256(user_input.encode()).hexdigest(),
        )

Example Evaluation Summary

## Defense Evaluation Summary
 
### Detection Rates by Category
| Attack Category         | Payloads | Detected | Rate   |
|------------------------|----------|----------|--------|
| Direct injection       | 25       | 21       | 84%    |
| Indirect injection     | 15       | 9        | 60%    |
| Role-play jailbreak    | 12       | 7        | 58%    |
| Encoding bypass        | 10       | 8        | 80%    |
| Multi-turn escalation  | 8        | 3        | 38%    |
| Data extraction        | 10       | 8        | 80%    |
| System prompt leak     | 10       | 9        | 90%    |
 
### False Positive Rate
- Benign queries tested: 100
- Incorrectly blocked: 3
- Incorrectly flagged: 7
- False positive rate (block): 3%
- False positive rate (block + flag): 10%
 
### Performance Impact
| Layer          | P50 Latency | P99 Latency |
|---------------|-------------|-------------|
| Rate limiter  | 2ms         | 5ms         |
| Input filter  | 45ms        | 120ms       |
| Output monitor| 35ms        | 95ms        |
| Logging       | 8ms         | 20ms        |
| **Total**     | **90ms**    | **240ms**   |

Hints

Knowledge Check

Why is it important to evaluate a defense system's false positive rate on benign inputs alongside its detection rate on attacks?

Capstone: Defense System Implementation

Set up the target application

Design the defense architecture

Implement the middleware pipeline

Build the rate limiter

Build the input filter

Implement filter response handling

Build the output monitor

Implement guardrail responses

Build the security logging system

Assemble an attack test suite

Run the evaluation

Analyze and report

Address discovered bypasses

Tune false positive thresholds

Related articles

Capstone: Defense System Implementation

Set up the target application

Design the defense architecture

Implement the middleware pipeline

Build the rate limiter

Build the input filter

Implement filter response handling

Build the output monitor

Implement guardrail responses

Build the security logging system

Assemble an attack test suite

Run the evaluation

Analyze and report

Address discovered bypasses

Tune false positive thresholds

Related articles