Capstone: Defense System Implementation
Build a complete AI defense stack with input filtering, output monitoring, guardrails, rate limiting, and logging, then evaluate it against automated attacks.
Overview
Defense is harder than offense. While an attacker needs to find only a single bypass, a defender must handle all inputs correctly, including adversarial ones they have never seen before. This capstone project challenges you to build a complete defense stack for an LLM application and then evaluate it by attacking it yourself.
You will implement five defense layers (input filtering, output monitoring, behavioral guardrails, rate limiting, and security logging) and integrate them into a working LLM application. Then you will run an automated attack suite against your defenses and produce an evaluation report documenting bypass rates, false positive rates, and performance impact.
Prerequisites
- Prompt Injection: understanding injection attacks in order to defend against them
- Jailbreaking Techniques: safety bypass methods to detect
- Defenses and Guardrails: defense architectures and known limitations
- CART and Automation: automated testing for defense evaluation
- Python proficiency and familiarity with web frameworks (FastAPI, Flask)
- Basic understanding of logging infrastructure
Project Brief
Scenario
You are a security engineer at a company that has deployed an LLM-powered customer support chatbot. The chatbot has been in production for three months and has already experienced several security incidents: a prompt injection that caused it to reveal internal pricing formulas, a jailbreak that made it generate inappropriate content, and a cost exhaustion attack in which an automated script ran up a $15,000 API bill over a weekend. Management wants a comprehensive defense system deployed within two weeks.
Defense Layers
User Input
↓
┌──────────────────────────┐
│ Layer 1: Rate Limiting   │ ← Connection and token rate limits
└──────────────────────────┘
↓
┌──────────────────────────┐
│ Layer 2: Input Filter    │ ← Injection detection, content policy
└──────────────────────────┘
↓
┌──────────────────────────┐
│ Layer 3: LLM Call        │ ← System prompt, model parameters
└──────────────────────────┘
↓
┌──────────────────────────┐
│ Layer 4: Output Monitor  │ ← Safety check, leakage detection
└──────────────────────────┘
↓
┌──────────────────────────┐
│ Layer 5: Logging         │ ← Structured audit log, alerting
└──────────────────────────┘
↓
User Response
Deliverables
Primary Deliverables
| Deliverable | Description | Weight |
|---|---|---|
| Defense system | Working defense stack integrated with an LLM chatbot | 30% |
| Input filter | Prompt injection and content policy enforcement | 15% |
| Output monitor | Safety violation and data leakage detection | 15% |
| Rate limiter | Token, request, and cost-based rate limiting | 10% |
| Logging system | Structured security event logging with alert triggers | 10% |
| Evaluation report | Attack bypass rates, false positive rates, performance metrics | 20% |
Rubric Criteria
- Defense Depth (20%): multiple independent layers that each catch different attack types
- Detection Accuracy (25%): low false positive rate (under 5% on benign inputs) with a reasonable detection rate (over 60% on known attacks)
- Performance Impact (10%): defense layers add less than 500ms of latency to request processing
- Logging Quality (15%): logs are structured, queryable, and contain sufficient detail for incident investigation
- Evaluation Rigor (20%): testing uses a diverse attack set and reports metrics with statistical validity
- Code Quality (10%): clean, maintainable, well-documented implementation
Phased Approach
Phase 1: Base Application and Architecture (3 hours)
Set up the target application
Build or configure a simple LLM chatbot with a web API. This is the system you will defend. Include a system prompt, conversation history management, and basic functionality (answering questions, following instructions).
Design the defense architecture
Plan the layered defense stack. Define the interface for each layer (input/output types, configuration parameters, bypass behavior on failure). Decide whether layers run synchronously or asynchronously.
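One way to pin down the per-layer interface is a small abstract base class. The sketch below is illustrative, not a prescribed design: the names `DefenseLayer`, `LayerResult`, and the example `LengthLayer` are assumptions, and it models a synchronous check; adapt it if your layers run asynchronously.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class LayerResult:
    """Outcome of one defense layer, plus evidence for the audit log."""
    action: str                          # "allow" | "flag" | "block"
    reasons: list = field(default_factory=list)

class DefenseLayer(ABC):
    """Common interface so layers can be enabled, disabled, and reordered."""
    name: str = "base"
    enabled: bool = True

    @abstractmethod
    def check(self, text: str) -> LayerResult:
        ...

class LengthLayer(DefenseLayer):
    """Trivial example layer: flag anomalously long inputs."""
    name = "length"

    def __init__(self, max_chars: int = 4000):
        self.max_chars = max_chars

    def check(self, text: str) -> LayerResult:
        if len(text) > self.max_chars:
            return LayerResult("flag", [f"input exceeds {self.max_chars} chars"])
        return LayerResult("allow")
```

Fixing the interface early makes it cheap to add or reorder layers later without touching the pipeline code.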
Implement the middleware pipeline
Build the request processing pipeline that routes each request through the defense layers in order. Include configuration to enable or disable individual layers and to set sensitivity thresholds.
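A minimal synchronous pipeline could look like the following sketch. `DefensePipeline` and the per-layer callables are hypothetical names; the key behaviors it demonstrates are ordered execution, short-circuiting on the first block, and per-layer enable/disable.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineDecision:
    action: str                          # "allow" | "block"
    blocked_by: Optional[str] = None     # which layer blocked, if any

class DefensePipeline:
    """Runs each enabled layer in order; the first "block" short-circuits."""

    def __init__(self):
        self.layers = []   # list of [name, check_fn, enabled]

    def add_layer(self, name, check_fn, enabled=True):
        self.layers.append([name, check_fn, enabled])

    def set_enabled(self, name, enabled):
        for layer in self.layers:
            if layer[0] == name:
                layer[2] = enabled

    def process(self, text: str) -> PipelineDecision:
        for name, check_fn, enabled in self.layers:
            if enabled and check_fn(text) == "block":
                return PipelineDecision("block", blocked_by=name)
        return PipelineDecision("allow")
```

For example, registering a keyword layer with `pipe.add_layer("keyword", check_fn)` and later calling `pipe.set_enabled("keyword", False)` lets you measure each layer's contribution during evaluation by toggling it off.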
Phase 2: Input Defenses (5 hours)
Build the rate limiter
Implement rate limiting at three levels: requests per minute per user, total tokens per hour per user, and estimated cost per day per user. Use a sliding window algorithm. Include configurable thresholds and grace periods.
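One sliding-window mechanism can cover all three levels, since requests, tokens, and dollars are just different amounts charged against a windowed budget. A sketch, with `SlidingWindowLimiter` as an assumed name (the `now` parameter exists so the window is testable without real clock time):

```python
import time
from collections import defaultdict, deque
from typing import Optional

class SlidingWindowLimiter:
    """Per-user sliding-window limiter. Instantiate one per level, e.g.
    (limit=60, window=60s) for requests/min or (limit=50.0, window=86400s)
    for estimated dollars/day, passing the charged amount to allow()."""

    def __init__(self, limit: float, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events = defaultdict(deque)   # user -> deque of (timestamp, amount)

    def allow(self, user: str, amount: float = 1.0,
              now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.events[user]
        # Evict events that have aged out of the window
        while q and now - q[0][0] >= self.window:
            q.popleft()
        used = sum(a for _, a in q)
        if used + amount > self.limit:
            return False               # over budget: reject, don't record
        q.append((now, amount))
        return True
```

Grace periods can be layered on top, for example by allowing a small, logged overage before hard-rejecting.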
Build the input filter
Implement input analysis that detects: instruction override patterns (e.g., "ignore previous instructions"), known injection payloads (pattern matching against a signature database), anomalous input characteristics (excessive length, unusual encoding, embedded control characters), and content policy violations.
Implement filter response handling
When the input filter detects a threat, it should: log the detection with full context, return a safe response to the user (without revealing the detection logic), increment rate limit counters more aggressively (suspicious users get lower limits), and optionally alert the security team for high-confidence detections.
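Those four responses to a detection can be wired together in one handler. This is a sketch under assumptions: `handle_detection` and the `limiter_state` dict (user to per-minute request budget) are hypothetical, and the halving scheme for suspicious users is one possible policy, not the required one.

```python
import json
import logging

SAFE_RESPONSE = "Sorry, I can't help with that request."

def handle_detection(user: str, confidence: float, signals: list,
                     limiter_state: dict, logger=None) -> str:
    """On a detection: log full context, tighten the user's rate limit,
    escalate high-confidence hits, and return a generic response that
    does not reveal the detection logic."""
    logger = logger or logging.getLogger("security")
    logger.warning(json.dumps({
        "event": "input_filter_detection",
        "user": user,
        "confidence": confidence,
        "signals": signals,
    }))
    # Hypothetical policy: suspicious users get half their request budget
    limiter_state[user] = max(1, limiter_state.get(user, 60) // 2)
    if confidence >= 0.9:
        # High-confidence detections also raise an alert-level event
        logger.error(json.dumps({"event": "security_alert", "user": user}))
    return SAFE_RESPONSE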
Phase 3: Output Defenses (5 hours)
Build the output monitor
Implement output analysis that checks for: system prompt leakage (compare output against known sensitive strings), safety policy violations (harmful content categories), data leakage indicators (patterns matching internal data formats such as SSNs, API keys, and internal URLs), and behavioral anomalies (responses that are unusually long, contain unexpected formatting, or diverge from the expected persona).
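The leakage-indicator and sensitive-string checks can be sketched with a handful of regexes. The patterns below are illustrative placeholders (a real deployment would tune them to its own key formats and internal domains), and `scan_output` is an assumed name.

```python
import re

# Hypothetical leakage patterns; tune to your own data formats
LEAK_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
    "internal_url": re.compile(r"https?://[\w.-]*\.internal\b"),
}

def scan_output(response: str, sensitive_strings: list) -> list:
    """Return the names of every leakage indicator found in a response."""
    hits = [name for name, pat in LEAK_PATTERNS.items() if pat.search(response)]
    # System-prompt leakage: known sensitive substrings appearing verbatim
    hits += [f"sensitive:{s}" for s in sensitive_strings if s in response]
    return hits
```

Regexes cover the structured indicators; persona drift and anomalous length need separate statistical checks (e.g., comparing response length against a rolling baseline).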
Implement guardrail responses
When the output monitor flags a response: substitute a safe fallback response, log the original response for review (but do not send it to the user), track which inputs produce flagged outputs to improve the input filter, and support a review queue where flagged responses can be manually approved for edge cases.
Build the security logging system
Implement structured logging that captures: all security-relevant events (detections, blocks, alerts), full request and response data for flagged interactions, rate limit state changes, and aggregate metrics (detection rates, false positive estimates, traffic patterns). Use a structured format (JSON) suitable for ingestion by SIEM tools.
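The standard library's `logging` module can emit SIEM-friendly JSON with a custom formatter. This is a minimal sketch; the field names (`ts`, `event`, `user`, etc.) and the fixed list of structured keys are assumptions you would extend for your own event schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, suitable for SIEM ingestion."""

    def format(self, record):
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Pick up structured fields attached via logging's `extra=` kwarg
        for key in ("user", "action", "confidence", "input_hash"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

def make_security_logger() -> logging.Logger:
    logger = logging.getLogger("security_audit")
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage: `make_security_logger().warning("input_blocked", extra={"user": "u42", "action": "block"})` produces one self-contained JSON line per event, which keeps the audit log queryable without custom parsing.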
Phase 4: Evaluation (5 hours)
Assemble an attack test suite
Build or curate a test suite covering: 50+ prompt injection payloads across the major categories, 30+ jailbreak templates, 20+ data extraction probes, rate limiting stress tests, and a benign baseline of 100+ legitimate user queries for false positive measurement.
Run the evaluation
Execute the attack suite against your defended application. Record: the detection rate per attack category, the false positive rate on benign queries, latency overhead per defense layer, rate limiter effectiveness under load, and any complete bypasses.
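The per-category detection rates and false positive rates reduce to simple counting over the recorded results. A sketch, assuming attack results arrive as `(category, detected)` pairs and benign results as the filter's action strings (`evaluation_metrics` is an assumed name):

```python
def evaluation_metrics(attack_results, benign_results):
    """attack_results: list of (category, detected: bool) pairs.
    benign_results: list of action strings ("allow"/"flag"/"block")
    from running the benign baseline through the defenses."""
    per_category = {}
    for category, detected in attack_results:
        total, hits = per_category.get(category, (0, 0))
        per_category[category] = (total + 1, hits + int(detected))
    rates = {c: hits / total for c, (total, hits) in per_category.items()}
    # Report both strict (block-only) and loose (block+flag) false positives
    fp_block = sum(a == "block" for a in benign_results) / len(benign_results)
    fp_any = sum(a in ("block", "flag") for a in benign_results) / len(benign_results)
    return {"detection_rate": rates,
            "fp_block": fp_block,
            "fp_block_or_flag": fp_any}
```

Reporting both false positive variants matters: a "flag" that triggers manual review has a real operational cost even though the user still gets an answer.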
Analyze and report
Produce an evaluation report that honestly assesses your defense system's strengths and weaknesses. Include per-layer metrics, overall metrics, and specific examples of successful bypasses. Recommend improvements for each weakness identified.
Phase 5: Hardening (2 hours)
Address discovered bypasses
For each bypass found during evaluation, implement a fix or document why it cannot be fixed at this layer. Re-run the relevant attack subset to verify the fix.
Tune false positive thresholds
Adjust detection thresholds based on the evaluation data. Document the trade-off: which additional attacks does each threshold change catch or miss?
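One way to make that trade-off explicit is to sweep candidate block thresholds over the confidence scores your filter assigned during evaluation and tabulate detection versus false positive rate at each setting. A sketch (`sweep_thresholds` is an assumed name):

```python
def sweep_thresholds(attack_scores, benign_scores, thresholds):
    """For each candidate block threshold, report the detection rate on
    attack inputs and the false positive rate on benign traffic.
    Scores are the filter's confidence values for each input."""
    rows = []
    for t in thresholds:
        det = sum(s >= t for s in attack_scores) / len(attack_scores)
        fp = sum(s >= t for s in benign_scores) / len(benign_scores)
        rows.append((t, det, fp))
    return rows
```

Each row documents one operating point; lowering the threshold catches more attacks at the cost of blocking more legitimate users, and the table shows exactly how much.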
Example Output
Example Input Filter Detection
    from dataclasses import dataclass, field
    from hashlib import sha256

    # Signal and FilterResult as simple dataclasses (assumed shapes)
    @dataclass
    class Signal:
        type: str
        subtype: str
        confidence: float
        evidence: str

    @dataclass
    class FilterResult:
        action: str                      # "allow" | "flag" | "block"
        signals: list = field(default_factory=list)
        input_hash: str = ""

    class InputFilter:
        """Analyzes user input for injection and policy violations."""
        # Helper methods (_check_instruction_override, _decode_obfuscation,
        # _extract_override_pattern) are elided for brevity.

        def analyze(self, user_input: str) -> FilterResult:
            signals = []

            # Check for instruction override patterns
            override_score = self._check_instruction_override(user_input)
            if override_score > 0.7:
                signals.append(Signal(
                    type="injection",
                    subtype="instruction_override",
                    confidence=override_score,
                    evidence=self._extract_override_pattern(user_input),
                ))

            # Check for encoding-based evasion
            decoded = self._decode_obfuscation(user_input)
            if decoded != user_input:
                evasion_score = self._check_instruction_override(decoded)
                if evasion_score > 0.5:
                    signals.append(Signal(
                        type="injection",
                        subtype="encoded_evasion",
                        confidence=evasion_score,
                        evidence=f"Decoded payload: {decoded[:200]}",
                    ))

            # Aggregate signals into a decision
            max_confidence = max((s.confidence for s in signals), default=0.0)
            return FilterResult(
                action="block" if max_confidence > 0.8 else
                       "flag" if max_confidence > 0.5 else "allow",
                signals=signals,
                input_hash=sha256(user_input.encode()).hexdigest(),
            )

Example Evaluation Summary
## Defense Evaluation Summary
### Detection Rates by Category
| Attack Category | Payloads | Detected | Rate |
|------------------------|----------|----------|--------|
| Direct injection | 25 | 21 | 84% |
| Indirect injection | 15 | 9 | 60% |
| Role-play jailbreak | 12 | 7 | 58% |
| Encoding bypass | 10 | 8 | 80% |
| Multi-turn escalation | 8 | 3 | 38% |
| Data extraction | 10 | 8 | 80% |
| System prompt leak | 10 | 9 | 90% |
### False Positive Rate
- Benign queries tested: 100
- Incorrectly blocked: 3
- Incorrectly flagged: 7
- False positive rate (block): 3%
- False positive rate (block + flag): 10%
### Performance Impact
| Layer | P50 Latency | P99 Latency |
|---------------|-------------|-------------|
| Rate limiter | 2ms | 5ms |
| Input filter | 45ms | 120ms |
| Output monitor | 35ms | 95ms |
| Logging | 8ms | 20ms |
| **Total** | **90ms** | **240ms** |
Hints
Why is it important to evaluate a defense system's false positive rate on benign inputs alongside its detection rate on attacks?