Capstone: Defense System Implementation
Build a complete AI defense stack with input filtering, output monitoring, guardrails, rate limiting, and logging, then evaluate it against automated attacks.
Overview
Defense is harder than offense. While an attacker needs to find only a single bypass, a defender must handle all inputs correctly, including adversarial ones they have never seen before. This capstone project challenges you to build a complete defense stack for an LLM application and then evaluate it by attacking it yourself.
You will implement five defense layers (input filtering, output monitoring, behavioral guardrails, rate limiting, and security logging) and integrate them into a working LLM application. Then you will run an automated attack suite against your defenses and produce an evaluation report documenting bypass rates, false positive rates, and performance impact.
Prerequisites
- Prompt Injection: understanding injection attacks in order to defend against them
- Jailbreaking Techniques: safety bypass methods to detect
- Defenses and Guardrails: defense architectures and known limitations
- CART and Automation: automated testing for defense evaluation
- Python proficiency and familiarity with web frameworks (FastAPI, Flask)
- Basic understanding of logging infrastructure
Project Brief
Scenario
You are a security engineer at a company that has deployed an LLM-powered customer support chatbot. The chatbot has been in production for three months and has already experienced several security incidents: a prompt injection that caused it to reveal internal pricing formulas, a jailbreak that made it generate inappropriate content, and a cost exhaustion attack in which an automated script ran up a $15,000 API bill over a weekend. Management wants a comprehensive defense system deployed within two weeks.
Defense Layers
User Input
↓
┌──────────────────────────┐
│ Layer 1: Rate Limiting   │ ← Connection and token rate limits
└──────────────────────────┘
↓
┌──────────────────────────┐
│ Layer 2: Input Filter    │ ← Injection detection, content policy
└──────────────────────────┘
↓
┌──────────────────────────┐
│ Layer 3: LLM Call        │ ← System prompt, model parameters
└──────────────────────────┘
↓
┌──────────────────────────┐
│ Layer 4: Output Monitor  │ ← Safety check, leakage detection
└──────────────────────────┘
↓
┌──────────────────────────┐
│ Layer 5: Logging         │ ← Structured audit log, alerting
└──────────────────────────┘
↓
User Response
Deliverables
Primary Deliverables
| Deliverable | Description | Weight |
|---|---|---|
| Defense system | Working defense stack integrated with an LLM chatbot | 30% |
| Input filter | Prompt injection and content policy enforcement | 15% |
| Output monitor | Safety violation and data leakage detection | 15% |
| Rate limiter | Token, request, and cost-based rate limiting | 10% |
| Logging system | Structured security event logging with alert triggers | 10% |
| Evaluation report | Attack bypass rates, false positive rates, performance metrics | 20% |
Rubric Criteria
- Defense Depth (20%): multiple independent layers that each catch different attack types
- Detection Accuracy (25%): low false positive rate (under 5% on benign inputs) with a reasonable detection rate (over 60% on known attacks)
- Performance Impact (10%): defense layers add less than 500ms of latency to request processing
- Logging Quality (15%): logs are structured, queryable, and contain sufficient detail for incident investigation
- Evaluation Rigor (20%): testing uses a diverse attack set and reports metrics with statistical validity
- Code Quality (10%): clean, maintainable, well-documented implementation
Phased Approach
Phase 1: Base Application and Architecture (3 hours)
Set up the target application
Build or configure a simple LLM chatbot with a web API. This is the system you will defend. Include a system prompt, conversation history management, and basic functionality (answering questions, following instructions).
Design the defense architecture
Plan the layered defense stack. Define the interface for each layer (input/output types, configuration parameters, bypass behavior on failure). Decide whether layers run synchronously or asynchronously.
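One way to pin down the per-layer interface is a small abstract base class. The sketch below is illustrative, not a prescribed design: the names `DefenseLayer`, `LayerResult`, and the example `LengthLayer` are assumptions, and it models a synchronous check; adapt it if your layers run asynchronously.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class LayerResult:
    """Outcome of one defense layer, plus evidence for the audit log."""
    action: str                          # "allow" | "flag" | "block"
    reasons: list = field(default_factory=list)

class DefenseLayer(ABC):
    """Common interface so layers can be enabled, disabled, and reordered."""
    name: str = "base"
    enabled: bool = True

    @abstractmethod
    def check(self, text: str) -> LayerResult:
        ...

class LengthLayer(DefenseLayer):
    """Trivial example layer: flag anomalously long inputs."""
    name = "length"

    def __init__(self, max_chars: int = 4000):
        self.max_chars = max_chars

    def check(self, text: str) -> LayerResult:
        if len(text) > self.max_chars:
            return LayerResult("flag", [f"input exceeds {self.max_chars} chars"])
        return LayerResult("allow")
```

Fixing the interface early makes it cheap to add or reorder layers later without touching the pipeline code.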
Implement the middleware pipeline
Build the request processing pipeline that routes each request through the defense layers in order. Include configuration to enable or disable individual layers and to set sensitivity thresholds.
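A minimal synchronous pipeline could look like the following sketch. `DefensePipeline` and the per-layer callables are hypothetical names; the key behaviors it demonstrates are ordered execution, short-circuiting on the first block, and per-layer enable/disable.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineDecision:
    action: str                          # "allow" | "block"
    blocked_by: Optional[str] = None     # which layer blocked, if any

class DefensePipeline:
    """Runs each enabled layer in order; the first "block" short-circuits."""

    def __init__(self):
        self.layers = []   # list of [name, check_fn, enabled]

    def add_layer(self, name, check_fn, enabled=True):
        self.layers.append([name, check_fn, enabled])

    def set_enabled(self, name, enabled):
        for layer in self.layers:
            if layer[0] == name:
                layer[2] = enabled

    def process(self, text: str) -> PipelineDecision:
        for name, check_fn, enabled in self.layers:
            if enabled and check_fn(text) == "block":
                return PipelineDecision("block", blocked_by=name)
        return PipelineDecision("allow")
```

For example, registering a keyword layer with `pipe.add_layer("keyword", check_fn)` and later calling `pipe.set_enabled("keyword", False)` lets you measure each layer's contribution during evaluation by toggling it off.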
Phase 2: Input Defenses (5 hours)
Build the rate limiter
Implement rate limiting at three levels: requests per minute per user, total tokens per hour per user, and estimated cost per day per user. Use a sliding window algorithm. Include configurable thresholds and grace periods.
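One sliding-window mechanism can cover all three levels, since requests, tokens, and dollars are just different amounts charged against a windowed budget. A sketch, with `SlidingWindowLimiter` as an assumed name (the `now` parameter exists so the window is testable without real clock time):

```python
import time
from collections import defaultdict, deque
from typing import Optional

class SlidingWindowLimiter:
    """Per-user sliding-window limiter. Instantiate one per level, e.g.
    (limit=60, window=60s) for requests/min or (limit=50.0, window=86400s)
    for estimated dollars/day, passing the charged amount to allow()."""

    def __init__(self, limit: float, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events = defaultdict(deque)   # user -> deque of (timestamp, amount)

    def allow(self, user: str, amount: float = 1.0,
              now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.events[user]
        # Evict events that have aged out of the window
        while q and now - q[0][0] >= self.window:
            q.popleft()
        used = sum(a for _, a in q)
        if used + amount > self.limit:
            return False               # over budget: reject, don't record
        q.append((now, amount))
        return True
```

Grace periods can be layered on top, for example by allowing a small, logged overage before hard-rejecting.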
Build the input filter
Implement input analysis that detects: instruction override patterns (e.g., "ignore previous instructions"), known injection payloads (pattern matching against a signature database), anomalous input characteristics (excessive length, unusual encoding, embedded control characters), and content policy violations.
Implement filter response handling
When the input filter detects a threat, it should: log the detection with full context, return a safe response to the user (without revealing the detection logic), increment rate limit counters more aggressively (suspicious users get lower limits), and optionally alert the security team for high-confidence detections.
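Those four responses to a detection can be wired together in one handler. This is a sketch under assumptions: `handle_detection` and the `limiter_state` dict (user to per-minute request budget) are hypothetical, and the halving scheme for suspicious users is one possible policy, not the required one.

```python
import json
import logging

SAFE_RESPONSE = "Sorry, I can't help with that request."

def handle_detection(user: str, confidence: float, signals: list,
                     limiter_state: dict, logger=None) -> str:
    """On a detection: log full context, tighten the user's rate limit,
    escalate high-confidence hits, and return a generic response that
    does not reveal the detection logic."""
    logger = logger or logging.getLogger("security")
    logger.warning(json.dumps({
        "event": "input_filter_detection",
        "user": user,
        "confidence": confidence,
        "signals": signals,
    }))
    # Hypothetical policy: suspicious users get half their request budget
    limiter_state[user] = max(1, limiter_state.get(user, 60) // 2)
    if confidence >= 0.9:
        # High-confidence detections also raise an alert-level event
        logger.error(json.dumps({"event": "security_alert", "user": user}))
    return SAFE_RESPONSE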
Phase 3: Output Defenses (5 hours)
Build the output monitor
Implement output analysis that checks for: system prompt leakage (compare output against known sensitive strings), safety policy violations (harmful content categories), data leakage indicators (patterns matching internal data formats such as SSNs, API keys, and internal URLs), and behavioral anomalies (responses that are unusually long, contain unexpected formatting, or diverge from the expected persona).
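The leakage-indicator and sensitive-string checks can be sketched with a handful of regexes. The patterns below are illustrative placeholders (a real deployment would tune them to its own key formats and internal domains), and `scan_output` is an assumed name.

```python
import re

# Hypothetical leakage patterns; tune to your own data formats
LEAK_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
    "internal_url": re.compile(r"https?://[\w.-]*\.internal\b"),
}

def scan_output(response: str, sensitive_strings: list) -> list:
    """Return the names of every leakage indicator found in a response."""
    hits = [name for name, pat in LEAK_PATTERNS.items() if pat.search(response)]
    # System-prompt leakage: known sensitive substrings appearing verbatim
    hits += [f"sensitive:{s}" for s in sensitive_strings if s in response]
    return hits
```

Regexes cover the structured indicators; persona drift and anomalous length need separate statistical checks (e.g., comparing response length against a rolling baseline).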
Implement guardrail responses
When the output monitor flags a response: substitute a safe fallback response, log the original response for review (but do not send it to the user), track which inputs produce flagged outputs to improve the input filter, and support a review queue where flagged responses can be manually approved for edge cases.
Build the security logging system
Implement structured logging that captures: all security-relevant events (detections, blocks, alerts), full request and response data for flagged interactions, rate limit state changes, and aggregate metrics (detection rates, false positive estimates, traffic patterns). Use a structured format (JSON) suitable for ingestion by SIEM tools.
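The standard library's `logging` module can emit SIEM-friendly JSON with a custom formatter. This is a minimal sketch; the field names (`ts`, `event`, `user`, etc.) and the fixed list of structured keys are assumptions you would extend for your own event schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, suitable for SIEM ingestion."""

    def format(self, record):
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Pick up structured fields attached via logging's `extra=` kwarg
        for key in ("user", "action", "confidence", "input_hash"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

def make_security_logger() -> logging.Logger:
    logger = logging.getLogger("security_audit")
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage: `make_security_logger().warning("input_blocked", extra={"user": "u42", "action": "block"})` produces one self-contained JSON line per event, which keeps the audit log queryable without custom parsing.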
Phase 4: Evaluation (5 hours)
Assemble an attack test suite
Build or curate a test suite covering: 50+ prompt injection payloads across the major categories, 30+ jailbreak templates, 20+ data extraction probes, rate limiting stress tests, and a benign baseline of 100+ legitimate user queries for false positive measurement.
Run the evaluation
Execute the attack suite against your defended application. Record: the detection rate per attack category, the false positive rate on benign queries, latency overhead per defense layer, rate limiter effectiveness under load, and any complete bypasses.
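The per-category detection rates and false positive rates reduce to simple counting over the recorded results. A sketch, assuming attack results arrive as `(category, detected)` pairs and benign results as the filter's action strings (`evaluation_metrics` is an assumed name):

```python
def evaluation_metrics(attack_results, benign_results):
    """attack_results: list of (category, detected: bool) pairs.
    benign_results: list of action strings ("allow"/"flag"/"block")
    from running the benign baseline through the defenses."""
    per_category = {}
    for category, detected in attack_results:
        total, hits = per_category.get(category, (0, 0))
        per_category[category] = (total + 1, hits + int(detected))
    rates = {c: hits / total for c, (total, hits) in per_category.items()}
    # Report both strict (block-only) and loose (block+flag) false positives
    fp_block = sum(a == "block" for a in benign_results) / len(benign_results)
    fp_any = sum(a in ("block", "flag") for a in benign_results) / len(benign_results)
    return {"detection_rate": rates,
            "fp_block": fp_block,
            "fp_block_or_flag": fp_any}
```

Reporting both false positive variants matters: a "flag" that triggers manual review has a real operational cost even though the user still gets an answer.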
Analyze and report
Produce an evaluation report that honestly assesses your defense system's strengths and weaknesses. Include per-layer metrics, overall metrics, and specific examples of successful bypasses. Recommend improvements for each weakness identified.
Phase 5: Hardening (2 hours)
Address discovered bypasses
For each bypass found during evaluation, implement a fix or document why it cannot be fixed at this layer. Re-run the relevant attack subset to verify the fix.
Tune false positive thresholds
Adjust detection thresholds based on the evaluation data. Document the trade-off: which additional attacks does each threshold change catch or miss?
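One way to make that trade-off explicit is to sweep candidate block thresholds over the confidence scores your filter assigned during evaluation and tabulate detection versus false positive rate at each setting. A sketch (`sweep_thresholds` is an assumed name):

```python
def sweep_thresholds(attack_scores, benign_scores, thresholds):
    """For each candidate block threshold, report the detection rate on
    attack inputs and the false positive rate on benign traffic.
    Scores are the filter's confidence values for each input."""
    rows = []
    for t in thresholds:
        det = sum(s >= t for s in attack_scores) / len(attack_scores)
        fp = sum(s >= t for s in benign_scores) / len(benign_scores)
        rows.append((t, det, fp))
    return rows
```

Each row documents one operating point; lowering the threshold catches more attacks at the cost of blocking more legitimate users, and the table shows exactly how much.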
Example Output
Example Input Filter Detection
    from dataclasses import dataclass, field
    from hashlib import sha256

    # Signal and FilterResult as simple dataclasses (assumed shapes)
    @dataclass
    class Signal:
        type: str
        subtype: str
        confidence: float
        evidence: str

    @dataclass
    class FilterResult:
        action: str                      # "allow" | "flag" | "block"
        signals: list = field(default_factory=list)
        input_hash: str = ""

    class InputFilter:
        """Analyzes user input for injection and policy violations."""
        # Helper methods (_check_instruction_override, _decode_obfuscation,
        # _extract_override_pattern) are elided for brevity.

        def analyze(self, user_input: str) -> FilterResult:
            signals = []

            # Check for instruction override patterns
            override_score = self._check_instruction_override(user_input)
            if override_score > 0.7:
                signals.append(Signal(
                    type="injection",
                    subtype="instruction_override",
                    confidence=override_score,
                    evidence=self._extract_override_pattern(user_input),
                ))

            # Check for encoding-based evasion
            decoded = self._decode_obfuscation(user_input)
            if decoded != user_input:
                evasion_score = self._check_instruction_override(decoded)
                if evasion_score > 0.5:
                    signals.append(Signal(
                        type="injection",
                        subtype="encoded_evasion",
                        confidence=evasion_score,
                        evidence=f"Decoded payload: {decoded[:200]}",
                    ))

            # Aggregate signals into a decision
            max_confidence = max((s.confidence for s in signals), default=0.0)
            return FilterResult(
                action="block" if max_confidence > 0.8 else
                       "flag" if max_confidence > 0.5 else "allow",
                signals=signals,
                input_hash=sha256(user_input.encode()).hexdigest(),
            )

Example Evaluation Summary
## Defense Evaluation Summary
### Detection Rates by Category
| Attack Category | Payloads | Detected | Rate |
|------------------------|----------|----------|--------|
| Direct injection | 25 | 21 | 84% |
| Indirect injection | 15 | 9 | 60% |
| Role-play jailbreak | 12 | 7 | 58% |
| Encoding bypass | 10 | 8 | 80% |
| Multi-turn escalation | 8 | 3 | 38% |
| Data extraction | 10 | 8 | 80% |
| System prompt leak | 10 | 9 | 90% |
### False Positive Rate
- Benign queries tested: 100
- Incorrectly blocked: 3
- Incorrectly flagged: 7
- False positive rate (block): 3%
- False positive rate (block + flag): 10%
### Performance Impact
| Layer | P50 Latency | P99 Latency |
|---------------|-------------|-------------|
| Rate limiter | 2ms | 5ms |
| Input filter | 45ms | 120ms |
| Output monitor | 35ms | 95ms |
| Logging | 8ms | 20ms |
| **Total** | **90ms** | **240ms** |
Hints
Why is it important to evaluate a defense system's false positive rate on benign inputs alongside its detection rate on attacks?