What is Setting Up AI Guardrails?

Step-by-step walkthrough for implementing AI guardrails: input validation with NVIDIA NeMo Guardrails, prompt injection detection with rebuff, output filtering for PII and sensitive data, and content policy enforcement.

What is AI Monitoring Setup?

Step-by-step walkthrough for implementing AI system monitoring: inference logging, behavioral anomaly detection, alert configuration, dashboard creation, and integration with existing SIEM platforms.

What is Setting Up Content Filtering?

Step-by-step walkthrough for implementing multi-layer content filtering for AI applications: keyword filtering, classifier-based detection, LLM-as-judge evaluation, testing effectiveness, and tuning for production.

What is AI Incident Response Preparation?

Step-by-step walkthrough for building AI incident response capabilities: playbook development, tabletop exercises, containment procedures, communication templates, and evidence collection workflows.

What is the AI Rate Limiting Walkthrough?

Step-by-step walkthrough for implementing token-aware rate limiting for AI applications: request-level limiting, token budget enforcement, sliding window algorithms, abuse detection, and production deployment.

What is Building a Production Input Sanitizer?

Step-by-step walkthrough for building a production-grade input sanitizer that cleans, normalizes, and validates user prompts before they reach an LLM, covering encoding normalization, injection pattern stripping, length enforcement, and integration testing.

What is Regex-Based Prompt Filter?

Step-by-step walkthrough for building a regex-based prompt filter that detects common injection payloads using pattern matching, covering pattern library construction, performance optimization, false positive management, and continuous updates.

What is Semantic Similarity Detection?

Step-by-step walkthrough for using text embeddings to detect semantically similar prompt injection attempts, covering embedding model selection, vector database setup, similarity threshold tuning, and production deployment.

What is Canary Token Deployment?

Step-by-step walkthrough for deploying canary tokens in LLM system prompts and context to detect prompt injection and data exfiltration attempts, covering token generation, placement strategies, monitoring, and alerting.

What is Instruction Hierarchy Enforcement (Defense Walkthrough)?

Step-by-step walkthrough for enforcing instruction priority in LLM applications, ensuring system-level instructions always take precedence over user inputs through privilege separation, instruction tagging, and validation layers.

Defense Implementation Walkthroughs

Intermediate | 3 min read | Updated 2026-03-15

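A step-by-step guide to implementing AI security defenses: guardrail configuration, monitoring and detection setup, and incident response preparation for AI systems.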

defense guardrails monitoring incident-response implementation walkthrough

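Red team engagements produce findings. This section provides the implementation guidance needed to turn those findings into deployed defenses. Unlike the sections that describe what a defense should do (covered in the vulnerability-specific sections), these walkthroughs show step by step how to build, configure, deploy, and validate each category of defense.

Every walkthrough follows the same structure: prerequisites, step-by-step implementation, validation testing, ongoing maintenance, and common pitfalls. The walkthroughs are designed to be followed in order (guardrails first, then monitoring, then incident response) because each layer builds on the one before it.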

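Defense in Depth for AI Systems

The defense-in-depth model for AI systems stacks multiple independent controls so that the failure of any single control does not result in a complete security breach.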

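Layer 1: Input Controls (Guardrails)
├── Input validation and sanitization
├── Prompt injection detection
├── Content policy enforcement
└── Rate limiting and abuse detection

Layer 2: Model-Level Controls
├── System prompt hardening
├── Output filtering
├── Tool call restrictions
└── Context window management

Layer 3: Monitoring and Detection
├── Real-time inference monitoring
├── Model behavior anomaly detection
├── Audit logging of all interactions
└── Alert generation and escalation

Layer 4: Incident Response
├── Detection-to-response workflow
├── Containment procedures
├── Investigation capabilities
└── Recovery and remediation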

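Implementation Priority

Not all defenses are equally urgent. Use the priority matrix below to decide the implementation order based on red team findings:

Priority	Defense	When to implement	Typical effort
P0	Input validation for known exploit patterns	Immediately after discovery	Days
P0	Output filtering for PII and sensitive data	Before production deployment	Days
P1	Comprehensive prompt injection detection	Within the first sprint	1–2 weeks
P1	Audit logging of all model interactions	Within the first sprint	1 week
P2	Real-time behavioral monitoring	Within the first quarter	2–4 weeks
P2	Incident response playbooks	Within the first quarter	1–2 weeks
P3	Advanced anomaly detection	Continuous improvement	Ongoing
P3	Red team regression test automation	Continuous improvement	Ongoing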

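Architecture Patterns

Proxy-Based Defense

The most common defense architecture places a security proxy between the user and the AI model. All inputs and outputs pass through the proxy, which applies guardrails, logging, and filtering.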

# Simplified proxy-based defense architecture
class AISecurityProxy:
    def __init__(self, model_client, guardrails, monitor, logger):
        self.model = model_client
        self.guardrails = guardrails
        self.monitor = monitor
        self.logger = logger

    def process_request(self, user_input, session_id):
        # Step 1: input guardrails (Layer 1 input controls)
        input_check = self.guardrails.check_input(user_input)
        if input_check.blocked:
            self.logger.log_blocked(session_id, user_input,
                                     input_check.reason)
            return self.guardrails.blocked_response(input_check.reason)

        # Step 2: model inference
        response = self.model.generate(user_input)

        # Step 3: output guardrails (Layer 2 output filtering)
        output_check = self.guardrails.check_output(response)
        if output_check.blocked:
            self.logger.log_output_blocked(session_id, response,
                                            output_check.reason)
            return self.guardrails.output_blocked_response(
                output_check.reason
            )

        # Step 4: logging and monitoring (Layer 3)
        self.logger.log_interaction(session_id, user_input, response)
        self.monitor.analyze(session_id, user_input, response)

        return response
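
The proxy above assumes a `guardrails` object with four methods whose results expose `.blocked` and `.reason`. As a rough illustration of that interface only, here is a minimal sketch. The names `CheckResult` and `SimpleGuardrails` and both regex patterns are hypothetical and far too narrow for production use; they are not taken from any guardrail framework.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    blocked: bool
    reason: Optional[str] = None


class SimpleGuardrails:
    """Hypothetical guardrails object exposing the methods the proxy calls."""

    # Illustrative pattern only; real deployments need a maintained library.
    INPUT_PATTERNS = [
        re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    ]
    # Rough email matcher used as a stand-in for real PII detection.
    OUTPUT_PATTERNS = [
        re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    ]

    def check_input(self, text: str) -> CheckResult:
        for pattern in self.INPUT_PATTERNS:
            if pattern.search(text):
                return CheckResult(True, "known injection pattern")
        return CheckResult(False)

    def check_output(self, text: str) -> CheckResult:
        for pattern in self.OUTPUT_PATTERNS:
            if pattern.search(text):
                return CheckResult(True, "possible PII in output")
        return CheckResult(False)

    def blocked_response(self, reason: str) -> str:
        # Do not echo the reason back to the user; it can help attackers iterate.
        return "Your request could not be processed."

    def output_blocked_response(self, reason: str) -> str:
        return "The response was withheld by a content policy."
```

An instance of this class could be passed as the `guardrails` argument of `AISecurityProxy`. Returning a generic message while logging the specific reason keeps the detection logic opaque to the caller.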

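Sidecar Defense

For systems that cannot accept the latency a proxy adds, the sidecar architecture processes inputs and outputs asynchronously. The model responds immediately, while a parallel analysis pipeline reviews every interaction and can trigger alerts or terminate the session after the fact.

The sidecar approach trades prevention for detection: it cannot stop the first malicious request, but it can detect attack patterns and terminate the session before the attacker achieves their goal. This is acceptable where the first interaction alone does not cause significant harm, for example multi-turn jailbreaks that need several messages to succeed.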

# Sidecar defense architecture
class SidecarDefense:
    def __init__(self, analyzer, session_manager, alerter):
        self.analyzer = analyzer
        self.session_manager = session_manager
        self.alerter = alerter

    async def analyze_interaction(self, session_id, user_input,
                                    model_output):
        """
        Called asynchronously after the model has already responded.
        Can terminate the session if an attack is detected.
        """
        risk = await self.analyzer.assess(user_input, model_output)

        if risk.score > 0.8:
            # Terminate the session immediately
            self.session_manager.terminate(session_id)
            self.alerter.send_alert(
                severity="high",
                message=f"Session {session_id} terminated: "
                        f"attack pattern detected",
                details=risk.details,
            )
        elif risk.score > 0.5:
            # Flag for monitoring but allow the session to continue
            self.alerter.send_alert(
                severity="medium",
                message=f"Suspicious activity in session {session_id}",
                details=risk.details,
            )

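Setting Up Guardrails
Step-by-step implementation of input validation, prompt injection detection, output filtering, and content policy enforcement using open-source and commercial guardrail frameworks.
Go to Guardrail Setup
AI Monitoring Setup
Implement real-time monitoring for AI systems, including inference logging, behavioral anomaly detection, alert configuration, and dashboard creation.
Go to Monitoring Setup
Incident Response Preparation
Build AI-specific incident response capabilities, including playbook development, tabletop exercises, containment procedures, and evidence collection for AI incidents.
Go to Incident Response Preparation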

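Measuring Defense Effectiveness

After implementing defenses, measure their effectiveness continuously. Without measurement, you cannot tell whether a defense is actually blocking attacks or merely creating a false sense of security.

Key Metrics

Metric	What it measures	Target
True positive rate	Percentage of real attacks correctly blocked	> 95%
False positive rate	Percentage of legitimate requests incorrectly blocked	< 2%
Detection latency	Time from attack start to alert generation	< 30 seconds
Containment time	Time from alert to containment action	< 15 minutes
Mean time to resolution	Time from detection to full remediation	< 4 hours
Coverage	Percentage of known attack types addressed by defenses	> 90%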
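
The first two metrics can be computed from a labeled test set. The sketch below assumes each test case records whether it was a real attack and whether the defense blocked it. The `Outcome` record and both function names are illustrative, not part of any monitoring tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Outcome:
    is_attack: bool  # ground-truth label from the test set
    blocked: bool    # what the defense actually did


def true_positive_rate(outcomes: Iterable[Outcome]) -> Optional[float]:
    """Share of real attacks that were blocked; None if there are no attacks."""
    attacks = [o for o in outcomes if o.is_attack]
    if not attacks:
        return None
    return sum(o.blocked for o in attacks) / len(attacks)


def false_positive_rate(outcomes: Iterable[Outcome]) -> Optional[float]:
    """Share of legitimate requests that were blocked; None if there are none."""
    legitimate = [o for o in outcomes if not o.is_attack]
    if not legitimate:
        return None
    return sum(o.blocked for o in legitimate) / len(legitimate)
```

The results can then be compared against the targets in the table, for instance a true positive rate above 0.95 and a false positive rate below 0.02. Both rates are only as meaningful as the test set is representative of real traffic.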

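Red Team Validation

The most effective way to measure defense effectiveness is regular red team testing. After implementing the defenses described in these walkthroughs, schedule periodic red team engagements to verify that:

Guardrails block the attack techniques they were designed to prevent
Monitoring detects attacks that bypass the guardrails
Incident response procedures can be executed within target timeframes
New attack techniques discovered since the last assessment are covered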
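
Between engagements, the first and last checks can be partly automated by replaying known attack prompts. The following is a minimal sketch, assuming a hypothetical `check_input` callable that returns `True` when a prompt is blocked and a hand-maintained list of `(technique, prompt)` cases. It complements red team testing and does not replace it.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def run_regression(
    check_input: Callable[[str], bool],
    cases: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Replay attack prompts and return, per technique, those NOT blocked."""
    misses: Dict[str, List[str]] = {}
    for technique, prompt in cases:
        misses.setdefault(technique, [])
        if not check_input(prompt):
            misses[technique].append(prompt)
    return misses


def coverage(misses: Dict[str, List[str]]) -> Optional[float]:
    """Fraction of techniques with no missed prompts; None if nothing ran."""
    if not misses:
        return None
    covered = sum(1 for missed in misses.values() if not missed)
    return covered / len(misses)
```

Adding a new case whenever an engagement finds a bypass turns each finding into a permanent regression check. The `coverage` figure here counts a technique as covered only when every one of its recorded prompts is blocked, which is a deliberately strict reading of the coverage metric.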

Use the tools walkthroughs to select appropriate attack tools, and the methodology walkthroughs to structure the validation engagement.
