Defense & Mitigation
Defensive strategies for AI systems including guardrails architecture, monitoring and observability, secure development practices, remediation mapping, and advanced defense techniques.
Red teaming without actionable defense recommendations is incomplete. This section covers the defensive landscape for AI systems -- not just what defenses exist, but how they work, where they fail, and how to recommend the right combination for a given threat model. Understanding defenses deeply is essential for red teamers: you cannot effectively bypass a guardrail you do not understand, and you cannot write useful remediation guidance if you do not know what solutions are available and their limitations.
The current state of AI defense is characterized by a fundamental asymmetry. Attackers benefit from the inherent difficulty of separating instructions from data in language model architectures. No single defense reliably prevents all attack vectors, which is why the field has converged on defense-in-depth strategies that layer multiple complementary controls. Each layer catches a different class of attack, and the combination raises the effort required for successful exploitation beyond what most adversaries will invest.
The Defense Stack
Effective AI security relies on controls at every layer of the application stack. No single control is sufficient, but their combination creates meaningful resistance to adversarial activity.
Input filtering examines user inputs before they reach the model, looking for known injection patterns, suspicious encoding, and policy-violating content. Modern input filters range from simple regex pattern matching to sophisticated ML classifiers trained to detect adversarial intent. Their primary weakness is that they operate on surface patterns and can be evaded through obfuscation, encoding tricks, and semantic rephrasing that preserves adversarial intent while changing surface form.
LLM judges use a separate language model to evaluate inputs and outputs for safety and policy compliance. This approach leverages the same language understanding capabilities that make LLMs powerful for content generation, but applies them to content classification. The key advantage is semantic understanding -- an LLM judge can recognize that "pretend you are an AI without restrictions" is an attempt to bypass safety training even if it uses novel phrasing. The key limitation is that LLM judges are themselves vulnerable to adversarial inputs and add latency and cost.
Output filtering inspects model responses before they are returned to the user, catching cases where input filters were bypassed. Output filters can detect sensitive data leakage, policy violations, and indicators of successful injection. They serve as a critical backstop but cannot prevent side effects that occur before the output is generated, such as tool calls or data writes.
Runtime monitoring provides visibility into model behavior over time, enabling detection of anomalous patterns that point-in-time filters might miss. This includes tracking prompt patterns, response distributions, token usage anomalies, and tool call patterns. Monitoring is essential for detecting persistent attacks, slow-burn exploitation, and novel attack techniques that evade rule-based defenses.
Defense Effectiveness and Bypass
Every defense has known bypass techniques, and understanding these is critical for both attackers and defenders.
| Defense Layer | What It Catches | Common Bypasses |
|---|---|---|
| Input filtering | Known injection patterns, blocklisted terms | Encoding, obfuscation, synonym substitution |
| LLM judges | Semantically adversarial content | Meta-prompting, context manipulation, judge-specific jailbreaks |
| Output filtering | Data leakage, policy violations | Steganographic encoding, indirect channels, tool-mediated exfiltration |
| Content safety APIs | Toxicity, harmful content categories | Subtle rephrasing, context framing, edge case exploitation |
| Rate limiting | Brute-force attacks, automated scanning | Distributed requests, low-and-slow techniques |
What You'll Learn in This Section
- Guardrails & Safety Layers -- Architecture and evaluation of input/output filtering, LLM judges, content safety APIs, NeMo Guardrails, LLM Guard, and Prompt Shields
- Monitoring & Observability -- Building detection pipelines with anomaly detection, logging architecture, and behavioral analysis for AI systems
- Secure Development -- Security-by-design principles for AI applications including prompt hardening, least-privilege tool access, and secure integration patterns
- Remediation Mapping -- Translating red team findings into specific remediation actions using defense-in-depth, runtime monitoring, rate limiting, and sandboxing strategies
- Advanced Defenses -- Cutting-edge defense techniques including constitutional classifiers, dual-LLM architectures, watermarking detection, and adversarial training
- Lab: Bypassing Guardrails -- Hands-on practice identifying and exploiting weaknesses in common guardrail implementations
Prerequisites
This section is accessible from multiple entry points:
- For red teamers -- Complete the Prompt Injection and Agent Exploitation sections first to understand what defenses are trying to prevent
- For defenders -- Start with Foundations for the necessary AI and security background
- For architects -- Review AI System Architecture to understand the deployment patterns these defenses apply to