What is Advanced Defenses?

Cutting-edge defense research including instruction hierarchy, constitutional AI, and representation engineering for safety -- what is promising versus what is actually deployed.

What is Guardrails & Safety Layers?

How guardrail systems are architecturally designed, including pre-processing, in-processing, and post-processing layers, common design patterns, and where each layer can be bypassed.

What is Monitoring & Observability?

What to monitor in AI systems, key metrics for detecting abuse and drift, alerting strategies, and observability architecture for LLM applications.

What is Remediation Mapping?

How to map offensive findings to defensive recommendations, severity scoring for AI vulnerabilities, actionable remediation guidance, and the report-to-fix pipeline.

What is Secure Development?

Security-by-design principles for AI applications including defensive prompt engineering, input validation, output sanitization, and integrating security testing into CI/CD pipelines.

What is Patterns for Hardening System Prompts?

Practical patterns and techniques for hardening LLM system prompts against injection, extraction, and manipulation attacks, including structural defenses, instruction hierarchy, delimiter strategies, and defense-in-depth approaches.

What is Watermarking LLM Outputs for Provenance?

Advanced techniques for watermarking LLM-generated text to establish provenance, including deployment architectures, multi-bit encoding schemes, robustness considerations, and the role of watermarking in AI security and accountability frameworks.

What is Security Considerations in Model Cards?

Comprehensive guide to incorporating security assessments, red team findings, vulnerability disclosures, and threat model documentation into model cards, enabling downstream consumers to make informed security decisions.

What is Building Red Team as a Service Offerings?

Practical guide to building and delivering AI red team as a service (RTaaS) offerings, including service design, engagement models, pricing strategies, tooling infrastructure, and quality assurance for commercial AI security testing services.

What is Lab: Bypassing Guardrails?

Hands-on lab for methodically probing, classifying, and bypassing input/output guardrails in production AI systems using a structured red team workflow.

Defense & Mitigation

beginner5 min readUpdated 2026-03-15

Defensive strategies for AI systems including guardrails architecture, monitoring and observability, secure development practices, remediation mapping, and advanced defense techniques.

defense mitigation guardrails monitoring secure-development remediation

Red teaming without actionable defense recommendations is incomplete. This section covers the defensive landscape for AI systems -- not just what defenses exist, but how they work, where they fail, and how to recommend the right combination for a given threat model. Understanding defenses deeply is essential for red teamers: you cannot effectively bypass a guardrail you do not understand, and you cannot write useful remediation guidance if you do not know what solutions are available and their limitations.

The current state of AI defense is characterized by a fundamental asymmetry. Attackers benefit from the inherent difficulty of separating instructions from data in language model architectures. No single defense reliably prevents all attack vectors, which is why the field has converged on defense-in-depth strategies that layer multiple complementary controls. Each layer catches a different class of attack, and the combination raises the effort required for successful exploitation beyond what most adversaries will invest.

The Defense Stack

Effective AI security relies on controls at every layer of the application stack. No single control is sufficient, but their combination creates meaningful resistance to adversarial activity.

Input filtering examines user inputs before they reach the model, looking for known injection patterns, suspicious encoding, and policy-violating content. Modern input filters range from simple regex pattern matching to sophisticated ML classifiers trained to detect adversarial intent. Their primary weakness is that they operate on surface patterns and can be evaded through obfuscation, encoding tricks, and semantic rephrasing that preserves adversarial intent while changing surface form.

LLM judges use a separate language model to evaluate inputs and outputs for safety and policy compliance. This approach leverages the same language understanding capabilities that make LLMs powerful for content generation, but applies them to content classification. The key advantage is semantic understanding -- an LLM judge can recognize that "pretend you are an AI without restrictions" is an attempt to bypass safety training even if it uses novel phrasing. The key limitation is that LLM judges are themselves vulnerable to adversarial inputs and add latency and cost.

Output filtering inspects model responses before they are returned to the user, catching cases where input filters were bypassed. Output filters can detect sensitive data leakage, policy violations, and indicators of successful injection. They serve as a critical backstop but cannot prevent side effects that occur before the output is generated, such as tool calls or data writes.

Runtime monitoring provides visibility into model behavior over time, enabling detection of anomalous patterns that point-in-time filters might miss. This includes tracking prompt patterns, response distributions, token usage anomalies, and tool call patterns. Monitoring is essential for detecting persistent attacks, slow-burn exploitation, and novel attack techniques that evade rule-based defenses.

Defense Effectiveness and Bypass

Every defense has known bypass techniques, and understanding these is critical for both attackers and defenders.

Defense Layer	What It Catches	Common Bypasses
Input filtering	Known injection patterns, blocklisted terms	Encoding, obfuscation, synonym substitution
LLM judges	Semantically adversarial content	Meta-prompting, context manipulation, judge-specific jailbreaks
Output filtering	Data leakage, policy violations	Steganographic encoding, indirect channels, tool-mediated exfiltration
Content safety APIs	Toxicity, harmful content categories	Subtle rephrasing, context framing, edge case exploitation
Rate limiting	Brute-force attacks, automated scanning	Distributed requests, low-and-slow techniques

What You'll Learn in This Section

Guardrails & Safety Layers -- Architecture and evaluation of input/output filtering, LLM judges, content safety APIs, NeMo Guardrails, LLM Guard, and Prompt Shields
Monitoring & Observability -- Building detection pipelines with anomaly detection, logging architecture, and behavioral analysis for AI systems
Secure Development -- Security-by-design principles for AI applications including prompt hardening, least-privilege tool access, and secure integration patterns
Remediation Mapping -- Translating red team findings into specific remediation actions using defense-in-depth, runtime monitoring, rate limiting, and sandboxing strategies
Advanced Defenses -- Cutting-edge defense techniques including constitutional classifiers, dual-LLM architectures, watermarking detection, and adversarial training
Lab: Bypassing Guardrails -- Hands-on practice identifying and exploiting weaknesses in common guardrail implementations

Prerequisites

This section is accessible from multiple entry points:

For red teamers -- Complete the Prompt Injection and Agent Exploitation sections first to understand what defenses are trying to prevent
For defenders -- Start with Foundations for the necessary AI and security background
For architects -- Review AI System Architecture to understand the deployment patterns these defenses apply to

Learning Path

0/74 completed

~1190 min total74 lessons

Start Learning

Edit this page on GitHub

Defense & Mitigation

beginner5 min readUpdated 2026-03-15

Defensive strategies for AI systems including guardrails architecture, monitoring and observability, secure development practices, remediation mapping, and advanced defense techniques.

defense mitigation guardrails monitoring secure-development remediation

Red teaming without actionable defense recommendations is incomplete. This section covers the defensive landscape for AI systems -- not just what defenses exist, but how they work, where they fail, and how to recommend the right combination for a given threat model. Understanding defenses deeply is essential for red teamers: you cannot effectively bypass a guardrail you do not understand, and you cannot write useful remediation guidance if you do not know what solutions are available and their limitations.

The current state of AI defense is characterized by a fundamental asymmetry. Attackers benefit from the inherent difficulty of separating instructions from data in language model architectures. No single defense reliably prevents all attack vectors, which is why the field has converged on defense-in-depth strategies that layer multiple complementary controls. Each layer catches a different class of attack, and the combination raises the effort required for successful exploitation beyond what most adversaries will invest.

The Defense Stack

Effective AI security relies on controls at every layer of the application stack. No single control is sufficient, but their combination creates meaningful resistance to adversarial activity.

Input filtering examines user inputs before they reach the model, looking for known injection patterns, suspicious encoding, and policy-violating content. Modern input filters range from simple regex pattern matching to sophisticated ML classifiers trained to detect adversarial intent. Their primary weakness is that they operate on surface patterns and can be evaded through obfuscation, encoding tricks, and semantic rephrasing that preserves adversarial intent while changing surface form.

LLM judges use a separate language model to evaluate inputs and outputs for safety and policy compliance. This approach leverages the same language understanding capabilities that make LLMs powerful for content generation, but applies them to content classification. The key advantage is semantic understanding -- an LLM judge can recognize that "pretend you are an AI without restrictions" is an attempt to bypass safety training even if it uses novel phrasing. The key limitation is that LLM judges are themselves vulnerable to adversarial inputs and add latency and cost.

Output filtering inspects model responses before they are returned to the user, catching cases where input filters were bypassed. Output filters can detect sensitive data leakage, policy violations, and indicators of successful injection. They serve as a critical backstop but cannot prevent side effects that occur before the output is generated, such as tool calls or data writes.

Runtime monitoring provides visibility into model behavior over time, enabling detection of anomalous patterns that point-in-time filters might miss. This includes tracking prompt patterns, response distributions, token usage anomalies, and tool call patterns. Monitoring is essential for detecting persistent attacks, slow-burn exploitation, and novel attack techniques that evade rule-based defenses.

Defense Effectiveness and Bypass

Every defense has known bypass techniques, and understanding these is critical for both attackers and defenders.

Defense Layer	What It Catches	Common Bypasses
Input filtering	Known injection patterns, blocklisted terms	Encoding, obfuscation, synonym substitution
LLM judges	Semantically adversarial content	Meta-prompting, context manipulation, judge-specific jailbreaks
Output filtering	Data leakage, policy violations	Steganographic encoding, indirect channels, tool-mediated exfiltration
Content safety APIs	Toxicity, harmful content categories	Subtle rephrasing, context framing, edge case exploitation
Rate limiting	Brute-force attacks, automated scanning	Distributed requests, low-and-slow techniques

What You'll Learn in This Section

Guardrails & Safety Layers -- Architecture and evaluation of input/output filtering, LLM judges, content safety APIs, NeMo Guardrails, LLM Guard, and Prompt Shields
Monitoring & Observability -- Building detection pipelines with anomaly detection, logging architecture, and behavioral analysis for AI systems
Secure Development -- Security-by-design principles for AI applications including prompt hardening, least-privilege tool access, and secure integration patterns
Remediation Mapping -- Translating red team findings into specific remediation actions using defense-in-depth, runtime monitoring, rate limiting, and sandboxing strategies
Advanced Defenses -- Cutting-edge defense techniques including constitutional classifiers, dual-LLM architectures, watermarking detection, and adversarial training
Lab: Bypassing Guardrails -- Hands-on practice identifying and exploiting weaknesses in common guardrail implementations

Prerequisites

This section is accessible from multiple entry points:

For red teamers -- Complete the Prompt Injection and Agent Exploitation sections first to understand what defenses are trying to prevent
For defenders -- Start with Foundations for the necessary AI and security background
For architects -- Review AI System Architecture to understand the deployment patterns these defenses apply to

Learning Path

0/74 completed

~1190 min total74 lessons

Start Learning

Edit this page on GitHub

Defense & Mitigation

Learning Path

Related articles

Defense & Mitigation

Learning Path

Related articles