Guardrails & Safety Layer Architecture
How guardrail systems are architecturally designed, including pre-processing, in-processing, and post-processing layers, common design patterns, and where each layer can be bypassed.
Guardrail systems sit around the LLM like layers of armor. Understanding their architecture -- how they are structured, how data flows through them, and where the seams are -- is essential for systematic bypass.
The Three-Layer Architecture
Almost all production guardrail systems follow a three-layer pattern:
┌──────────────────┐
User Input ────→ │ PRE-PROCESSING │ ──→ Block/Allow
│ (Input Guards) │
└────────┬─────────┘
│ (allowed)
┌────────▼─────────┐
│ IN-PROCESSING │
│ (Model + Prompt │ ──→ Generation
│ Constraints) │
└────────┬─────────┘
│ (generated)
┌────────▼─────────┐
│ POST-PROCESSING │ ──→ Block/Redact/Allow
│ (Output Guards) │
└──────────────────┘
Layer 1: Pre-Processing
Pre-processing guards analyze user input before it reaches the model. They operate on the raw text (or multimodal input) and decide whether to allow, modify, or block the request.
| Guard Type | Detection Method | Latency | Coverage |
|---|---|---|---|
| Blocklist/regex | Pattern matching | <1ms | Low -- easily evaded |
| ML classifier | Trained injection detector | 10-50ms | Medium -- misses novel attacks |
| Embedding similarity | Cosine distance to known attacks | 5-20ms | Medium -- paraphrase-sensitive |
| LLM-based shield | Secondary LLM evaluates input | 100-500ms | High -- but expensive and slow |
Bypass focus: Encoding, obfuscation, semantic paraphrasing, multi-part payloads that individually appear benign. See Input/Output Filtering Systems.
Layer 2: In-Processing
In-processing defenses operate during model inference. They are part of the model's own behavior and the prompt engineering around it.
- System prompt instructions -- behavioral constraints written into the prompt
- Instruction hierarchy -- model training that prioritizes system instructions over user input
- Sampling constraints -- temperature limits, logit bias, stop sequences
- Token budgets -- limiting output length to prevent large-scale exfiltration
Bypass focus: Prompt injection, jailbreaking, context manipulation. These attacks target the model's decision-making rather than external filters.
Layer 3: Post-Processing
Post-processing guards analyze the model's output before returning it to the user.
- Content classifiers -- ML models that detect harmful content in the response
- PII detection -- regex and NER models that catch personal data leakage
- Format validators -- ensure output matches expected schema (JSON, specific fields)
- LLM judges -- secondary LLM evaluates whether the response violates policy
Bypass focus: Encoding output (base64, rot13), using metaphor or fiction framing, splitting harmful content across multiple responses, exploiting category gaps in classifiers.
Design Patterns
Pattern 1: Sequential Pipeline
Each guard runs in sequence. If any guard blocks, the request is rejected.
def process_request(user_input: str) -> str:
# Pre-processing
if not input_classifier.is_safe(user_input):
return "I cannot process this request."
if not prompt_shield.check(user_input):
return "I cannot process this request."
# In-processing
response = llm.generate(system_prompt + user_input)
# Post-processing
if not output_classifier.is_safe(response):
return "I cannot provide that information."
response = pii_redactor.redact(response)
return responseWeakness: A single bypass at any pre-processing layer allows the input through. Post-processing becomes the only remaining defense.
Pattern 2: Parallel Evaluation
Multiple guards evaluate simultaneously. The request proceeds only if all guards agree.
async def process_request(user_input: str) -> str:
# Run all pre-processing guards in parallel
results = await asyncio.gather(
input_classifier.is_safe(user_input),
prompt_shield.check(user_input),
embedding_scanner.check(user_input),
)
if not all(results):
return "I cannot process this request."
response = await llm.generate(system_prompt + user_input)
# Run all post-processing guards in parallel
post_results = await asyncio.gather(
output_classifier.is_safe(response),
pii_detector.check(response),
policy_judge.evaluate(response),
)
if not all(post_results):
return "I cannot provide that information."
return responseWeakness: All guards must be bypassed simultaneously, but they share similar detection paradigms. An input that evades one ML classifier often evades others trained on similar data.
Pattern 3: Tiered Escalation
Fast cheap guards run first; expensive guards only run if initial checks are ambiguous.
def process_request(user_input: str) -> str:
# Tier 1: Fast checks (regex, blocklist)
risk_score = fast_scanner.score(user_input)
if risk_score > HIGH_THRESHOLD:
return block_response()
# Tier 2: ML classifier (only if tier 1 is uncertain)
if risk_score > LOW_THRESHOLD:
if not ml_classifier.is_safe(user_input):
return block_response()
# Tier 3: LLM judge (only for high-risk scores that passed tier 2)
if risk_score > MEDIUM_THRESHOLD:
if not llm_judge.approve(user_input):
return block_response()
return llm.generate(system_prompt + user_input)Weakness: Inputs that score below LOW_THRESHOLD bypass all classification entirely. Red teamers should craft "boring-looking" payloads that fly under tier 1.
Synchronous vs. Asynchronous Defenses
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| When it runs | Inline, blocking the response | After the response is sent |
| Latency impact | Adds to response time | None |
| Can block output | Yes | No (only flag for review) |
| Red team implication | Must be bypassed to get output | Output is delivered; only detection matters |
Common Architectural Weaknesses
- Inconsistent coverage across endpoints -- the main chat endpoint has guardrails, but the completion endpoint or function-calling endpoint does not
- Tool calls bypass output filters -- the model's tool call arguments are not filtered the same way as its text output
- Streaming bypasses post-processing -- in streaming mode, post-processing may not see the complete output before chunks are sent to the user
- Multi-modal gaps -- text input is filtered but image, audio, or file inputs are not
- Session state is not tracked -- guards evaluate each message independently, missing multi-turn escalation patterns
Further Reading
- Input/Output Filtering Systems -- deep dive into filter types and bypass techniques
- Prompt Shields & Injection Detection -- dedicated injection detection systems
- LLM-as-Judge Defense Systems -- using LLMs to evaluate other LLMs
- Defense-in-Depth for LLM Apps -- layered defense strategy
Related Topics
- Input/Output Filtering Systems - Deep dive into filter types and bypass techniques
- Prompt Shields & Injection Detection - Dedicated injection detection systems
- LLM-as-Judge Defense Systems - Using LLMs to evaluate other LLMs
- Defense-in-Depth for LLM Apps - Layered defense strategy
- The AI Defense Landscape - Broader view of defense tools and market
References
- "NVIDIA NeMo Guardrails: Programmable Safety" - NVIDIA (2024) - Documentation for the open-source guardrails framework implementing dialog-flow-based safety rails
- "Guardrails AI: Output Validation Framework" - Guardrails AI (2025) - Schema-based output validation framework for LLM applications
- "LLM Guard: Comprehensive Input/Output Scanning" - ProtectAI (2025) - Open-source toolkit implementing multi-layer input and output scanning
- "Rebuff: Self-Hardening Prompt Injection Detector" - Rebuff (2024) - Multi-layer prompt injection detection combining heuristic, LLM, and embedding approaches
In a tiered escalation guardrail system, what is the most effective bypass strategy?