What is Input/Output Filtering?

Deep dive into regex, ML classifier, and embedding-based filters for both input scanning and output scanning, with systematic bypass techniques for each type.

What is Prompt Shields?

How Azure Prompt Shield and dedicated injection detection models work, their detection patterns based on fine-tuned classifiers, and systematic approaches to bypassing them.

What is LLM-as-Judge?

How LLM-as-judge architectures evaluate other LLM outputs for safety, including sequential and parallel designs, judge prompt engineering, and techniques for attacking judge models.

What is Content Safety APIs?

Detailed comparison of Azure Content Safety, OpenAI Moderation API, and Google Cloud safety offerings, including API structures, category taxonomies, severity levels, testing methodology, and common gaps.

What is NeMo Guardrails?

Architecture, configuration, Colang programming, integration patterns, and bypass techniques for NVIDIA's open-source NeMo Guardrails framework.

Input/output scanning, PII detection, toxicity filtering, integration patterns, and bypass techniques for LLM Guard and the Protect AI Guardian ecosystem.

Guardrails & Safety Layer Architecture

advanced7 min readUpdated 2026-03-13

How guardrail systems are architecturally designed, including pre-processing, in-processing, and post-processing layers, common design patterns, and where each layer can be bypassed.

guardrails architecture safety-layers pre-processing post-processing bypass

Guardrail systems sit around the LLM like layers of armor. Understanding their architecture -- how they are structured, how data flows through them, and where the seams are -- is essential for systematic bypass.

The Three-Layer Architecture

Almost all production guardrail systems follow a three-layer pattern:

                    ┌──────────────────┐
   User Input ────→ │  PRE-PROCESSING  │ ──→ Block/Allow
                    │  (Input Guards)  │
                    └────────┬─────────┘
                             │ (allowed)
                    ┌────────▼─────────┐
                    │  IN-PROCESSING   │
                    │  (Model + Prompt │ ──→ Generation
                    │   Constraints)   │
                    └────────┬─────────┘
                             │ (generated)
                    ┌────────▼─────────┐
                    │ POST-PROCESSING  │ ──→ Block/Redact/Allow
                    │ (Output Guards)  │
                    └──────────────────┘

Layer 1: Pre-Processing

Pre-processing guards analyze user input before it reaches the model. They operate on the raw text (or multimodal input) and decide whether to allow, modify, or block the request.

Guard Type	Detection Method	Latency	Coverage
Blocklist/regex	Pattern matching	<1ms	Low -- easily evaded
ML classifier	Trained injection detector	10-50ms	Medium -- misses novel attacks
Embedding similarity	Cosine distance to known attacks	5-20ms	Medium -- paraphrase-sensitive
LLM-based shield	Secondary LLM evaluates input	100-500ms	High -- but expensive and slow

Bypass focus: Encoding, obfuscation, semantic paraphrasing, multi-part payloads that individually appear benign. See Input/Output Filtering Systems.

Layer 2: In-Processing

In-processing defenses operate during model inference. They are part of the model's own behavior and the prompt engineering around it.

System prompt instructions -- behavioral constraints written into the prompt
Instruction hierarchy -- model training that prioritizes system instructions over user input
Sampling constraints -- temperature limits, logit bias, stop sequences
Token budgets -- limiting output length to prevent large-scale exfiltration

Bypass focus: Prompt injection, jailbreaking, context manipulation. These attacks target the model's decision-making rather than external filters.

Layer 3: Post-Processing

Post-processing guards analyze the model's output before returning it to the user.

Content classifiers -- ML models that detect harmful content in the response
PII detection -- regex and NER models that catch personal data leakage
Format validators -- ensure output matches expected schema (JSON, specific fields)
LLM judges -- secondary LLM evaluates whether the response violates policy

Bypass focus: Encoding output (base64, rot13), using metaphor or fiction framing, splitting harmful content across multiple responses, exploiting category gaps in classifiers.

Design Patterns

Pattern 1: Sequential Pipeline

Each guard runs in sequence. If any guard blocks, the request is rejected.

def process_request(user_input: str) -> str:
    # Pre-processing
    if not input_classifier.is_safe(user_input):
        return "I cannot process this request."
    if not prompt_shield.check(user_input):
        return "I cannot process this request."
 
    # In-processing
    response = llm.generate(system_prompt + user_input)
 
    # Post-processing
    if not output_classifier.is_safe(response):
        return "I cannot provide that information."
    response = pii_redactor.redact(response)
 
    return response

Weakness: A single bypass at any pre-processing layer allows the input through. Post-processing becomes the only remaining defense.

Pattern 2: Parallel Evaluation

Multiple guards evaluate simultaneously. The request proceeds only if all guards agree.

async def process_request(user_input: str) -> str:
    # Run all pre-processing guards in parallel
    results = await asyncio.gather(
        input_classifier.is_safe(user_input),
        prompt_shield.check(user_input),
        embedding_scanner.check(user_input),
    )
    if not all(results):
        return "I cannot process this request."
 
    response = await llm.generate(system_prompt + user_input)
 
    # Run all post-processing guards in parallel
    post_results = await asyncio.gather(
        output_classifier.is_safe(response),
        pii_detector.check(response),
        policy_judge.evaluate(response),
    )
    if not all(post_results):
        return "I cannot provide that information."
 
    return response

Weakness: All guards must be bypassed simultaneously, but they share similar detection paradigms. An input that evades one ML classifier often evades others trained on similar data.

Pattern 3: Tiered Escalation

Fast cheap guards run first; expensive guards only run if initial checks are ambiguous.

def process_request(user_input: str) -> str:
    # Tier 1: Fast checks (regex, blocklist)
    risk_score = fast_scanner.score(user_input)
    if risk_score > HIGH_THRESHOLD:
        return block_response()
 
    # Tier 2: ML classifier (only if tier 1 is uncertain)
    if risk_score > LOW_THRESHOLD:
        if not ml_classifier.is_safe(user_input):
            return block_response()
 
    # Tier 3: LLM judge (only for high-risk scores that passed tier 2)
    if risk_score > MEDIUM_THRESHOLD:
        if not llm_judge.approve(user_input):
            return block_response()
 
    return llm.generate(system_prompt + user_input)

Weakness: Inputs that score below LOW_THRESHOLD bypass all classification entirely. Red teamers should craft "boring-looking" payloads that fly under tier 1.

Synchronous vs. Asynchronous Defenses

Aspect	Synchronous	Asynchronous
When it runs	Inline, blocking the response	After the response is sent
Latency impact	Adds to response time	None
Can block output	Yes	No (only flag for review)
Red team implication	Must be bypassed to get output	Output is delivered; only detection matters

Common Architectural Weaknesses

Inconsistent coverage across endpoints -- the main chat endpoint has guardrails, but the completion endpoint or function-calling endpoint does not
Tool calls bypass output filters -- the model's tool call arguments are not filtered the same way as its text output
Streaming bypasses post-processing -- in streaming mode, post-processing may not see the complete output before chunks are sent to the user
Multi-modal gaps -- text input is filtered but image, audio, or file inputs are not
Session state is not tracked -- guards evaluate each message independently, missing multi-turn escalation patterns

References

"NVIDIA NeMo Guardrails: Programmable Safety" - NVIDIA (2024) - Documentation for the open-source guardrails framework implementing dialog-flow-based safety rails
"Guardrails AI: Output Validation Framework" - Guardrails AI (2025) - Schema-based output validation framework for LLM applications
"LLM Guard: Comprehensive Input/Output Scanning" - ProtectAI (2025) - Open-source toolkit implementing multi-layer input and output scanning
"Rebuff: Self-Hardening Prompt Injection Detector" - Rebuff (2024) - Multi-layer prompt injection detection combining heuristic, LLM, and embedding approaches

Knowledge Check

In a tiered escalation guardrail system, what is the most effective bypass strategy?

Guardrails & Safety Layer Architecture

The Three-Layer Architecture

Layer 1: Pre-Processing

Layer 2: In-Processing

Layer 3: Post-Processing

Design Patterns

Pattern 1: Sequential Pipeline

Pattern 2: Parallel Evaluation

Pattern 3: Tiered Escalation

Synchronous vs. Asynchronous Defenses

Common Architectural Weaknesses

Further Reading

References

Learning Path

Guardrails & Safety Layer Architecture

The Three-Layer Architecture

Layer 1: Pre-Processing

Layer 2: In-Processing

Layer 3: Post-Processing

Design Patterns

Pattern 1: Sequential Pipeline

Pattern 2: Parallel Evaluation

Pattern 3: Tiered Escalation

Synchronous vs. Asynchronous Defenses

Common Architectural Weaknesses

Further Reading

References

Learning Path

Guardrails & Safety Layer Architecture

Learning Path

Related articles

Guardrails & Safety Layer Architecture

Learning Path

Related articles