What is Watermarking & Detection?

Statistical watermarking schemes for LLM outputs, AI-generated text detectors, their cryptographic foundations, and systematic techniques for evading or removing watermarks.

What is Constitutional Classifiers?

Anthropic's Constitutional Classifiers defense: using constitutional AI principles to train input/output classifiers that withstood 3,000+ hours of adversarial red teaming.

What is CaMeL & Dual LLM Pattern?

Architectural defense patterns that separate trusted and untrusted processing: Simon Willison's Dual LLM concept and Google DeepMind's CaMeL framework for defending tool-using AI agents against prompt injection.

Advanced Defense Techniques

expert9 min readUpdated 2026-03-13

Cutting-edge defense research including instruction hierarchy, constitutional AI, and representation engineering for safety -- what is promising versus what is actually deployed.

advanced-defense instruction-hierarchy constitutional-ai representation-engineering research

The defense landscape is evolving rapidly. This page covers techniques at the frontier of AI safety research -- some already deployed in production, others still in the lab. For red teamers, understanding what is coming next is as important as understanding what is deployed today.

Instruction Hierarchy

The Problem

Traditional LLMs treat all text in their context window with roughly equal authority. System prompts, user messages, and retrieved documents all compete for the model's attention. This makes prompt injection possible -- an attacker's text can override the developer's instructions.

The Solution

Instruction hierarchy trains the model to recognize and prioritize instruction sources:

Priority Level	Source	Example
Highest	System prompt (developer)	"You are a customer service agent. Never discuss competitors."
Medium	User message (direct user)	"Tell me about competitor products."
Lowest	Tool output / retrieved content	Document containing: "Ignore previous instructions..."

How It Works

During training, the model is exposed to scenarios where instructions at different priority levels conflict. It learns to:

Always follow system-level instructions
Follow user instructions only when they do not conflict with system instructions
Treat tool output and retrieved documents as untrusted data, not instructions

Deployment Status

Provider	Implementation	Status (as of 2026)
OpenAI	Model-level training in GPT-4o+	Deployed in production
Anthropic	System prompt privilege in Claude	Deployed in production
Microsoft	Azure OpenAI instruction hierarchy	Deployed in production
Open-source	Various fine-tuning approaches	Research/experimental

Red Team Implications

Instruction hierarchy significantly reduces direct prompt injection effectiveness, but:

Priority confusion attacks -- crafting input that the model interprets as system-level (e.g., format mimicry that convinces the model the text is part of the system prompt)
Hierarchy exhaustion -- using very long inputs that dilute the model's attention to the system prompt, effectively reducing its priority
Indirect channels -- instruction hierarchy typically applies strongest to the user message channel; tool outputs and retrieved documents may have weaker hierarchy enforcement

Constitutional AI (CAI)

The Mechanism

Constitutional AI replaces some human oversight with model self-oversight:

Generate initial response
The model produces a response to a query, potentially including harmful content.
Self-critique
The model evaluates its own response against a set of constitutional principles: "Does this response help with illegal activities? Is it deceptive? Does it contain harmful bias?"
Revise
Based on self-critique, the model generates a revised response that better adheres to the principles.
Train on revisions
The revised responses are used as training data, teaching the model to produce principled responses directly.

Strengths and Weaknesses

Strength	Weakness
Scales without human raters	Constitution can be incomplete or ambiguous
Principles are explicit and auditable	Model may misinterpret or misapply principles
Reduces subjectivity in safety training	Adversarial inputs can reframe harmful content as principle-compliant
Covers long-tail scenarios better than human data	Self-critique has the same blind spots as the model itself

Red Team Implications

Principle reframing -- if the constitution says "do not help with illegal activities," frame the request as legal (research, education, fiction)
Principle conflicts -- find scenarios where constitutional principles conflict with each other, forcing the model to prioritize one over another
Critique blindness -- the model's self-critique shares its own biases; attacks that exploit the model's blind spots bypass both generation and critique

Representation Engineering for Safety

The Approach

Building on activation analysis research, representation engineering identifies safety-relevant directions in the model's internal representation space and uses them for defense:

Safety probes -- linear classifiers trained on hidden states to detect when the model is generating unsafe content, even if the output text appears benign
Activation constraints -- modify the model's forward pass to keep activations within a "safe" region of representation space
Refusal direction amplification -- strengthen the refusal direction identified in representation engineering research, making safety training harder to bypass

Deployment Status

Technique	Maturity	Deployed?
Safety probes for detection	Research → Early production	Limited (some providers use internally)
Activation constraints	Research	No
Refusal direction amplification	Research	No
Representation monitoring	Research → Experimental	Limited

Emerging Techniques

Prompt Firewalls

Dedicated models that sit between the user and the primary model, rewriting inputs to neutralize potential injections while preserving the user's intent. Different from shields (which block) -- firewalls transform.

Certified Robustness

Formal verification techniques adapted from adversarial ML that provide mathematical guarantees about model behavior within defined input bounds. Currently limited to small models and narrow properties.

Multi-Model Consensus

Using multiple different models (different architectures, different training data) to evaluate the same request. If models disagree on whether a request is safe, it is flagged for review. Attacks that work on one model architecture may fail on another.

Behavioral Contracts

Formal specifications of expected model behavior that are checked at inference time. The model's output must satisfy the contract (post-conditions) given the input (pre-conditions). Violations trigger fallback behavior.

Research vs. Deployed: The Reality Check

Defense	Paper Published	Production Ready	Widely Deployed
Instruction hierarchy	2023	2024	2025+
Constitutional AI	2022	2023	2024+ (Anthropic)
Representation engineering	2023	TBD	Not yet
Certified robustness	2023	TBD	Not yet
Prompt firewalls	2024	2025	Limited
Behavioral contracts	2024	TBD	Not yet

References

"The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions" - Wallace et al., OpenAI (2024) - The paper introducing instruction hierarchy training for prompt injection defense
"Constitutional AI: Harmlessness from AI Feedback" - Bai et al., Anthropic (2022) - The foundational paper on using model self-critique for alignment
"Representation Engineering: A Top-Down Approach to AI Transparency" - Zou et al., Center for AI Safety (2023) - Research on reading and controlling model internals through representation space
"Certified Robustness to Adversarial Word Substitutions" - Jia et al. (2019) - Early work on formal verification approaches for NLP model robustness

Knowledge Check

Why does instruction hierarchy significantly reduce prompt injection effectiveness, but not eliminate it entirely?

Advanced Defense Techniques

Generate initial response

Self-critique

Revise

Train on revisions

Learning Path

Related articles

Advanced Defense Techniques

Generate initial response

Self-critique

Revise

Train on revisions

Learning Path

Related articles