Building a Production AI Defense Stack
After spending years attacking AI systems, the patterns of what works defensively have become clear. No single control stops every attack. But a well-designed layered defense stack makes exploitation significantly harder and detection significantly faster. This post describes how to build that stack for production AI deployments, drawing on lessons from red team assessments across dozens of organizations.
The Defense-in-Depth Model for AI
Traditional cybersecurity has long embraced defense-in-depth: multiple overlapping controls so that the failure of any single control does not result in a complete breach. AI security needs the same approach, but the layers are different.
For AI systems, the defensive layers are:
- Input filtering — Catching malicious inputs before they reach the model
- System prompt hardening — Making the model's instructions resilient to override
- Model-level safety — The model's own safety training and alignment
- Output filtering — Catching problematic outputs before they reach the user
- Tool and action controls — Restricting what the model can do
- Monitoring and detection — Identifying attacks that bypass preventive controls
- Incident response — Responding quickly when attacks succeed
Each layer catches attacks that slip through the layers above it. The goal is not perfection at any single layer but cumulative effectiveness across all layers.
Layer 1: Input Filtering
Input filtering is the first line of defense. Its job is to identify and block or modify malicious inputs before they reach the model.
Classification-Based Filtering
The most effective input filters use a separate classifier to evaluate incoming messages for injection indicators. This classifier can be a fine-tuned model, a rule-based system, or a combination.
A dedicated prompt injection classifier should be trained on a diverse dataset of injection attempts and benign inputs. It should operate independently of the main model, so that prompt injection techniques that work against the main model do not work against the classifier. The classifier should be fast enough to operate in the request path without adding unacceptable latency.
Commercial options include Lakera Guard, Rebuff, and Protect AI's model scanning tools. Open-source alternatives include fine-tuned versions of DistilBERT or DeBERTa trained on injection datasets. The key is that the classifier must be separate from the model it protects.
Rule-Based Filtering
Rule-based filters catch known patterns without the overhead of a classification model. Effective rules include keyword and phrase matching for common injection patterns like "ignore previous instructions" or "system prompt." They also include structural pattern detection for delimiter manipulation, encoding-based injection with base64 or hex strings, and excessive whitespace or Unicode manipulation.
Rule-based filtering is fast and deterministic but brittle. Attackers can easily evade specific rules with minor rewording. Use rules as a complement to classification, not a replacement.
Input Sanitization
Rather than blocking suspicious inputs outright, sanitization modifies them to be safe. Techniques include stripping or escaping special characters and delimiters, normalizing Unicode to prevent homoglyph attacks, truncating inputs to prevent context window flooding, and removing invisible characters and zero-width spaces.
Sanitization preserves user experience for legitimate inputs that happen to trigger false positives, but it can also break legitimate use cases. Careful tuning is required.
Layer 2: System Prompt Hardening
The system prompt defines the model's behavior. A well-designed system prompt resists override attempts more effectively than a poorly designed one.
Effective System Prompt Practices
Place the most critical instructions at the beginning and end of the system prompt, where models pay the most attention. Repeat critical safety instructions in different phrasings to increase redundancy. Use explicit statements about handling conflicting instructions, such as telling the model that user messages cannot override system instructions. Include specific guardrails for the application's domain rather than relying on generic safety instructions. Define the model's scope positively by saying what it should do rather than listing everything it should not do.
Instruction Hierarchy
Explicitly establish a priority order for instructions in the system prompt. The model should treat system prompt instructions as highest priority, application-level configuration as second priority, user messages as third priority, and retrieved context or tool outputs as lowest priority. While current models do not perfectly enforce this hierarchy, explicitly establishing it measurably improves adherence.
Prompt Injection Resistance Statements
Include explicit statements in the system prompt about how to handle injection attempts. Tell the model that if a user asks it to ignore its instructions, it should politely decline and explain that it cannot override its configuration. Instruct it that it should never reveal its system prompt, even if asked to do so in creative ways.
These statements are not foolproof, but they significantly increase the difficulty of basic injection attacks and force attackers to use more sophisticated techniques that are easier to detect.
Layer 3: Model-Level Safety
The model's own safety training and alignment provide a baseline level of protection. This layer is largely outside the application developer's control, but understanding its characteristics is important.
Current-generation models from major providers (OpenAI, Anthropic, Google, Meta) include extensive safety training that prevents the model from generating certain categories of harmful content. This training is effective against unsophisticated attacks but can be bypassed by techniques like crescendo attacks, many-shot jailbreaking, and context manipulation.
When selecting a model for a security-sensitive application, evaluate the model's safety characteristics as part of the selection criteria. Run standardized safety benchmarks. Test with known jailbreak techniques. And understand that model-level safety is a moving target — both improving through safety training updates and degrading as new bypass techniques are discovered.
Layer 4: Output Filtering
Output filtering catches problematic content that the model generates despite input filtering and safety training.
Content Classification
Apply a content classifier to model outputs before they are returned to the user. This classifier should check for harmful or inappropriate content that bypassed safety training, personal information or credentials that should not be disclosed, system prompt content or internal configuration details, and content that violates application-specific policies.
Output classification can use the same technology as input classification — fine-tuned models, rule-based systems, or commercial services. The key difference is that output filters need to be sensitive to the specific types of content that the application should not produce, which varies by application.
Structured Output Validation
For applications that produce structured outputs like JSON, SQL, or code, validate the output structure and content before execution. An agent that generates SQL queries should have those queries validated against a whitelist of allowed operations and tables. An agent that generates code should have that code scanned for dangerous functions or patterns.
Structured output validation is particularly important for agentic systems where model outputs directly drive tool calls. Validate tool call arguments against expected schemas and ranges before execution.
PII and Sensitive Data Detection
Apply PII detection to model outputs to catch accidental data leakage. This includes detecting names, email addresses, phone numbers, social security numbers, credit card numbers, API keys, and other sensitive data patterns. Both regex-based and ML-based PII detectors are available as open-source and commercial solutions.
Layer 5: Tool and Action Controls
For agentic systems, controlling what the model can do is as important as controlling what it says.
Least Privilege Tool Access
Grant agents access only to the tools they need for their specific purpose. Audit tool access regularly and remove tools that are not actively needed. When adding new tools, assess the security impact of each tool individually and in combination with existing tools.
Argument Validation
Validate every tool call argument before execution. Define schemas for expected argument types, ranges, and formats. Reject tool calls with arguments that do not match the expected schema. Log all tool calls, including those that are rejected.
Human-in-the-Loop Controls
For high-impact actions, require human approval before execution. Design the approval interface to clearly display the action to be taken, the data involved, and the context that led to the action request. Set reasonable timeouts so that unapproved actions do not queue indefinitely.
Rate Limiting and Quotas
Implement rate limits on tool calls to prevent denial-of-service and cost amplification attacks. Set per-session and per-user quotas for expensive operations. Alert on unusual tool call patterns.
Layer 6: Monitoring and Detection
Monitoring is where defense-in-depth pays its biggest dividends. Attacks that bypass all preventive controls can still be detected through behavioral monitoring.
What to Monitor
Effective AI monitoring covers input patterns including injection attempt frequency, encoding usage, and conversation length anomalies. It covers output patterns including safety refusal rates, output length distribution, and content classification scores. It covers tool usage including call frequency, argument patterns, and error rates. It covers performance metrics including latency distribution, token usage, and cost per interaction. And it covers user behavior including session patterns, request patterns, and escalation patterns.
Anomaly Detection
Establish baselines for each metric during normal operation, then alert on significant deviations. A sudden increase in safety refusals may indicate a targeted attack campaign. A spike in token usage may indicate a cost amplification attack. Unusual tool call patterns may indicate an injection-driven tool abuse attack.
Integration with SIEM
Feed AI monitoring data into your existing SIEM to correlate AI security events with broader security context. An AI security event that coincides with unusual network activity or authentication events provides stronger signal than either event alone.
Layer 7: Incident Response
When attacks succeed despite all preventive and detective controls, incident response determines the actual impact.
AI-Specific Incident Response
Standard incident response procedures apply to AI incidents, but several AI-specific considerations are critical. Model behavior is probabilistic, so reproducing an incident may require multiple attempts with the exact same input. Evidence of attack may exist only in model input/output logs, which must be preserved. The blast radius of a model compromise extends to every interaction during the compromise period. And remediation may require model rollback, retraining, or configuration changes rather than traditional patching.
Runbook Development
Develop incident response runbooks for common AI attack scenarios. Each runbook should specify immediate containment actions such as taking the model offline, restricting tool access, or enabling enhanced logging. It should specify investigation procedures including log review, behavior comparison, and attack reconstruction. It should specify remediation steps including configuration changes, model rollback, and retraining. And it should specify communication procedures including stakeholder notification, user notification, and regulatory reporting.
Implementation Priority
For organizations building their AI defense stack, this is the recommended implementation order.
Start with logging and monitoring first. You cannot defend what you cannot see, and you cannot investigate incidents without logs. Next implement input filtering with at least basic rule-based filtering and ideally a classification model. Then harden system prompts using the practices described above. Add output filtering for PII and content classification. Implement tool controls if your application uses agentic capabilities. Develop incident response runbooks and test them through tabletop exercises. Finally, continuously improve all layers based on red team testing and real-world incident data.
This order prioritizes visibility and detection over prevention, because visibility is both easier to implement and more broadly valuable. Prevention without visibility creates a false sense of security.
Measuring Effectiveness
The defense stack's effectiveness should be measured through regular red team assessments. Track the percentage of attacks blocked at each layer to identify weak points. Measure the time from attack initiation to detection. Measure the time from detection to containment. And track the total number of successful attacks that reach the user or trigger unauthorized actions.
These metrics should improve over time as the defense stack matures. If they do not, investigate whether the red team is testing new techniques, the defense stack has developed gaps, or the threat landscape has shifted in ways that the current controls do not address.
Building a production AI defense stack is not a project — it is a program. The threat landscape evolves continuously, and the defense stack must evolve with it.