Thinking Like a Defender
Mental models for defensive thinking, risk assessment frameworks, defense tradeoffs, and why understanding the defender's perspective makes you a better red teamer.
The best red teamers do not just think like attackers -- they think like the defenders they are trying to outsmart. Understanding how defenders prioritize, the constraints they operate under, and where their mental models have blind spots is what separates opportunistic testing from strategic exploitation.
The Defender's Constraints
Defenders operate under constraints that attackers do not face. These constraints create systematic blind spots:
| Constraint | Impact on Defense | Attacker Opportunity |
|---|---|---|
| Budget | Cannot deploy every tool; must prioritize | Untargeted surfaces are undefended |
| Latency | Heavy filtering adds response time; users complain | Real-time filters must be fast, limiting complexity |
| False positive rate | Blocking legitimate users is costly | Defenders loosen filters, creating gaps |
| Maintenance burden | Rules need updating as attacks evolve | Stale rules miss new techniques |
| User experience | Over-restriction drives users to competitors | Business pressure keeps filters permissive |
| Observability | Cannot inspect everything in real time | Low-and-slow attacks evade detection |
Risk Assessment Frameworks
Defenders use structured frameworks to decide what to protect and how much to invest. Understanding these frameworks tells you what they consider high-priority (well-defended) and low-priority (less defended).
The AI Risk Matrix
Defenders typically assess threats along two axes:
| | Low Likelihood | High Likelihood |
|---|---|---|
| High Impact | Advanced attacks (activation steering, model extraction) -- defended if known, often ignored | Common jailbreaks, prompt injection -- primary defense investment |
| Low Impact | Edge cases, novelty attacks -- rarely defended | Spam, low-quality abuse -- automated defenses |
Red team insight: The highest-value findings live in the "High Impact, Low Likelihood" quadrant. These are the attacks defenders know about in theory but have not prioritized because they seem unlikely. Demonstrating that they are practical changes the risk calculus.
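The quadrant logic above can be sketched as a toy scoring model. The threat names, 1–3 ratings, and weighting formula below are illustrative assumptions, not a standard methodology:

```python
# Toy risk-matrix model: defenders invest roughly in proportion to
# impact x likelihood, so red-team value concentrates in the
# high-impact / low-likelihood quadrant they deprioritize.
# All threat names and ratings are illustrative assumptions.

threats = {
    "common jailbreak":    {"impact": 3, "likelihood": 3},
    "prompt injection":    {"impact": 3, "likelihood": 3},
    "model extraction":    {"impact": 3, "likelihood": 1},
    "activation steering": {"impact": 3, "likelihood": 1},
    "spam abuse":          {"impact": 1, "likelihood": 3},
    "novelty edge case":   {"impact": 1, "likelihood": 1},
}

def red_team_value(t):
    # High impact, inversely weighted by likelihood: defenders have
    # already invested where likelihood is high.
    return t["impact"] * (4 - t["likelihood"])

ranked = sorted(threats, key=lambda n: red_team_value(threats[n]), reverse=True)
print(ranked[0])  # a high-impact, low-likelihood threat ranks first
```

Demonstrating that one of the top-ranked threats is practical is exactly the "changes the risk calculus" move described above.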
STRIDE for AI Systems
Adapted from traditional threat modeling:
| Threat | AI Manifestation | Typical Defense Priority |
|---|---|---|
| Spoofing | Impersonating system prompt authority | Medium |
| Tampering | Modifying model inputs/context | High |
| Repudiation | Lack of audit trails for model actions | Low |
| Information Disclosure | System prompt leakage, training data extraction | Medium-High |
| Denial of Service | Token exhaustion, context window flooding | Medium |
| Elevation of Privilege | Prompt injection gaining tool access | High |
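The STRIDE table can double as a test plan: sort by typical defense priority and probe the least-defended categories first. The numeric priority scores here are an illustrative encoding of the table's Low/Medium/High column, not published values:

```python
# STRIDE-for-AI as data: probe categories with the lowest typical
# defense priority first, since those are most likely under-defended.
# The 1-3 priority scores are illustrative assumptions.

stride = [
    ("Spoofing",               "impersonating system prompt authority",  2.0),
    ("Tampering",              "modifying model inputs/context",         3.0),
    ("Repudiation",            "missing audit trails for model actions", 1.0),
    ("Information Disclosure", "system prompt / training data leakage",  2.5),
    ("Denial of Service",      "token exhaustion, context flooding",     2.0),
    ("Elevation of Privilege", "prompt injection gaining tool access",   3.0),
]

# Ascending priority = descending likelihood of an easy finding.
probe_order = [name for name, _, prio in sorted(stride, key=lambda r: r[2])]
print(probe_order)
```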
The False Positive / False Negative Tradeoff
This is the central tension in AI defense. Every filter has a detection threshold, and moving it in either direction has costs:
| More Strict (lower threshold) | More Permissive (higher threshold) |
|---|---|
| Catches more attacks | Misses more attacks |
| Blocks more legitimate users | Better user experience |
| Higher false positive rate | Higher false negative rate |
| Users complain, business pushes back | Attacks succeed, security team concerned |
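A minimal numeric sketch of this tension: score each input, block everything above a threshold, and watch the two error rates trade off as the threshold moves. The score lists are fabricated for illustration:

```python
# Toy threshold tradeoff: each input gets a risk score in [0, 1];
# the filter blocks anything above the threshold. Scores are
# fabricated for illustration only.

benign  = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70]
attacks = [0.35, 0.45, 0.50, 0.60, 0.65, 0.75, 0.80, 0.85, 0.90, 0.95]

def rates(threshold):
    false_pos = sum(s > threshold for s in benign) / len(benign)     # legit blocked
    false_neg = sum(s <= threshold for s in attacks) / len(attacks)  # attacks missed
    return false_pos, false_neg

# Moving the threshold right trades false positives for false negatives.
for t in (0.3, 0.5, 0.7):
    fp, fn = rates(t)
    print(f"threshold={t:.1f}  FP={fp:.0%}  FN={fn:.0%}")
```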
Exploiting the Tradeoff
Defenders typically tune their filters to achieve a false positive rate below 1%. This means:
- Borderline inputs pass -- payloads that read as even 51% benign will slip through filters tuned for a low false positive rate
- Category boundaries are soft -- content that is ambiguous between categories (educational vs. harmful, fictional vs. instructional) falls through
- Threshold varies by category -- high-stakes categories (CSAM, violence) have strict thresholds; lower-stakes categories (mild profanity, controversial opinions) are more permissive
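The per-category variation can be sketched as a calibration routine: find the lowest threshold that keeps the benign block rate within a false-positive budget, then see what a borderline payload does. The benign score distribution and budgets are illustrative assumptions:

```python
# Sketch: calibrate a per-category threshold to a false-positive budget,
# then check whether a borderline payload slips under it. The benign
# distribution and budgets below are illustrative assumptions.

def calibrate(benign_scores, fp_budget):
    """Lowest threshold keeping the benign block rate within budget."""
    for t in sorted(benign_scores):
        fp = sum(s > t for s in benign_scores) / len(benign_scores)
        if fp <= fp_budget:
            return t
    return max(benign_scores)

# Benign traffic mostly scores low, with a small ambiguous tail.
benign = [0.01] * 50 + [0.10] * 30 + [0.30] * 15 + [0.60] * 4 + [0.90] * 1

strict_t     = calibrate(benign, fp_budget=0.05)  # high-stakes: tolerate 5% FP
permissive_t = calibrate(benign, fp_budget=0.01)  # low-stakes: at most 1% FP

borderline = 0.45  # a "mostly benign-looking" payload
print(borderline > strict_t)      # blocked in the strict category
print(borderline > permissive_t)  # slips through the permissive one
```

The asymmetry is the exploitable part: the same payload that a high-stakes category blocks passes cleanly where the false-positive budget is tight.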
How Defenders Prioritize
Understanding defender priorities tells you where defenses are strong and where they are weak:
Priority 1: Compliance and Legal Risk
Regulated content categories (CSAM, terrorism, regulated advice) receive the most defense investment because failures have legal consequences. These are the hardest categories to bypass.
Priority 2: Brand and Reputation Risk
Content that would embarrass the organization if screenshots were shared on social media. Coverage varies widely -- defenders protect against obvious cases but miss subtle or context-dependent ones.
Priority 3: Misuse Enabling
Preventing the model from helping with harmful activities (weapons, malware, fraud). Coverage is inconsistent because the line between educational and enabling content is subjective.
Priority 4: Data Protection
Preventing leakage of training data, system prompts, or user data. Often under-invested because the risk feels less immediate than content generation risks.
Mental Models for Better Red Teaming
The "What Would I Defend?" Exercise
Before testing a system, ask yourself: if you were the defender, what would you protect first? What would you skip? The things you would skip are likely undefended.
The "Defense Debt" Model
Like technical debt, organizations accumulate defense debt -- known gaps they have not addressed. Common sources:
- Rapid feature development outpacing security review
- New modalities (images, audio) added without corresponding filters
- Tool integrations that bypass the main input pipeline
- Legacy endpoints with older (or no) filtering
The "Last Audited" Heuristic
Defenses degrade over time as attack techniques evolve. The less recently a defense was updated, the more likely it has known bypasses. Ask during scoping: when was the last security review of the AI components?
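The heuristic reduces to a sort: rank attack surfaces by how stale their last review is, with "never reviewed" treated as maximally stale. Component names and dates here are hypothetical:

```python
# "Last audited" as a sortable scoping heuristic. Component names
# and dates are hypothetical.
from datetime import date

last_audited = {
    "chat input filter":     date(2025, 6, 1),
    "image upload pipeline": date(2024, 2, 1),
    "legacy v1 endpoint":    None,  # never reviewed
}

def staleness_days(reviewed, today=date(2025, 9, 1)):
    # Treat "never reviewed" as maximally stale.
    return float("inf") if reviewed is None else (today - reviewed).days

probe_order = sorted(last_audited,
                     key=lambda c: staleness_days(last_audited[c]),
                     reverse=True)
print(probe_order)  # stalest surface first
```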
Writing Findings That Resonate
Understanding the defender's perspective also makes your red team reports more actionable:
Frame findings in business risk terms
Not "I bypassed the input filter" but "An attacker can cause the model to generate content that violates [specific compliance requirement], creating legal exposure."
Acknowledge defense complexity
Show that you understand the tradeoffs. Recommend specific threshold adjustments rather than vague "improve filtering."
Provide reproducible test cases
Defenders need to validate the fix. Include exact payloads, expected vs. actual behavior, and bypass rates across multiple attempts.
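One way to package such a finding, as a sketch: record the exact payload, expected vs. actual behavior, and a bypass rate measured over seeded attempts so the run is reproducible. `filter_stub` below is a hypothetical stand-in for the real system under test:

```python
# Skeleton of a reproducible finding: exact payload, expected vs. actual
# behavior, and a seeded bypass-rate measurement. filter_stub is a
# hypothetical stand-in for the system under test.
import random

def filter_stub(payload: str, rng: random.Random) -> bool:
    """Pretend nondeterministic filter; True means the payload was blocked."""
    return rng.random() < 0.4

def bypass_rate(payload: str, attempts: int = 100, seed: int = 0) -> float:
    rng = random.Random(seed)  # fixed seed so defenders can replay the run
    return sum(not filter_stub(payload, rng) for _ in range(attempts)) / attempts

finding = {
    "payload":  "…exact payload text, verbatim…",  # placeholder, not a real payload
    "expected": "request refused by the input filter",
    "actual":   "filter bypassed; harmful completion generated",
    "bypass_rate": bypass_rate("…exact payload text, verbatim…"),
}
print(f"bypass rate: {finding['bypass_rate']:.0%} over 100 attempts")
```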
Map to their existing framework
If they use OWASP Top 10 for LLMs, map your findings to those categories. Speak their language.
Further Reading
- Understanding AI Defenses -- categories and the attacker-defender asymmetry
- Red Team Findings to Remediation -- translating findings into fixes
- Threat Modeling for AI Systems -- structured approaches to identifying attack surfaces
Related Topics
- Understanding AI Defenses - Categories of defense and the attacker-defender asymmetry
- The AI Defense Landscape - Tools and vendors in each defense category
- Red Team Findings to Remediation - Translating offensive findings into defensive recommendations
- AI Threat Models - Structured threat modeling frameworks for AI systems
- Defense-in-Depth for LLM Apps - Layered defense strategy and where gaps appear
References
- "OWASP Top 10 for LLM Applications" - OWASP (2025) - Industry-standard risk framework that defenders use to prioritize, making it essential for red teamers to understand
- "NIST AI Risk Management Framework (AI RMF 1.0)" - NIST (2023) - Risk assessment framework used by enterprise defenders to evaluate AI system threats
- "The Precision-Recall Trade-off in Machine Learning" - Saito & Rehmsmeier (2015) - Foundational work on the classification threshold tradeoff that directly applies to content filter tuning
- "Threat Modeling: Designing for Security" - Shostack, Adam (2014) - The STRIDE framework adapted for AI systems in this page's risk assessment section
A defender has tuned their content filter to a 0.5% false positive rate. As a red teamer, how should this inform your payload design?