Thinking Like a Defender
Mental models for defensive thinking, risk assessment frameworks, defense tradeoffs, and why understanding the defender's perspective makes you a better red teamer.
The best red teamers do not just think like attackers -- they think like the defenders they are trying to outsmart. Understanding how defenders prioritize, the constraints they operate under, and where their mental models have blind spots is what separates opportunistic testing from strategic exploitation.
The Defender's Constraints
Defenders operate under constraints that attackers do not face. These constraints create systematic blind spots:
| Constraint | Impact on Defense | Attacker Opportunity |
|---|---|---|
| Budget | Cannot deploy every tool; must prioritize | Untargeted surfaces are undefended |
| Latency | Heavy filtering adds response time; users complain | Real-time filters must be fast, limiting complexity |
| False positive rate | Blocking legitimate users is costly | Defenders loosen filters, creating gaps |
| Maintenance burden | Rules need updating as attacks evolve | Stale rules miss new techniques |
| User experience | Over-restriction drives users to competitors | Business pressure keeps filters permissive |
| Observability | Cannot inspect everything in real time | Low-and-slow attacks evade detection |
Risk Assessment Frameworks
Defenders use structured frameworks to decide what to protect and how much to invest. Understanding these frameworks tells you what they consider high-priority (well-defended) and low-priority (less defended).
The AI Risk Matrix
Defenders typically assess threats along two axes:
| | Low Likelihood | High Likelihood |
|---|---|---|
| High Impact | Advanced attacks (activation steering, model extraction) -- defended if known, often ignored | Common jailbreaks, prompt injection -- primary defense investment |
| Low Impact | Edge cases, novelty attacks -- rarely defended | Spam, low-quality abuse -- automated defenses |
Red team insight: The highest-value findings live in the "High Impact, Low Likelihood" quadrant. These are the attacks defenders know about in theory but have not prioritized because they seem unlikely. Demonstrating that they are practical changes the risk calculus.
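The quadrant logic above can be sketched as a toy scoring model. The threat names, 1–3 ratings, and weighting formula below are illustrative assumptions, not a standard methodology:

```python
# Toy risk-matrix model: defenders invest roughly in proportion to
# impact x likelihood, so red-team value concentrates in the
# high-impact / low-likelihood quadrant they deprioritize.
# All threat names and ratings are illustrative assumptions.

threats = {
    "common jailbreak":    {"impact": 3, "likelihood": 3},
    "prompt injection":    {"impact": 3, "likelihood": 3},
    "model extraction":    {"impact": 3, "likelihood": 1},
    "activation steering": {"impact": 3, "likelihood": 1},
    "spam abuse":          {"impact": 1, "likelihood": 3},
    "novelty edge case":   {"impact": 1, "likelihood": 1},
}

def red_team_value(t):
    # High impact, inversely weighted by likelihood: defenders have
    # already invested where likelihood is high.
    return t["impact"] * (4 - t["likelihood"])

ranked = sorted(threats, key=lambda n: red_team_value(threats[n]), reverse=True)
print(ranked[0])  # a high-impact, low-likelihood threat ranks first
```

Demonstrating that one of the top-ranked threats is practical is exactly the "changes the risk calculus" move described above.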
STRIDE for AI Systems
Adapted from traditional threat modeling:
| Threat | AI Manifestation | Typical Defense Priority |
|---|---|---|
| Spoofing | Impersonating system prompt authority | Medium |
| Tampering | Modifying model inputs/context | High |
| Repudiation | Lack of audit trails for model actions | Low |
| Information Disclosure | System prompt leakage, training data extraction | Medium-High |
| Denial of Service | Token exhaustion, context window flooding | Medium |
| Elevation of Privilege | Prompt injection gaining tool access | High |
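The STRIDE table can double as a test plan: sort by typical defense priority and probe the least-defended categories first. The numeric priority scores here are an illustrative encoding of the table's Low/Medium/High column, not published values:

```python
# STRIDE-for-AI as data: probe categories with the lowest typical
# defense priority first, since those are most likely under-defended.
# The 1-3 priority scores are illustrative assumptions.

stride = [
    ("Spoofing",               "impersonating system prompt authority",  2.0),
    ("Tampering",              "modifying model inputs/context",         3.0),
    ("Repudiation",            "missing audit trails for model actions", 1.0),
    ("Information Disclosure", "system prompt / training data leakage",  2.5),
    ("Denial of Service",      "token exhaustion, context flooding",     2.0),
    ("Elevation of Privilege", "prompt injection gaining tool access",   3.0),
]

# Ascending priority = descending likelihood of an easy finding.
probe_order = [name for name, _, prio in sorted(stride, key=lambda r: r[2])]
print(probe_order)
```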
The False Positive / False Negative Tradeoff
This is the central tension in AI defense. Every filter has a detection threshold, and moving it in either direction has costs:
| More Strict (lower threshold) | More Permissive (higher threshold) |
|---|---|
| Catches more attacks | Misses more attacks |
| Blocks more legitimate users | Better user experience |
| Higher false positive rate | Higher false negative rate |
| Users complain, business pushes back | Attacks succeed, security team concerned |
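A minimal numeric sketch of this tension: score each input, block everything above a threshold, and watch the two error rates trade off as the threshold moves. The score lists are fabricated for illustration:

```python
# Toy threshold tradeoff: each input gets a risk score in [0, 1];
# the filter blocks anything above the threshold. Scores are
# fabricated for illustration only.

benign  = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70]
attacks = [0.35, 0.45, 0.50, 0.60, 0.65, 0.75, 0.80, 0.85, 0.90, 0.95]

def rates(threshold):
    false_pos = sum(s > threshold for s in benign) / len(benign)     # legit blocked
    false_neg = sum(s <= threshold for s in attacks) / len(attacks)  # attacks missed
    return false_pos, false_neg

# Moving the threshold right trades false positives for false negatives.
for t in (0.3, 0.5, 0.7):
    fp, fn = rates(t)
    print(f"threshold={t:.1f}  FP={fp:.0%}  FN={fn:.0%}")
```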
Exploiting the Tradeoff
Defenders typically tune their filters to achieve a false positive rate below 1%. This means:
- Borderline inputs pass -- payloads that read as even 51% benign will slip through filters tuned for a low false positive rate
- Category boundaries are soft -- content that is ambiguous between categories (educational vs. harmful, fictional vs. instructional) falls through
- Threshold varies by category -- high-stakes categories (CSAM, violence) have strict thresholds; lower-stakes categories (mild profanity, controversial opinions) are more permissive
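The per-category variation can be sketched as a calibration routine: find the lowest threshold that keeps the benign block rate within a false-positive budget, then see what a borderline payload does. The benign score distribution and budgets are illustrative assumptions:

```python
# Sketch: calibrate a per-category threshold to a false-positive budget,
# then check whether a borderline payload slips under it. The benign
# distribution and budgets below are illustrative assumptions.

def calibrate(benign_scores, fp_budget):
    """Lowest threshold keeping the benign block rate within budget."""
    for t in sorted(benign_scores):
        fp = sum(s > t for s in benign_scores) / len(benign_scores)
        if fp <= fp_budget:
            return t
    return max(benign_scores)

# Benign traffic mostly scores low, with a small ambiguous tail.
benign = [0.01] * 50 + [0.10] * 30 + [0.30] * 15 + [0.60] * 4 + [0.90] * 1

strict_t     = calibrate(benign, fp_budget=0.05)  # high-stakes: tolerate 5% FP
permissive_t = calibrate(benign, fp_budget=0.01)  # low-stakes: at most 1% FP

borderline = 0.45  # a "mostly benign-looking" payload
print(borderline > strict_t)      # blocked in the strict category
print(borderline > permissive_t)  # slips through the permissive one
```

The asymmetry is the exploitable part: the same payload that a high-stakes category blocks passes cleanly where the false-positive budget is tight.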
How Defenders Prioritize
Understanding defender priorities tells you where defenses are strong and where they are weak:
Priority 1: Compliance and Legal Risk
Regulated content categories (CSAM, terrorism, regulated advice) receive the most defense investment because failures have legal consequences. These are the hardest categories to bypass.
Priority 2: Brand and Reputation Risk
Content that would embarrass the organization if screenshots were shared on social media. Coverage varies widely -- defenders protect against obvious cases but miss subtle or context-dependent ones.
Priority 3: Misuse Enabling
Preventing the model from helping with harmful activities (weapons, malware, fraud). Coverage is inconsistent because the line between educational and enabling content is subjective.
Priority 4: Data Protection
Preventing leakage of training data, system prompts, or user data. Often under-invested because the risk feels less immediate than content generation risks.
Mental Models for Better Red Teaming
The "What Would I Defend?" Exercise
Before testing a system, ask yourself: if you were the defender, what would you protect first? What would you skip? The things you would skip are likely undefended.
The "Defense Debt" Model
Like technical debt, organizations accumulate defense debt -- known gaps they have not addressed. Common sources:
- Rapid feature development outpacing security review
- New modalities (images, audio) added without corresponding filters
- Tool integrations that bypass the main input pipeline
- Legacy endpoints with older (or no) filtering
The "Last Audited" Heuristic
Defenses degrade over time as attack techniques evolve. The less recently a defense was updated, the more likely it has known bypasses. Ask during scoping: when was the last security review of the AI components?
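The heuristic reduces to a sort: rank attack surfaces by how stale their last review is, with "never reviewed" treated as maximally stale. Component names and dates here are hypothetical:

```python
# "Last audited" as a sortable scoping heuristic. Component names
# and dates are hypothetical.
from datetime import date

last_audited = {
    "chat input filter":     date(2025, 6, 1),
    "image upload pipeline": date(2024, 2, 1),
    "legacy v1 endpoint":    None,  # never reviewed
}

def staleness_days(reviewed, today=date(2025, 9, 1)):
    # Treat "never reviewed" as maximally stale.
    return float("inf") if reviewed is None else (today - reviewed).days

probe_order = sorted(last_audited,
                     key=lambda c: staleness_days(last_audited[c]),
                     reverse=True)
print(probe_order)  # stalest surface first
```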
Writing Findings That Resonate
Understanding the defender's perspective also makes your red team reports more actionable:
Frame findings in business risk terms
Not "I bypassed the input filter" but "An attacker can cause the model to generate content that violates [specific compliance requirement], creating legal exposure."
Acknowledge defense complexity
Show that you understand the tradeoffs. Recommend specific threshold adjustments rather than vague "improve filtering."
Provide reproducible test cases
Defenders need to validate the fix. Include exact payloads, expected vs. actual behavior, and bypass rates across multiple attempts.
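One way to package such a finding, as a sketch: record the exact payload, expected vs. actual behavior, and a bypass rate measured over seeded attempts so the run is reproducible. `filter_stub` below is a hypothetical stand-in for the real system under test:

```python
# Skeleton of a reproducible finding: exact payload, expected vs. actual
# behavior, and a seeded bypass-rate measurement. filter_stub is a
# hypothetical stand-in for the system under test.
import random

def filter_stub(payload: str, rng: random.Random) -> bool:
    """Pretend nondeterministic filter; True means the payload was blocked."""
    return rng.random() < 0.4

def bypass_rate(payload: str, attempts: int = 100, seed: int = 0) -> float:
    rng = random.Random(seed)  # fixed seed so defenders can replay the run
    return sum(not filter_stub(payload, rng) for _ in range(attempts)) / attempts

finding = {
    "payload":  "…exact payload text, verbatim…",  # placeholder, not a real payload
    "expected": "request refused by the input filter",
    "actual":   "filter bypassed; harmful completion generated",
    "bypass_rate": bypass_rate("…exact payload text, verbatim…"),
}
print(f"bypass rate: {finding['bypass_rate']:.0%} over 100 attempts")
```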
Map to their existing framework
If they use OWASP Top 10 for LLMs, map your findings to those categories. Speak their language.
Further Reading
- Understanding AI Defenses -- categories and the attacker-defender asymmetry
- Red Team Findings to Remediation -- translating findings into fixes
- Threat Modeling for AI Systems -- structured approaches to identifying attack surfaces
Related Topics
- Understanding AI Defenses - Categories of defense and the attacker-defender asymmetry
- The AI Defense Landscape - Tools and vendors in each defense category
- Red Team Findings to Remediation - Translating offensive findings into defensive recommendations
- AI Threat Models - Structured threat modeling frameworks for AI systems
- Defense-in-Depth for LLM Apps - Layered defense strategy and where gaps appear
References
- "OWASP Top 10 for LLM Applications" - OWASP (2025) - Industry-standard risk framework that defenders use to prioritize, making it essential for red teamers to understand
- "NIST AI Risk Management Framework (AI RMF 1.0)" - NIST (2023) - Risk assessment framework used by enterprise defenders to evaluate AI system threats
- "The Precision-Recall Trade-off in Machine Learning" - Saito & Rehmsmeier (2015) - Foundational work on the classification threshold tradeoff that directly applies to content filter tuning
- "Threat Modeling: Designing for Security" - Shostack, Adam (2014) - The STRIDE framework adapted for AI systems in this page's risk assessment section
A defender has tuned their content filter to a 0.5% false positive rate. As a red teamer, how should this inform your payload design?