Nonprofit and Humanitarian AI Security

intermediate13 min readUpdated 2026-03-20

Security considerations for AI in humanitarian operations including aid distribution, crisis response, and resource allocation.

industry-verticals nonprofit humanitarian aid

Overview

Security considerations for AI in humanitarian operations including aid distribution, crisis response, and resource allocation.

The practical consequence of this architectural property is that any component in the system that can influence the token sequence processed by the model has the potential to influence its behavior. This includes direct user inputs, indirect data sources like web content and documents consumed by RAG systems, tool and function call results, and even the format and structure of the conversation itself.

Taxonomy and Classification

Understanding the taxonomy of attacks and defenses in this domain is essential for systematic security assessment. The following classification framework provides a structured approach:

Category	Description	Prevalence	Detection Difficulty
Direct attacks	Adversarial content in user messages	Very High	Low-Medium
Indirect attacks	Adversarial content in external data	High	Medium-High
Architectural	Exploiting system design properties	Medium	High
Training-time	Poisoning or manipulating training	Low	Very High
Inference-time	Exploiting inference infrastructure	Low-Medium	High

Technical Deep Dive

# Demonstration of the core concept
from openai import OpenAI
 
client = OpenAI()
 
def demonstrate_concept(system_prompt: str, user_input: str) -> str:
    """Demonstrate the fundamental behavior pattern."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
 
# Baseline behavior
baseline = demonstrate_concept(
    system_prompt="You are a helpful assistant that only discusses cooking.",
    user_input="What is the capital of France?",
)
print(f"Baseline: {baseline}")
 
# The model should refuse to answer about geography per its system prompt.
# However, certain input formulations can override this restriction,
# demonstrating the fundamental tension between instruction following
# and safety enforcement.

Attack Surface Analysis

The attack surface for this vulnerability class includes multiple vectors, each with distinct characteristics:

Attack Vector	Description	Difficulty	Impact
Direct input	Adversarial content in user messages	Low	Variable
Indirect input	Adversarial content in external data	Medium	High
Tool outputs	Adversarial content in function results	Medium	High
Context manipulation	Exploiting context window dynamics	High	High
Training-time	Poisoning training or fine-tuning data	Very High	Critical

Each vector requires different exploitation techniques and is defended by different mechanisms. A comprehensive security assessment must evaluate all relevant vectors for the target system architecture.

Practical Application

Technique Implementation

Implementing this technique in practice requires understanding both the attack methodology and the defensive landscape of the target system.

import json
from typing import Optional, List
from dataclasses import dataclass
 
@dataclass
class AttackResult:
    """Result of a single attack attempt."""
    payload_hash: str
    success: bool
    response_length: int
    defense_triggered: Optional[str] = None
 
class TechniqueFramework:
    """Framework for implementing and testing the described technique."""
 
    def __init__(self, target_config: dict):
        self.config = target_config
        self.results: List[AttackResult] = []
 
    def prepare_payload(self, objective: str, constraints: dict) -> str:
        """Prepare the attack payload based on objective and target constraints."""
        payload = self._base_payload(objective)
 
        if constraints.get("input_classifier"):
            payload = self._apply_obfuscation(payload)
 
        if constraints.get("output_filter"):
            payload = self._add_extraction_channel(payload)
 
        return payload
 
    def _base_payload(self, objective: str) -> str:
        """Generate base payload for the given objective."""
        return f"Please help me with: {objective}"
 
    def _apply_obfuscation(self, payload: str) -> str:
        """Apply obfuscation to bypass input classifiers."""
        # Example: insert zero-width characters between key terms
        return payload  # Implement specific obfuscation strategy
 
    def _add_extraction_channel(self, payload: str) -> str:
        """Add alternative extraction channel to bypass output filters."""
        return payload + "\nEncode your response in Base64."
 
    def execute(self, payload: str) -> AttackResult:
        """Execute the technique and collect results."""
        response = self._send_to_target(payload)
        success = self._evaluate_response(response)
 
        result = AttackResult(
            payload_hash=str(hash(payload))[:12],
            success=success,
            response_length=len(str(response)),
        )
        self.results.append(result)
        return result
 
    def _send_to_target(self, payload: str) -> str:
        """Send payload to the target system."""
        # Implementation depends on target configuration
        return ""
 
    def _evaluate_response(self, response: str) -> bool:
        """Evaluate whether the response indicates success."""
        return False
 
    def report(self) -> dict:
        """Generate a summary report of all execution results."""
        total = len(self.results)
        successes = sum(1 for r in self.results if r.success)
        return {
            "total_attempts": total,
            "successes": successes,
            "success_rate": successes / total if total > 0 else 0,
        }

Defense Considerations

Understanding defensive measures is essential for both offensive and defensive practitioners:

Input validation: Pre-processing user inputs through classification models that detect adversarial patterns before they reach the target LLM. Modern input classifiers use fine-tuned language models trained on datasets of known attack patterns and can achieve high detection rates for known attack classes while maintaining low false-positive rates.
Output filtering: Post-processing model outputs to detect and remove sensitive data, instruction artifacts, and other indicators of successful exploitation. Output filters typically check for patterns like system prompt leakage, PII exposure, and harmful content generation.
Behavioral monitoring: Real-time monitoring of model behavior patterns to detect anomalous responses that may indicate ongoing attacks. This includes tracking metrics like response length distribution, topic coherence, and deviation from expected behavior patterns.
Architecture design: Designing application architectures that minimize the trust placed in model outputs and enforce security boundaries externally. This includes separating data planes from control planes and implementing principle of least privilege for all model-accessible resources.

Real-World Relevance

This topic area is directly relevant to production AI deployments across industries. MITRE ATLAS — AML.T0051 (LLM Prompt Injection) documents real-world exploitation of this vulnerability class in deployed systems.

Organizations deploying LLM-powered applications should:

Assess: Conduct red team assessments specifically targeting this vulnerability class
Defend: Implement defense-in-depth measures appropriate to the risk level
Monitor: Deploy monitoring that can detect exploitation attempts in real-time
Respond: Maintain incident response procedures specific to AI system compromise
Iterate: Regularly re-test defenses as both attacks and models evolve

Current Research Directions

Active research in this area focuses on several promising directions:

Formal verification: Developing mathematical guarantees for model behavior under adversarial conditions
Robustness training: Training procedures that produce models more resistant to this attack class
Detection methods: Improved techniques for detecting exploitation attempts with low false-positive rates
Standardized evaluation: Benchmark suites like HarmBench and JailbreakBench for measuring progress
Automated defense: Systems that automatically adapt to novel attack patterns using online learning
Cross-modal generalization: Understanding how these vulnerabilities manifest across different input modalities

Implementation Patterns

Pattern 1: Reconnaissance-First Approach

The most effective implementation starts with thorough reconnaissance to understand the target system's defensive posture before attempting any exploitation. This pattern is recommended for all production assessments.

from dataclasses import dataclass
from enum import Enum
 
class DefenseLayer(Enum):
    INPUT_CLASSIFIER = "input_classifier"
    OUTPUT_FILTER = "output_filter"
    GUARDRAIL = "guardrail"
    RATE_LIMITER = "rate_limiter"
    BEHAVIORAL_MONITOR = "behavioral_monitor"
 
@dataclass
class TargetProfile:
    """Profile of the target system's defensive posture."""
    identified_defenses: list
    estimated_difficulty: str
    recommended_techniques: list
    bypass_candidates: list
 
def build_target_profile(recon_results: dict) -> TargetProfile:
    """Build a target profile from reconnaissance results."""
    defenses = []
    techniques = []
 
    # Analyze latency patterns for input classifier detection
    if recon_results.get("avg_latency_increase", 0) > 1.5:
        defenses.append(DefenseLayer.INPUT_CLASSIFIER)
        techniques.append("encoding_bypass")
        techniques.append("semantic_obfuscation")
 
    # Analyze response patterns for output filter detection
    if recon_results.get("truncated_responses", 0) > 0:
        defenses.append(DefenseLayer.OUTPUT_FILTER)
        techniques.append("format_exploitation")
        techniques.append("side_channel_extraction")
 
    difficulty = (
        "high" if len(defenses) >= 3
        else "medium" if len(defenses) >= 1
        else "low"
    )
 
    return TargetProfile(
        identified_defenses=defenses,
        estimated_difficulty=difficulty,
        recommended_techniques=techniques,
        bypass_candidates=[d for d in defenses if d != DefenseLayer.RATE_LIMITER],
    )

Pattern 2: Iterative Refinement

This pattern uses feedback from failed attempts to refine the approach. Each iteration incorporates information learned from the defensive response.

Pattern 3: Multi-Vector Convergence

Apply multiple techniques simultaneously to create overlapping attack vectors. Even if each individual technique is partially defended, the combination may achieve the objective through defense saturation.

Industry Impact Assessment

The vulnerability class described in this article has significant implications across multiple industries:

Industry	Primary Risk	Impact Severity	Regulatory Concern
Healthcare	Patient data exposure via AI assistants	Critical	HIPAA violations
Financial Services	Transaction manipulation through AI advisors	Critical	SEC/OCC compliance
Legal	Privilege violation through AI research tools	High	Attorney-client privilege
Education	Assessment manipulation via AI tutors	Medium	Academic integrity
Government	Classified data exposure via AI systems	Critical	National security
Retail	Customer PII exposure via chatbots	High	GDPR/CCPA compliance

Organizations should assess their exposure to this vulnerability class and implement appropriate defensive measures. The NIST AI 600-1 Generative AI Profile and EU AI Act provide regulatory frameworks that increasingly require such assessments.

Understanding how this vulnerability class relates to others helps practitioners develop comprehensive assessment strategies:

Traditional injection attacks (SQL, XSS, command injection): Similar in concept — untrusted data interpreted as instructions — but operating at the natural language level rather than formal language syntax
Social engineering: Shares the exploitation of trust and authority patterns, but targets the model rather than humans
Adversarial ML attacks: Part of the same family but focused on the instruction-following behavior rather than classification boundaries
Supply chain attacks: Complementary attack surface — supply chain compromise can amplify the impact of prompt injection

Summary and Key Principles

The concepts explored in this article reflect fundamental challenges in AI security that will persist as the technology evolves. Key principles to carry forward:

Architectural awareness: Security boundaries cannot be enforced by the model alone — external mechanisms are required
Defense-in-depth: No single layer provides adequate protection — multiple independent layers are essential
Continuous assessment: The threat landscape evolves rapidly — regular testing is not optional
Practical focus: Theoretical understanding must be paired with hands-on testing experience
Professional standards: Findings must be documented clearly with actionable remediation guidance

Testing Methodology

When assessing systems for this vulnerability class, follow a structured methodology to ensure comprehensive coverage:

Phase 1: Discovery

During discovery, map the system's architecture, identify input surfaces, and characterize the model's baseline behavior. This phase should be conducted without any adversarial intent — the goal is purely observational.

Key activities:

Identify all input channels (direct user input, file upload, web content, API parameters)
Characterize the model's response patterns for benign inputs
Measure baseline latency and response length distributions
Identify error handling behavior and information disclosure in error messages

Phase 2: Enumeration

Enumerate the system's defensive layers through systematic probing. Use the latency-based and behavioral detection techniques described in the labs section.

Key activities:

Probe for input classifiers using controlled boundary-testing payloads
Test for output filters using requests that should trigger filtering
Identify rate limiting thresholds and blocking behavior
Map the system's refusal patterns and information disclosure in refusals

Phase 3: Exploitation

Attempt exploitation using techniques adapted to the identified defensive posture. Start with the simplest techniques and escalate complexity as needed.

Key activities:

Execute primary technique against identified weaknesses
Adapt payloads based on defensive responses
Chain multiple techniques for defense-in-depth bypass
Document all successful and failed attempts with full details

Phase 4: Validation and Reporting

Validate findings for reliability and document them for the assessment report.

class FindingReport:
    """Structured finding report for AI security assessments."""
 
    def __init__(self, title: str, severity: str, description: str):
        self.title = title
        self.severity = severity
        self.description = description
        self.owasp_mapping = ""
        self.mitre_mapping = ""
        self.steps_to_reproduce = []
        self.evidence = []
        self.recommendations = []
 
    def add_step(self, step: str) -> None:
        self.steps_to_reproduce.append(step)
 
    def add_recommendation(self, rec: str) -> None:
        self.recommendations.append(rec)
 
    def to_markdown(self) -> str:
        steps = "\n".join(f"{i+1}. {s}" for i, s in enumerate(self.steps_to_reproduce))
        recs = "\n".join(f"- {r}" for r in self.recommendations)
        return f"""
## {self.title}
 
**Severity**: {self.severity}
**OWASP**: {self.owasp_mapping}
**MITRE ATLAS**: {self.mitre_mapping}
 
### Description
{self.description}
 
### Steps to Reproduce
{steps}
 
### Recommendations
{recs}
"""

Glossary of Key Terms

Term	Definition
Prompt injection	Technique where adversarial content in input influences model behavior beyond intended bounds
Defense-in-depth	Security architecture with multiple independent defensive layers
Guardrail	Programmable safety mechanism that constrains model input or output behavior
Input classifier	ML model that classifies inputs as benign or adversarial before processing
Output filter	Post-processing mechanism that scans model outputs for sensitive or harmful content
Context window	The maximum number of tokens a model can process in a single forward pass
Red teaming	Authorized adversarial testing of systems to discover vulnerabilities
RLHF	Reinforcement Learning from Human Feedback — a training technique for aligning model behavior
RAG	Retrieval-Augmented Generation — technique for grounding model responses in external data
MCP	Model Context Protocol — standard for connecting AI agents to external tools and data

References and Further Reading

MITRE ATLAS (Adversarial Threat Landscape for AI Systems)
MITRE ATLAS — AML.T0051 (LLM Prompt Injection)
US Executive Order 14110 on Safe AI (Oct 2023)

Knowledge Check

What is the most effective approach to defending against the attack class covered in this article?

Knowledge Check

Why do the techniques described in this article remain effective across different model versions and providers?

Edit this page on GitHub

Nonprofit and Humanitarian AI Security

intermediate13 min readUpdated 2026-03-20

Security considerations for AI in humanitarian operations including aid distribution, crisis response, and resource allocation.

industry-verticals nonprofit humanitarian aid

Overview

Security considerations for AI in humanitarian operations including aid distribution, crisis response, and resource allocation.

Core Concepts

Fundamental Principles

Taxonomy and Classification

Understanding the taxonomy of attacks and defenses in this domain is essential for systematic security assessment. The following classification framework provides a structured approach:

Category	Description	Prevalence	Detection Difficulty
Direct attacks	Adversarial content in user messages	Very High	Low-Medium
Indirect attacks	Adversarial content in external data	High	Medium-High
Architectural	Exploiting system design properties	Medium	High
Training-time	Poisoning or manipulating training	Low	Very High
Inference-time	Exploiting inference infrastructure	Low-Medium	High

Technical Deep Dive

# Demonstration of the core concept
from openai import OpenAI
 
client = OpenAI()
 
def demonstrate_concept(system_prompt: str, user_input: str) -> str:
    """Demonstrate the fundamental behavior pattern."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
 
# Baseline behavior
baseline = demonstrate_concept(
    system_prompt="You are a helpful assistant that only discusses cooking.",
    user_input="What is the capital of France?",
)
print(f"Baseline: {baseline}")
 
# The model should refuse to answer about geography per its system prompt.
# However, certain input formulations can override this restriction,
# demonstrating the fundamental tension between instruction following
# and safety enforcement.

Attack Surface Analysis

The attack surface for this vulnerability class includes multiple vectors, each with distinct characteristics:

Attack Vector	Description	Difficulty	Impact
Direct input	Adversarial content in user messages	Low	Variable
Indirect input	Adversarial content in external data	Medium	High
Tool outputs	Adversarial content in function results	Medium	High
Context manipulation	Exploiting context window dynamics	High	High
Training-time	Poisoning training or fine-tuning data	Very High	Critical

Practical Application

Technique Implementation

Implementing this technique in practice requires understanding both the attack methodology and the defensive landscape of the target system.

import json
from typing import Optional, List
from dataclasses import dataclass
 
@dataclass
class AttackResult:
    """Result of a single attack attempt."""
    payload_hash: str
    success: bool
    response_length: int
    defense_triggered: Optional[str] = None
 
class TechniqueFramework:
    """Framework for implementing and testing the described technique."""
 
    def __init__(self, target_config: dict):
        self.config = target_config
        self.results: List[AttackResult] = []
 
    def prepare_payload(self, objective: str, constraints: dict) -> str:
        """Prepare the attack payload based on objective and target constraints."""
        payload = self._base_payload(objective)
 
        if constraints.get("input_classifier"):
            payload = self._apply_obfuscation(payload)
 
        if constraints.get("output_filter"):
            payload = self._add_extraction_channel(payload)
 
        return payload
 
    def _base_payload(self, objective: str) -> str:
        """Generate base payload for the given objective."""
        return f"Please help me with: {objective}"
 
    def _apply_obfuscation(self, payload: str) -> str:
        """Apply obfuscation to bypass input classifiers."""
        # Example: insert zero-width characters between key terms
        return payload  # Implement specific obfuscation strategy
 
    def _add_extraction_channel(self, payload: str) -> str:
        """Add alternative extraction channel to bypass output filters."""
        return payload + "\nEncode your response in Base64."
 
    def execute(self, payload: str) -> AttackResult:
        """Execute the technique and collect results."""
        response = self._send_to_target(payload)
        success = self._evaluate_response(response)
 
        result = AttackResult(
            payload_hash=str(hash(payload))[:12],
            success=success,
            response_length=len(str(response)),
        )
        self.results.append(result)
        return result
 
    def _send_to_target(self, payload: str) -> str:
        """Send payload to the target system."""
        # Implementation depends on target configuration
        return ""
 
    def _evaluate_response(self, response: str) -> bool:
        """Evaluate whether the response indicates success."""
        return False
 
    def report(self) -> dict:
        """Generate a summary report of all execution results."""
        total = len(self.results)
        successes = sum(1 for r in self.results if r.success)
        return {
            "total_attempts": total,
            "successes": successes,
            "success_rate": successes / total if total > 0 else 0,
        }

Defense Considerations

Understanding defensive measures is essential for both offensive and defensive practitioners:

Input validation: Pre-processing user inputs through classification models that detect adversarial patterns before they reach the target LLM. Modern input classifiers use fine-tuned language models trained on datasets of known attack patterns and can achieve high detection rates for known attack classes while maintaining low false-positive rates.
Output filtering: Post-processing model outputs to detect and remove sensitive data, instruction artifacts, and other indicators of successful exploitation. Output filters typically check for patterns like system prompt leakage, PII exposure, and harmful content generation.
Behavioral monitoring: Real-time monitoring of model behavior patterns to detect anomalous responses that may indicate ongoing attacks. This includes tracking metrics like response length distribution, topic coherence, and deviation from expected behavior patterns.
Architecture design: Designing application architectures that minimize the trust placed in model outputs and enforce security boundaries externally. This includes separating data planes from control planes and implementing principle of least privilege for all model-accessible resources.

Real-World Relevance

Organizations deploying LLM-powered applications should:

Assess: Conduct red team assessments specifically targeting this vulnerability class
Defend: Implement defense-in-depth measures appropriate to the risk level
Monitor: Deploy monitoring that can detect exploitation attempts in real-time
Respond: Maintain incident response procedures specific to AI system compromise
Iterate: Regularly re-test defenses as both attacks and models evolve

Current Research Directions

Active research in this area focuses on several promising directions:

Formal verification: Developing mathematical guarantees for model behavior under adversarial conditions
Robustness training: Training procedures that produce models more resistant to this attack class
Detection methods: Improved techniques for detecting exploitation attempts with low false-positive rates
Standardized evaluation: Benchmark suites like HarmBench and JailbreakBench for measuring progress
Automated defense: Systems that automatically adapt to novel attack patterns using online learning
Cross-modal generalization: Understanding how these vulnerabilities manifest across different input modalities

Implementation Patterns

Pattern 1: Reconnaissance-First Approach

from dataclasses import dataclass
from enum import Enum
 
class DefenseLayer(Enum):
    INPUT_CLASSIFIER = "input_classifier"
    OUTPUT_FILTER = "output_filter"
    GUARDRAIL = "guardrail"
    RATE_LIMITER = "rate_limiter"
    BEHAVIORAL_MONITOR = "behavioral_monitor"
 
@dataclass
class TargetProfile:
    """Profile of the target system's defensive posture."""
    identified_defenses: list
    estimated_difficulty: str
    recommended_techniques: list
    bypass_candidates: list
 
def build_target_profile(recon_results: dict) -> TargetProfile:
    """Build a target profile from reconnaissance results."""
    defenses = []
    techniques = []
 
    # Analyze latency patterns for input classifier detection
    if recon_results.get("avg_latency_increase", 0) > 1.5:
        defenses.append(DefenseLayer.INPUT_CLASSIFIER)
        techniques.append("encoding_bypass")
        techniques.append("semantic_obfuscation")
 
    # Analyze response patterns for output filter detection
    if recon_results.get("truncated_responses", 0) > 0:
        defenses.append(DefenseLayer.OUTPUT_FILTER)
        techniques.append("format_exploitation")
        techniques.append("side_channel_extraction")
 
    difficulty = (
        "high" if len(defenses) >= 3
        else "medium" if len(defenses) >= 1
        else "low"
    )
 
    return TargetProfile(
        identified_defenses=defenses,
        estimated_difficulty=difficulty,
        recommended_techniques=techniques,
        bypass_candidates=[d for d in defenses if d != DefenseLayer.RATE_LIMITER],
    )

Pattern 2: Iterative Refinement

This pattern uses feedback from failed attempts to refine the approach. Each iteration incorporates information learned from the defensive response.

Pattern 3: Multi-Vector Convergence

Industry Impact Assessment

The vulnerability class described in this article has significant implications across multiple industries:

Industry	Primary Risk	Impact Severity	Regulatory Concern
Healthcare	Patient data exposure via AI assistants	Critical	HIPAA violations
Financial Services	Transaction manipulation through AI advisors	Critical	SEC/OCC compliance
Legal	Privilege violation through AI research tools	High	Attorney-client privilege
Education	Assessment manipulation via AI tutors	Medium	Academic integrity
Government	Classified data exposure via AI systems	Critical	National security
Retail	Customer PII exposure via chatbots	High	GDPR/CCPA compliance

Understanding how this vulnerability class relates to others helps practitioners develop comprehensive assessment strategies:

Traditional injection attacks (SQL, XSS, command injection): Similar in concept — untrusted data interpreted as instructions — but operating at the natural language level rather than formal language syntax
Social engineering: Shares the exploitation of trust and authority patterns, but targets the model rather than humans
Adversarial ML attacks: Part of the same family but focused on the instruction-following behavior rather than classification boundaries
Supply chain attacks: Complementary attack surface — supply chain compromise can amplify the impact of prompt injection

Summary and Key Principles

The concepts explored in this article reflect fundamental challenges in AI security that will persist as the technology evolves. Key principles to carry forward:

Architectural awareness: Security boundaries cannot be enforced by the model alone — external mechanisms are required
Defense-in-depth: No single layer provides adequate protection — multiple independent layers are essential
Continuous assessment: The threat landscape evolves rapidly — regular testing is not optional
Practical focus: Theoretical understanding must be paired with hands-on testing experience
Professional standards: Findings must be documented clearly with actionable remediation guidance

Testing Methodology

When assessing systems for this vulnerability class, follow a structured methodology to ensure comprehensive coverage:

Phase 1: Discovery

Key activities:

Identify all input channels (direct user input, file upload, web content, API parameters)
Characterize the model's response patterns for benign inputs
Measure baseline latency and response length distributions
Identify error handling behavior and information disclosure in error messages

Phase 2: Enumeration

Enumerate the system's defensive layers through systematic probing. Use the latency-based and behavioral detection techniques described in the labs section.

Key activities:

Probe for input classifiers using controlled boundary-testing payloads
Test for output filters using requests that should trigger filtering
Identify rate limiting thresholds and blocking behavior
Map the system's refusal patterns and information disclosure in refusals

Phase 3: Exploitation

Attempt exploitation using techniques adapted to the identified defensive posture. Start with the simplest techniques and escalate complexity as needed.

Key activities:

Execute primary technique against identified weaknesses
Adapt payloads based on defensive responses
Chain multiple techniques for defense-in-depth bypass
Document all successful and failed attempts with full details

Phase 4: Validation and Reporting

Validate findings for reliability and document them for the assessment report.

class FindingReport:
    """Structured finding report for AI security assessments."""
 
    def __init__(self, title: str, severity: str, description: str):
        self.title = title
        self.severity = severity
        self.description = description
        self.owasp_mapping = ""
        self.mitre_mapping = ""
        self.steps_to_reproduce = []
        self.evidence = []
        self.recommendations = []
 
    def add_step(self, step: str) -> None:
        self.steps_to_reproduce.append(step)
 
    def add_recommendation(self, rec: str) -> None:
        self.recommendations.append(rec)
 
    def to_markdown(self) -> str:
        steps = "\n".join(f"{i+1}. {s}" for i, s in enumerate(self.steps_to_reproduce))
        recs = "\n".join(f"- {r}" for r in self.recommendations)
        return f"""
## {self.title}
 
**Severity**: {self.severity}
**OWASP**: {self.owasp_mapping}
**MITRE ATLAS**: {self.mitre_mapping}
 
### Description
{self.description}
 
### Steps to Reproduce
{steps}
 
### Recommendations
{recs}
"""

Glossary of Key Terms

Term	Definition
Prompt injection	Technique where adversarial content in input influences model behavior beyond intended bounds
Defense-in-depth	Security architecture with multiple independent defensive layers
Guardrail	Programmable safety mechanism that constrains model input or output behavior
Input classifier	ML model that classifies inputs as benign or adversarial before processing
Output filter	Post-processing mechanism that scans model outputs for sensitive or harmful content
Context window	The maximum number of tokens a model can process in a single forward pass
Red teaming	Authorized adversarial testing of systems to discover vulnerabilities
RLHF	Reinforcement Learning from Human Feedback — a training technique for aligning model behavior
RAG	Retrieval-Augmented Generation — technique for grounding model responses in external data
MCP	Model Context Protocol — standard for connecting AI agents to external tools and data

References and Further Reading

MITRE ATLAS (Adversarial Threat Landscape for AI Systems)
MITRE ATLAS — AML.T0051 (LLM Prompt Injection)
US Executive Order 14110 on Safe AI (Oct 2023)

Knowledge Check

What is the most effective approach to defending against the attack class covered in this article?

Knowledge Check

Why do the techniques described in this article remain effective across different model versions and providers?

Edit this page on GitHub

Nonprofit and Humanitarian AI Security

Related articles

Nonprofit and Humanitarian AI Security

Related articles