AI Penetration Testing Report Writing
Comprehensive guide to writing effective penetration testing reports for AI system assessments.
Overview
The penetration testing report is the primary deliverable of any security assessment. For AI system assessments, report writing presents unique challenges that traditional pentest reporting guides do not address. AI vulnerabilities are often probabilistic (working some percentage of the time rather than deterministically), context-dependent (requiring specific conversation history or system state), and difficult to categorize using existing vulnerability classification schemes. Findings may involve generated content that is harmful or offensive, raising questions about what to include in evidence.
Many stakeholders also lack the background to understand why a prompt injection or safety bypass matters for their organization.
This article provides a detailed guide to writing AI penetration testing reports that are technically rigorous, clearly communicate risk, and drive remediation action. We cover report structure, finding documentation standards, severity classification for AI-specific vulnerabilities, evidence handling, and common pitfalls that undermine report effectiveness.
Report Structure
Overall Organization
An effective AI pentest report follows a layered structure that allows different audiences to extract the information they need without reading the entire document.
Cover page and document control: Title, client name, assessment dates, report version, classification level, and distribution restrictions. AI pentest reports may contain examples of harmful content, so classification is particularly important.
Table of contents: Essential for reports that routinely exceed 30 pages.
Executive summary (1-3 pages): Written for non-technical leadership. Describes the assessment scope, overall risk level, key findings summarized in business terms, and strategic recommendations. This section should be comprehensible to someone with no technical background. Avoid jargon. Use terms like "the AI system can be manipulated to reveal customer data" rather than "the target is susceptible to indirect prompt injection via RAG document poisoning enabling PII exfiltration."
Assessment overview (2-3 pages): Describes what was tested, how it was tested, and any constraints or limitations. Include the scope boundaries, testing methodology, tools used, timeline, and any issues encountered during testing. For AI assessments, explicitly state the model version(s) tested, as findings may not persist across model updates.
Summary of findings (1-2 pages): A table or visual summary showing all findings by severity, category, and status. This gives readers a quick overview before diving into details. Include a finding severity distribution chart and a breakdown by vulnerability category (prompt injection, safety bypass, data leakage, etc.).
Detailed findings (variable length): The core of the report. Each finding gets its own section with a consistent structure described below.
Strategic recommendations (2-3 pages): Higher-level recommendations that go beyond individual finding remediation. These address systemic issues, architectural improvements, and process changes that would improve the organization's AI security posture overall.
Appendices: Complete test logs, methodology details, tool configurations, and supplementary evidence.
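The severity breakdown described in the summary-of-findings section can be generated directly from tracked findings. A minimal sketch, with illustrative finding IDs and ratings:

```python
from collections import Counter

# Illustrative finding list: (finding ID, severity rating).
findings = [
    ("AI-2026-001", "Critical"),
    ("AI-2026-002", "High"),
    ("AI-2026-003", "Critical"),
    ("AI-2026-004", "Medium"),
    ("AI-2026-005", "Low"),
]

# Count findings per severity level for the summary table.
by_severity = Counter(sev for _, sev in findings)
for sev in ("Critical", "High", "Medium", "Low"):
    print(f"{sev:<10} {by_severity.get(sev, 0)}")
```

The same counts feed the severity distribution chart and the per-category breakdown.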
Finding Documentation Template
Each finding should follow a consistent template. Consistency makes findings easier to review, compare, and track through remediation.
Finding title: Clear, specific, and descriptive. "Direct Prompt Injection Bypasses Content Policy in Customer Support Chatbot" is better than "Prompt Injection Vulnerability." The title should communicate what the vulnerability is and where it exists.
Finding ID: A unique identifier for tracking (e.g., AI-2026-001). Use a consistent numbering scheme across engagements to facilitate cross-referencing.
Severity rating: Use a defined severity scale. We recommend a four-level scale (Critical, High, Medium, Low) supplemented with a CVSS-like scoring approach adapted for AI systems (discussed in the severity classification section below).
Vulnerability category: Map to established taxonomies. Use MITRE ATLAS technique IDs (e.g., AML.T0051 for LLM Prompt Injection), OWASP LLM Top 10 identifiers (e.g., LLM01: Prompt Injection), and CWE identifiers where applicable. Multiple categorizations are appropriate and help clients who use different frameworks.
Description: A clear explanation of what the vulnerability is, written for a technical audience that may not be familiar with AI-specific attack techniques. Include enough context that someone unfamiliar with the specific attack class can understand both the mechanism and the risk. Avoid assuming the reader knows what prompt injection, jailbreaking, or model extraction are — define these terms on first use or provide links to explanations.
Evidence: The most critical section for AI findings. Document the complete interaction sequence including:
- The exact prompts or inputs used
- The system's complete responses
- Timestamps and session identifiers if available
- The model version or system version
- The number of attempts and success rate
- Screenshots or screen recordings for interactive testing
- API request/response logs for programmatic testing
For AI findings, context matters enormously. A single prompt/response pair is often insufficient — document the full conversation sequence that leads to the vulnerability, as many AI vulnerabilities require specific conversational context to trigger.
Impact assessment: Describe the concrete business impact of the vulnerability. What can an attacker achieve? What data is at risk? What decisions could be manipulated? Quantify where possible: "An attacker could extract customer email addresses at a rate of approximately 50 per hour through repeated prompt injection" is more actionable than "an attacker could extract some customer data."
Reproduction steps: Numbered, step-by-step instructions that allow someone unfamiliar with the testing to reproduce the finding. For AI findings, include the complete conversation sequence, any required system configuration or state, and the expected behavior versus observed behavior.
Remediation recommendations: Specific, actionable remediation steps. Provide multiple options where possible, ranging from quick mitigations to comprehensive fixes. For each option, note the expected effectiveness, implementation effort, and any limitations.
References: Links to relevant MITRE ATLAS entries, OWASP guidelines, academic papers, or vendor documentation that provide additional context for the finding.
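The template above can be captured as a structured record so that every finding carries the same fields through review and remediation tracking. This is an illustrative sketch; the field names are assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    """Hypothetical structured record mirroring the finding template."""
    finding_id: str               # e.g. "AI-2026-001"
    title: str                    # specific and descriptive
    severity: str                 # Critical / High / Medium / Low
    categories: list[str]         # MITRE ATLAS, OWASP LLM Top 10, CWE IDs
    description: str
    evidence: list[str]           # prompts, responses, logs, screenshot paths
    model_version: str            # findings may not persist across model updates
    attempts: int                 # total exploitation attempts
    successes: int                # successful attempts
    impact: str
    reproduction_steps: list[str]
    remediation: list[str]
    references: list[str] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        """Success rate for probabilistic findings (0.0 if untested)."""
        return self.successes / self.attempts if self.attempts else 0.0
```

A record like this also makes the summary-of-findings table and success-rate reporting mechanical rather than manual.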
Severity Classification for AI Vulnerabilities
The Challenge of AI Severity Rating
Traditional vulnerability severity frameworks like CVSS were designed for deterministic software vulnerabilities. AI vulnerabilities present classification challenges:
Probabilistic exploitation: A prompt injection that works 30% of the time is still a real vulnerability, but how does its severity compare to one that works 95% of the time? Traditional frameworks do not have a concept of exploitation probability in this sense.
Context-dependent impact: The same vulnerability (e.g., safety filter bypass) has radically different severity depending on the system's context. Bypassing content filtering on an internal code review tool is a different risk than bypassing it on a public-facing educational chatbot for children.
Subjective content harm: Some AI findings involve the generation of harmful content. Assessing the severity of a safety bypass requires making judgments about content harm that are inherently more subjective than assessing the severity of, say, a buffer overflow.
Cascading effects in agentic systems: A minor prompt injection in an agentic system might cascade through tool calls to produce significant impact. The initial vulnerability may appear minor, but its exploitable impact through the agent's capabilities may be critical.
Recommended Severity Framework
We recommend a severity framework that evaluates three dimensions and combines them into a final rating:
Dimension 1 — Exploitability (1-4 scale):
- 4: Trivially exploitable by any user with no specialized knowledge
- 3: Exploitable with basic knowledge of AI systems or published attack techniques
- 2: Requires specialized knowledge, custom tooling, or extended interaction
- 1: Requires deep expertise, significant resources, or specific pre-conditions
Dimension 2 — Impact (1-4 scale):
- 4: Unauthorized access to sensitive data, unauthorized actions with material consequences, complete safety bypass enabling generation of dangerous content
- 3: Limited data exposure, partial safety bypass, manipulation of AI decisions that affect users
- 2: Information disclosure of system configuration, inconsistent safety enforcement, quality degradation of AI outputs
- 1: Minor information leakage, cosmetic issues, theoretical risks with limited practical exploitation path
Dimension 3 — Scope (1-3 scale):
- 3: Affects all users or the system at an architectural level
- 2: Affects a significant subset of users or use cases
- 1: Affects only the attacking user's session
Final severity:
- Critical: Exploitability 3-4, Impact 4, any Scope
- High: Exploitability 3-4, Impact 3, or Exploitability 2+, Impact 4, Scope 2+
- Medium: Exploitability 2+, Impact 2+, or lower Exploitability and Impact combinations offset by Scope 3
- Low: Exploitability 1-2, Impact 1-2, Scope 1
This framework is a starting point. Adjust severity based on the client's specific context, regulatory environment, and risk appetite. Always document the reasoning behind severity ratings so clients can apply their own judgment.
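One possible encoding of these combination rules as a scoring helper; the boundary cases are resolved as shown in the comments, and the thresholds should be adjusted to the client's context:

```python
def severity(exploitability: int, impact: int, scope: int) -> str:
    """Combine the three dimensions (Exploitability 1-4, Impact 1-4,
    Scope 1-3) into a final rating per the framework above."""
    # Critical: Exploitability 3-4, Impact 4, any Scope.
    if exploitability >= 3 and impact == 4:
        return "Critical"
    # High: Exploitability 3-4 with Impact 3, or
    # Exploitability 2+ with Impact 4 and Scope 2+.
    if (exploitability >= 3 and impact >= 3) or (
        exploitability >= 2 and impact == 4 and scope >= 2
    ):
        return "High"
    # Medium: Exploitability 2+ with Impact 2+, or lower
    # combinations offset by architectural Scope.
    if (exploitability >= 2 and impact >= 2) or scope == 3:
        return "Medium"
    return "Low"
```

For example, the RAG injection finding later in this article (Exploitability 4, Impact 4, Scope 3) rates Critical under this encoding.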
Handling Probabilistic Findings
For findings with variable success rates, document the success rate prominently and address it in the severity assessment:
"This prompt injection successfully bypassed the content filter in 7 of 20 attempts (35% success rate). While not deterministic, this success rate is sufficient for a motivated attacker to achieve consistent results through repeated attempts. At this success rate, an attacker would expect to achieve a successful bypass within 3-4 attempts on average."
Provide context on why probabilistic findings still matter. Many stakeholders assume that if an attack does not work every time, it is not a real vulnerability. Help them understand that an attacker who can make unlimited attempts (as is typical for API-accessible AI systems) will eventually succeed, and that a 35% success rate means success within seconds of automated testing.
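The arithmetic behind these statements follows from the geometric distribution, assuming attempts are independent:

```python
def p_success_within(p: float, n: int) -> float:
    """Probability of at least one success in n independent attempts,
    each with per-attempt success probability p."""
    return 1 - (1 - p) ** n

def expected_attempts(p: float) -> float:
    """Expected number of attempts until first success
    (mean of the geometric distribution): 1/p."""
    return 1 / p
```

At a 35% per-attempt rate, the expected attempts to first success is 1/0.35, about 2.9, and ten automated attempts succeed with better than 98% probability, which is why "works sometimes" still means "works reliably" for an attacker with API access.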
Evidence Handling
Sensitive Content in Evidence
AI red teaming inevitably produces evidence that contains sensitive content: extracted PII, generated harmful material, bypassed safety filter outputs, and system configuration details. Handle this evidence carefully.
Redaction standards: Redact real PII in evidence screenshots and logs before including them in reports. Replace actual names, email addresses, and other identifiers with clear placeholders like [REDACTED-EMAIL-1]. Redaction should be consistent (the same redacted item uses the same placeholder throughout) but thorough.
Harmful content handling: When documenting safety bypass findings, include enough evidence to demonstrate the vulnerability but do not include gratuitously harmful content. A finding that demonstrates the system can be made to produce instructions for dangerous activities does not need to include complete instructions — a truncated example with a note like "[response continued with detailed instructions - truncated for report]" is sufficient to establish the finding.
Content warnings: If the report necessarily includes disturbing or offensive content as evidence, add content warnings at the beginning of the relevant sections. This is both a professional courtesy and a practical consideration for report reviewers.
Secure transmission: Deliver reports through secure channels (encrypted email, secure file sharing platforms, or hand delivery for highly sensitive assessments). AI pentest reports often combine the sensitivity of traditional pentest reports (system vulnerabilities) with additional sensitive content categories unique to AI assessments.
Evidence Completeness
Incomplete evidence is the most common reason findings are disputed or deprioritized. For every finding:
Prove it is real: Include sufficient evidence that someone who was not present during testing can verify the finding is genuine, not fabricated or misinterpreted.
Prove it is reproducible: Provide steps precise enough that the client's team can reproduce the finding in their own testing environment.
Prove it matters: Connect the technical finding to a concrete business impact. "The system can be made to ignore its safety guidelines" is a technical statement. "An attacker can cause the customer-facing chatbot to recommend dangerous dosage information for medications" is a business impact that will drive remediation.
Writing for Different Audiences
Technical Audience
The client's development and security engineers need enough detail to understand the vulnerability mechanism, reproduce the finding, and implement an effective fix.
Be precise about the mechanism: Explain exactly how the attack works. For prompt injection, include the specific technique (role-play escalation, encoding-based bypass, instruction hierarchy manipulation, etc.) and explain why it succeeds against the current implementation.
Provide remediation at the right level: Do not just say "implement input validation." Specify what validation: "Implement a secondary LLM-based classifier that evaluates user inputs for injection attempts before passing them to the primary model. The classifier should be a separate model instance with its own system prompt focused solely on identifying adversarial inputs. Reference the OWASP LLM Top 10 guidance on prompt injection prevention for implementation patterns."
Reference the right standards: Technical audiences benefit from MITRE ATLAS technique IDs, CWE identifiers, and links to relevant academic papers or tool documentation.
Executive Audience
The executive summary must stand alone as a coherent document for readers who will never read the technical details.
Lead with business risk: "The assessment identified three critical vulnerabilities that could allow external attackers to manipulate the AI system into disclosing customer financial data. Based on the system's user base of approximately 50,000 active users, the potential exposure includes names, account numbers, and transaction histories."
Quantify where possible: Use numbers to communicate scale. Number of vulnerabilities by severity, estimated exposure volume, comparison to industry benchmarks if available.
Provide clear priorities: "We recommend immediate remediation of the three critical findings before expanding the system's user base. The estimated remediation effort for these findings is 2-3 weeks of engineering time."
Avoid jargon: If you must use a technical term, define it parenthetically. "The system is vulnerable to prompt injection (a technique where an attacker crafts special inputs that override the system's instructions)."
Board and Regulatory Audience
Reports or summaries prepared for board-level or regulatory review require additional considerations:
Risk framework alignment: Map findings to the risk frameworks the organization uses (NIST AI RMF, ISO 42001, EU AI Act requirements). State explicitly whether identified vulnerabilities represent compliance gaps.
Trend context: Where possible, contextualize findings within the broader threat landscape. "Prompt injection vulnerabilities of this type are classified as the highest-severity risk category in the OWASP Top 10 for LLM Applications and have been identified in similar systems across the industry."
Maturity assessment: Provide an overall assessment of the organization's AI security maturity relative to industry expectations. This helps boards understand not just the specific findings but the organization's overall preparedness.
Common Report Writing Pitfalls
Pitfalls That Undermine Credibility
Overclassifying severity: Rating every finding as Critical or High destroys the credibility of your severity ratings and makes prioritization impossible. Apply severity consistently and do not be afraid of Medium and Low ratings. If the client has a well-secured system, a report with mostly Medium findings and good architectural recommendations is more valuable than one that inflates severity to appear more impactful.
Generic recommendations: "Implement better input validation" is not actionable. Every recommendation should be specific enough that an engineer could begin implementation based solely on the report. Reference specific tools, libraries, patterns, or frameworks where applicable.
Missing context for AI-unfamiliar readers: Many report readers will be encountering AI security vulnerabilities for the first time. Do not assume familiarity with concepts like prompt injection, token limits, temperature settings, or RLHF. Provide enough context that readers can understand the finding without external research.
Inconsistent terminology: Define your terminology in the methodology section and use it consistently. Do not use "prompt injection," "prompt manipulation," and "input override" interchangeably. Pick one term (preferably the OWASP or MITRE standard term) and use it throughout.
Pitfalls That Undermine Actionability
Findings without reproduction steps: If the client cannot reproduce a finding, they cannot verify remediation. Every finding must include complete, tested reproduction steps.
Remediation without priority: Presenting 30 findings without clear prioritization overwhelms the development team. Group findings by system component or remediation theme, and provide a prioritized remediation roadmap that accounts for dependency relationships between fixes.
Missing root cause analysis: A report that identifies 10 instances of prompt injection without noting that they all share a common root cause (e.g., lack of a secondary classification layer) misses the opportunity to recommend a single architectural fix that addresses all 10 findings. Look for patterns and address root causes, not just symptoms.
Recommendations that ignore constraints: Recommending that the client "retrain the model with adversarial examples" when they are using a third-party API model demonstrates a misunderstanding of their constraints. Tailor recommendations to the client's actual ability to implement them.
Report Review and Quality Assurance
Internal Review Process
Every report should undergo at least one peer review before delivery.
Technical review: A fellow practitioner reviews findings for accuracy, completeness, and reproducibility. They should attempt to reproduce key findings using the documented steps.
Editorial review: A reviewer focused on clarity, consistency, and audience appropriateness. Check that the executive summary is genuinely accessible to non-technical readers, that terminology is consistent, and that the report reads coherently from beginning to end.
Compliance review: For engagements with regulatory context, verify that the report addresses the relevant compliance requirements and uses appropriate framework language.
Pre-Delivery Checklist
Before delivering any AI pentest report, verify:
- All finding reproduction steps have been tested
- PII and sensitive data are appropriately redacted
- Harmful content is minimized to what is necessary for evidence
- Severity ratings are consistent and justified
- Executive summary stands alone and is jargon-free
- Model/system versions are documented
- Success rates are documented for probabilistic findings
- All referenced tools and frameworks include version information
- The report classification level is appropriate
- Delivery method meets the client's security requirements
Templates and Examples
Finding Example
Here is a condensed example of a well-documented AI finding:
Finding ID: AI-2026-003
Title: Indirect Prompt Injection via RAG Knowledge Base Enables Customer Data Extraction
Severity: Critical (Exploitability: 4, Impact: 4, Scope: 3)
Category: MITRE ATLAS AML.T0051.001 (LLM Prompt Injection — Indirect), OWASP LLM01 (Prompt Injection)
Description: The customer support chatbot retrieves context from the internal knowledge base using RAG. An attacker who can add content to the knowledge base (e.g., through the publicly accessible feedback form that feeds the knowledge base) can embed instructions that cause the chatbot to extract and disclose information from other users' conversations. The system does not sanitize retrieved knowledge base content or distinguish it from system instructions in the model's context window.
Impact: An attacker could systematically extract customer names, account details, and support history from other users' conversations. Given the system processes approximately 10,000 support conversations daily, the potential exposure is significant. This represents a likely violation of GDPR Article 5(1)(f) regarding data processing integrity and confidentiality.
Success Rate: 17 of 20 attempts (85%) successfully extracted information from simulated customer conversations.
Remediation Recommendations:
- (Immediate) Implement content sanitization for all retrieved knowledge base documents before insertion into the model context
- (Short-term) Deploy a secondary classification model to detect injection attempts in knowledge base content
- (Medium-term) Implement architectural separation between knowledge base retrieval and user conversation context with explicit trust boundaries
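The immediate-term recommendation above could be sketched as follows. The suspicious-phrase pattern and delimiter tags are illustrative placeholders for a fuller ruleset, and pattern matching alone is a heuristic mitigation, not a complete defense:

```python
import re

# Illustrative phrases associated with injection attempts; a real
# deployment would maintain and tune a much larger ruleset.
SUSPICIOUS = re.compile(
    r"(ignore (all )?(previous|prior) instructions"
    r"|disregard your"
    r"|system prompt)",
    re.IGNORECASE,
)

def sanitize_kb_document(doc: str) -> str:
    """Flag likely injection phrases in a retrieved knowledge base
    document and wrap it in explicit data delimiters so the model can
    be instructed to treat the contents as data, not instructions."""
    flagged = SUSPICIOUS.sub("[FLAGGED-CONTENT]", doc)
    return f"<retrieved_document>\n{flagged}\n</retrieved_document>"
```

Delimiting retrieved content this way also supports the medium-term recommendation: it makes the trust boundary between knowledge base data and conversation context explicit in the model's context window.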
References
- MITRE ATLAS (Adversarial Threat Landscape for AI Systems). https://atlas.mitre.org/ — Vulnerability classification taxonomy for AI system findings.
- OWASP Top 10 for LLM Applications, 2025 Edition. https://owasp.org/www-project-top-10-for-large-language-model-applications/ — Risk classification framework for LLM application vulnerabilities.
- NIST AI Risk Management Framework (AI RMF 1.0), January 2023. https://www.nist.gov/artificial-intelligence/ai-risk-management-framework — Framework for contextualizing AI risk findings.
- Common Weakness Enumeration (CWE). https://cwe.mitre.org/ — Software weakness classification used to supplement AI-specific taxonomies.