AI Audit Methodology
Comprehensive methodology for auditing AI systems, including planning, evidence collection, testing procedures, report templates, and integration with red team assessments.
An AI audit systematically evaluates whether an organization's AI systems meet defined standards, regulatory requirements, and internal policies. Auditing is distinct from red teaming, which focuses on adversarial exploitation, but the two are complementary activities. Red team findings provide critical evidence for auditors, and audit frameworks help red teamers understand which controls to test and how to report findings.
Audit Planning
Pre-Audit Activities
| Activity | Purpose | Deliverable |
|---|---|---|
| Scope definition | Determine which AI systems, controls, and standards are in scope | Scope document signed by stakeholders |
| Standards identification | Identify applicable standards and regulations | Compliance matrix |
| Team composition | Assemble audit team with required competencies | Team roster with qualifications |
| Timeline and logistics | Plan audit activities, schedule interviews, arrange access | Audit schedule |
| Risk-based prioritization | Focus audit effort on highest-risk areas | Risk-prioritized audit plan |
Audit Team Competencies
AI audits require a mix of skills not typically found in traditional audit teams:
| Role | Competency | Responsibility |
|---|---|---|
| Lead auditor | Audit methodology, standards knowledge, stakeholder management | Overall audit leadership, report quality |
| AI/ML specialist | Model development, training, deployment practices | Technical evaluation of AI systems |
| Data specialist | Data governance, privacy, quality assessment | Data-related control evaluation |
| Security specialist | AI security, adversarial testing, vulnerability assessment | Security control evaluation (red team integration) |
| Domain expert | Industry-specific requirements (healthcare, finance, etc.) | Sector-specific regulatory compliance |
| Ethics/fairness specialist | Bias assessment, fairness metrics, ethical AI | Fairness and bias control evaluation |
Scope Definition Framework
| Scope Element | Questions to Answer |
|---|---|
| AI systems | Which specific AI systems are being audited? Production only, or development and staging too? |
| Lifecycle stages | Which stages are in scope (development, training, deployment, monitoring, retirement)? |
| Control objectives | Which standard's controls apply (ISO 42001 Annex A, SOC 2 TSC, custom controls)? |
| Time period | Point-in-time assessment or period-of-time evaluation? |
| Third parties | Are third-party AI components in scope (foundation models, APIs, data providers)? |
| Exclusions | What is explicitly out of scope and why? |
Evidence Collection
Types of Audit Evidence
| Evidence Type | Description | Reliability | Example |
|---|---|---|---|
| Documentary | Policies, procedures, design documents | Medium (may not reflect practice) | AI governance policy, model cards, impact assessments |
| Observational | Direct observation of processes and controls | High (real-time verification) | Watching a model deployment go through change management |
| Testimonial | Interviews with personnel | Medium (subject to bias) | Interviews with AI engineers, data scientists, risk managers |
| Analytical | Testing and analysis of system behavior | High (objective results) | Red team test results, bias analysis, performance metrics |
| System-generated | Logs, metrics, automated reports | High (objective, timestamped) | Model monitoring logs, access logs, drift detection alerts |
Evidence Collection Procedures
Document request
Issue a formal document request list to the organization covering all in-scope control areas. Allow adequate time for collection (typically 2-4 weeks).
Standard document requests for AI audits:
- AI governance policies and procedures
- AI system inventory with risk classifications
- Model development documentation (model cards, training procedures)
- Data governance documentation (data lineage, quality procedures)
- Risk assessments and impact assessments
- Incident reports and response records
- Change management records for AI systems
- Monitoring and alerting configurations
- Red team and security assessment reports
- Training records for AI development and operations staff
Interview planning
Schedule interviews with key personnel across AI governance, development, operations, and risk management functions.
Key interview subjects:
- AI governance lead / responsible AI officer
- AI/ML engineering leads
- Data engineering leads
- Security and red team leads
- Risk management and compliance
- Business stakeholders using AI outputs
Technical testing
Plan and execute technical testing activities. This is where red team capabilities integrate directly.
Testing areas:
- Control effectiveness testing (does the control work as documented?)
- Configuration review (are systems configured according to policy?)
- Adversarial testing (do controls hold under attack?)
- Performance validation (does the AI system perform as documented?)
Evidence preservation
Maintain a structured evidence repository with clear chain of custody.
Evidence management requirements:
- Unique identifier for each evidence item
- Date and time of collection
- Source of evidence
- Collector name and role
- Control objective the evidence supports
- Integrity verification (hash for digital evidence)
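One way to make these requirements concrete is to capture each evidence item as a structured record with an integrity hash computed at collection time. The sketch below is illustrative; the field names and the example values are hypothetical, not drawn from any particular standard.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EvidenceItem:
    """One item in the audit evidence repository (illustrative schema)."""
    evidence_id: str        # unique identifier
    source: str             # where the evidence came from
    collector: str          # collector name and role
    control_objective: str  # control the evidence supports
    content: bytes          # raw digital evidence
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def sha256(self) -> str:
        """Integrity hash recorded for chain of custody."""
        return hashlib.sha256(self.content).hexdigest()

# Hypothetical evidence item: a drift alert log line.
item = EvidenceItem(
    evidence_id="EV-2024-0042",
    source="model-monitoring/prod/drift-alerts.log",
    collector="J. Doe, IT auditor",
    control_objective="AI system monitoring",
    content=b"2024-05-01T12:00:00Z drift_score=0.41 threshold=0.30",
)
print(item.sha256)
```

Storing the hash alongside the metadata lets a reviewer later verify that the archived evidence still matches what was collected.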
Evidence Quality Criteria
| Criterion | Description | How to Ensure |
|---|---|---|
| Sufficient | Enough evidence to support the conclusion | Collect evidence from multiple sources for each control |
| Appropriate | Evidence is relevant to the control being assessed | Map each evidence item to a specific control objective |
| Reliable | Evidence source is trustworthy and verifiable | Prefer system-generated evidence over testimonial |
| Timely | Evidence reflects the audit period, not outdated information | Verify dates, request current documentation |
| Complete | Evidence covers the full scope of the control | Verify coverage across all in-scope AI systems |
Testing Procedures
Control Testing Approaches
| Approach | Description | When to Use | Confidence Level |
|---|---|---|---|
| Inquiry | Ask personnel how a control works | Initial understanding, low-risk controls | Low |
| Observation | Watch the control operate in real time | Process-based controls | Medium |
| Inspection | Examine documentation and artifacts | Documentation-based controls | Medium |
| Re-performance | Independently execute the control process | Critical controls, high-risk areas | High |
| Adversarial testing | Attempt to bypass or break the control | Security controls, safety controls | Highest |
AI-Specific Testing Procedures
Model governance testing:
| Test | Procedure | Expected Evidence |
|---|---|---|
| Model inventory completeness | Compare documented inventory against discovered AI systems | All AI systems accounted for in the inventory |
| Change management adherence | Select a sample of model deployments and verify change management records | Approved change requests for each deployment |
| Model documentation accuracy | Compare model cards against actual model behavior | Documentation matches tested behavior |
| Version control | Verify model versions in production match approved versions | Version alignment confirmed |
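The inventory completeness test above reduces to a set comparison between what the organization documents and what the auditor discovers. A minimal sketch, with hypothetical system names:

```python
# Documented model inventory vs. systems discovered during the audit
# (e.g. via network scans, cloud resource listings, interviews).
documented = {"credit-scoring-v3", "chat-assist-v1", "fraud-detect-v2"}
discovered = {"credit-scoring-v3", "chat-assist-v1", "fraud-detect-v2",
              "churn-predict-poc"}  # found running, absent from the inventory

undocumented = sorted(discovered - documented)  # shadow AI: in use, not inventoried
stale = sorted(documented - discovered)         # inventoried, not found running

print("Undocumented systems:", undocumented)
print("Stale inventory entries:", stale)
```

Any entry in `undocumented` becomes a finding against the inventory control; entries in `stale` prompt follow-up on whether the system was retired without updating the inventory.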
Data governance testing:
| Test | Procedure | Expected Evidence |
|---|---|---|
| Training data provenance | Trace training data to its source and verify authorization | Documented data lineage with consent records |
| Data quality controls | Review data validation procedures and test with sample data | Quality checks operating effectively |
| Data retention compliance | Verify data retention periods match policy and regulatory requirements | Data aged off per schedule |
| Privacy controls | Test for PII in training data and model outputs | PII detection and handling procedures effective |
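The privacy control test can be partially automated with a pattern-based scan over sample training records. The sketch below covers only email addresses and US-style SSNs; the patterns are illustrative and are not a substitute for a production PII detection service.

```python
import re

# Illustrative PII patterns; a real audit would use a vetted detection tool.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_record(text: str) -> list[str]:
    """Return the PII categories detected in one record."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

sample = "Contact jane.doe@example.com, SSN 123-45-6789, re: loan decision"
print(scan_record(sample))  # → ['email', 'ssn']
```

A positive hit in sampled training data is evidence that the documented PII handling procedures are not operating effectively.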
Security control testing (red team integration):
| Test | Procedure | Expected Evidence |
|---|---|---|
| Prompt injection resistance | Execute prompt injection test suite against production endpoints | Injection attempts blocked or detected |
| Data extraction prevention | Attempt training data and system prompt extraction | Extraction attempts fail or trigger alerts |
| Access control effectiveness | Test API authentication, authorization, and rate limiting | Unauthorized access prevented |
| Monitoring effectiveness | Execute attacks and verify detection | Attacks detected within defined SLA |
Fairness and bias testing:
| Test | Procedure | Expected Evidence |
|---|---|---|
| Demographic parity | Compare model outcomes across protected groups | Outcomes within acceptable disparity thresholds |
| Equal opportunity | Compare error rates across protected groups | Error rates consistent across groups |
| Calibration testing | Verify prediction confidence aligns with actual outcomes across groups | Calibration consistent across groups |
| Intersectional analysis | Test for bias at intersections of protected characteristics | No significant intersectional disparities |
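The demographic parity and equal opportunity tests reduce to comparing rates across groups. A worked example on toy counts (the numbers and group labels are fabricated for illustration):

```python
# Toy outcome counts per protected group.
# positives = favorable decisions; tp = true positives among actual positives.
groups = {
    "group_a": {"positives": 40, "total": 100, "tp": 30, "actual_pos": 40},
    "group_b": {"positives": 25, "total": 100, "tp": 20, "actual_pos": 40},
}

# Demographic parity: compare selection rates.
selection_rate = {g: d["positives"] / d["total"] for g, d in groups.items()}
parity_diff = abs(selection_rate["group_a"] - selection_rate["group_b"])

# Equal opportunity: compare true positive rates.
tpr = {g: d["tp"] / d["actual_pos"] for g, d in groups.items()}
eo_diff = abs(tpr["group_a"] - tpr["group_b"])

print(f"Demographic parity difference: {parity_diff:.2f}")  # 0.15
print(f"Equal opportunity difference: {eo_diff:.2f}")       # 0.25
```

The auditor then compares these differences against the organization's documented disparity thresholds; where no threshold is documented, that absence is itself a finding.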
Audit Reporting
Report Structure
| Section | Content | Audience |
|---|---|---|
| Executive summary | Overall assessment, critical findings, key recommendations | Board, C-suite |
| Scope and methodology | What was audited, how, and against which standards | All stakeholders |
| Findings summary | Categorized findings by severity and control area | Management, risk committee |
| Detailed findings | Individual finding descriptions with evidence and recommendations | Technical teams, compliance |
| Control effectiveness matrix | Control-by-control assessment results | Auditors, compliance, risk management |
| Recommendations | Prioritized recommendations with implementation guidance | Technical teams, management |
| Management response | Organization's response to findings and remediation plans | All stakeholders |
Finding Classification
| Severity | Definition | Timeline |
|---|---|---|
| Critical | Control failure that could result in significant harm, regulatory action, or material misstatement | Immediate remediation (within 30 days) |
| High | Control deficiency that significantly increases risk exposure | Remediation within 60 days |
| Medium | Control weakness that should be addressed but does not pose immediate significant risk | Remediation within 90 days |
| Low | Observation or improvement opportunity | Address during next review cycle |
| Advisory | Best practice recommendation with no current deficiency | Consider for implementation |
Finding Documentation Template
| Field | Content |
|---|---|
| Finding ID | Unique identifier |
| Title | Concise description of the finding |
| Severity | Critical / High / Medium / Low / Advisory |
| Control reference | Applicable standard and control (e.g., ISO 42001 A.6.2.4) |
| Condition | What was found (the current state) |
| Criteria | What was expected (the standard or requirement) |
| Cause | Why the gap exists (root cause analysis) |
| Effect | What could happen if unaddressed (risk statement) |
| Recommendation | What should be done to remediate |
| Evidence | References to supporting evidence |
| Management response | Organization's planned remediation |
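Capturing the template as structured data makes findings easier to track, aggregate, and report on. A sketch with a fictional example finding (all values are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class AuditFinding:
    """Structured form of the finding documentation template."""
    finding_id: str
    title: str
    severity: str          # Critical / High / Medium / Low / Advisory
    control_reference: str
    condition: str         # what was found
    criteria: str          # what was expected
    cause: str             # why the gap exists
    effect: str            # risk if unaddressed
    recommendation: str

finding = AuditFinding(
    finding_id="F-2024-007",
    title="Model deployments bypass change management",
    severity="High",
    control_reference="ISO 42001 A.6.2.4",
    condition="3 of 10 sampled deployments lacked approved change requests",
    criteria="All production model changes require an approved change request",
    cause="CI/CD pipeline permits direct deployment by ML engineers",
    effect="Unreviewed models could reach production users",
    recommendation="Gate the deployment pipeline on change-request approval",
)
print(finding.severity)
```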
Red Team Integration Points
Where Red Teaming Fits in the Audit
| Audit Phase | Red Team Contribution |
|---|---|
| Planning | Provide threat intelligence to inform risk-based audit scope |
| Evidence collection | Red team reports serve as analytical evidence of control effectiveness |
| Testing | Adversarial testing provides the highest-confidence assessment of security controls |
| Reporting | Red team findings directly map to audit findings |
| Follow-up | Red team re-testing validates remediation effectiveness |
Translating Red Team Findings to Audit Language
Red teams and auditors use different terminology. This translation table helps:
| Red Team Term | Audit Term |
|---|---|
| Vulnerability | Control deficiency |
| Exploit | Control failure |
| Attack surface | Risk exposure area |
| Severity rating | Finding classification |
| Proof of concept | Supporting evidence |
| Remediation guidance | Recommendation |
| Re-test | Remediation verification |
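The table above can double as a lookup when importing red team report fields into audit workpapers. A trivial sketch:

```python
# Terminology mapping from the translation table above.
RT_TO_AUDIT = {
    "vulnerability": "control deficiency",
    "exploit": "control failure",
    "attack surface": "risk exposure area",
    "severity rating": "finding classification",
    "proof of concept": "supporting evidence",
    "remediation guidance": "recommendation",
    "re-test": "remediation verification",
}

print(RT_TO_AUDIT["vulnerability"])  # → control deficiency
```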
Continuous Audit Considerations
Traditional audits are point-in-time assessments. For AI systems that change frequently, consider continuous audit approaches:
| Traditional Audit | Continuous Audit |
|---|---|
| Annual or periodic assessment | Ongoing assessment with periodic reporting |
| Manual evidence collection | Automated evidence collection and monitoring |
| Point-in-time control testing | Continuous control monitoring |
| Findings reported at audit completion | Findings reported in real time |
| Manual follow-up on remediation | Automated remediation verification |
Continuous Audit Architecture
AI System → Monitoring Layer → Evidence Repository → Analysis Engine → Dashboard
│            │                  │                     │                 │
├─ Logs      ├─ Automated tests ├─ Evidence store     ├─ Compliance     ├─ Alerts
├─ Metrics   ├─ Drift detection ├─ Chain of custody   ├─ Trend analysis ├─ Reports
└─ Events    └─ Red team tests  └─ Audit trail        └─ Gap detection  └─ KPIs
This architecture enables organizations to maintain audit readiness at all times rather than scrambling to prepare for periodic assessments. Red team test results feed continuously into the evidence repository, providing ongoing assurance of control effectiveness.
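The analysis-engine step of this architecture can be sketched as a rule that evaluates incoming evidence against control thresholds and raises findings in real time. The control names, metrics, and thresholds below are hypothetical.

```python
from datetime import datetime, timezone

# Latest evidence items flowing in from the monitoring layer (illustrative).
evidence_stream = [
    {"control": "drift-monitoring", "metric": "drift_score", "value": 0.12},
    {"control": "injection-defense", "metric": "block_rate", "value": 0.88},
]

# Hypothetical thresholds: direction ("max" or "min") and limit per metric.
THRESHOLDS = {
    "drift_score": ("max", 0.30),  # finding if drift exceeds 0.30
    "block_rate": ("min", 0.95),   # finding if block rate drops below 95%
}

def evaluate(item: dict) -> dict:
    """Classify one evidence item as 'effective' or 'finding'."""
    direction, limit = THRESHOLDS[item["metric"]]
    passing = (item["value"] <= limit) if direction == "max" else (item["value"] >= limit)
    return {
        "control": item["control"],
        "status": "effective" if passing else "finding",
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }

results = [evaluate(e) for e in evidence_stream]
print([r["status"] for r in results])  # → ['effective', 'finding']
```

Findings surface on the dashboard as soon as the evidence arrives, rather than waiting for the next audit cycle.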