Methodology for Auditing AI-Generated Code
Structured audit methodology for evaluating the security of AI-generated code, covering static analysis, dynamic testing, and organizational assessment.
Overview
Auditing AI-generated code requires a different approach than traditional code review. AI-generated code has characteristic vulnerability patterns, lacks the contextual understanding that human developers bring, and is produced at a volume that makes manual review impractical. Organizations need a structured methodology that combines automated tooling with targeted manual review, applied at the right points in the development lifecycle.
This article presents a comprehensive audit methodology for AI-generated code, covering identification, static analysis, dynamic testing, organizational assessment, and reporting. The methodology is designed to be repeatable, scalable, and integrated into existing security programs.
Audit Framework Overview
Phase Model
The AI code generation audit methodology consists of five phases:
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
class AuditPhase(Enum):
IDENTIFICATION = "1_identification"
STATIC_ANALYSIS = "2_static_analysis"
DYNAMIC_TESTING = "3_dynamic_testing"
ORGANIZATIONAL = "4_organizational"
REPORTING = "5_reporting"
@dataclass
class AuditPhaseDetail:
phase: AuditPhase
objective: str
activities: list[str]
tools: list[str]
outputs: list[str]
estimated_hours: float
AUDIT_PHASES = [
AuditPhaseDetail(
phase=AuditPhase.IDENTIFICATION,
objective="Identify which code was AI-generated and which tools were used",
activities=[
"Interview developers about AI tool usage patterns",
"Analyze git history for AI-generated commits",
"Review CI/CD logs for AI tool integration",
"Map AI tool configurations (.cursorrules, CLAUDE.md, .aiderignore)",
"Identify high-risk code areas (auth, crypto, input handling)",
],
tools=["git", "custom scripts", "developer interviews"],
outputs=[
"AI code generation inventory",
"Tool usage map",
"High-risk area identification",
],
estimated_hours=4.0,
),
AuditPhaseDetail(
phase=AuditPhase.STATIC_ANALYSIS,
objective="Detect vulnerability patterns characteristic of AI-generated code",
activities=[
"Run Semgrep with AI-specific rule sets",
"Run CodeQL for dataflow analysis",
"Run Bandit for Python security issues",
"Check for known AI vulnerability patterns",
"Scan for license compliance issues",
],
tools=["Semgrep", "CodeQL", "Bandit", "ScanCode"],
outputs=[
"Static analysis findings report",
"Vulnerability inventory with severity ratings",
"License compliance report",
],
estimated_hours=8.0,
),
AuditPhaseDetail(
phase=AuditPhase.DYNAMIC_TESTING,
objective="Verify that identified vulnerabilities are exploitable",
activities=[
"Test SQL injection points identified in static analysis",
"Test XSS sinks with crafted payloads",
"Test authentication and authorization boundaries",
"Test command injection in AI-generated shell interactions",
"Fuzz API endpoints generated by AI",
],
tools=["Burp Suite", "sqlmap", "custom scripts", "pytest"],
outputs=[
"Confirmed vulnerability list",
"Proof-of-concept exploits",
"False positive analysis",
],
estimated_hours=12.0,
),
AuditPhaseDetail(
phase=AuditPhase.ORGANIZATIONAL,
objective="Assess organizational controls around AI code generation",
activities=[
"Review AI tool governance policies",
"Assess developer training on AI code security",
"Review code review processes for AI-generated code",
"Evaluate CI/CD security gates for AI code",
"Check data classification compliance",
],
tools=["Policy documents", "interviews", "process review"],
outputs=[
"Organizational maturity assessment",
"Policy gap analysis",
"Process improvement recommendations",
],
estimated_hours=6.0,
),
AuditPhaseDetail(
phase=AuditPhase.REPORTING,
objective="Produce actionable findings with prioritized recommendations",
activities=[
"Consolidate technical and organizational findings",
"Risk-rank findings by exploitability and impact",
"Develop remediation recommendations",
"Create executive summary",
"Present findings to stakeholders",
],
tools=["Report templates", "risk frameworks"],
outputs=[
"Technical audit report",
"Executive summary",
"Remediation roadmap",
],
estimated_hours=4.0,
),
]
Phase 1: AI Code Identification
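Scoping the engagement is part of identification. The hour estimates in the phase model above can be totalled into a simple time budget before fieldwork starts; the sketch below is standalone (the numbers are copied from AUDIT_PHASES, and plan_engagement is an illustrative helper, not part of the methodology code):

```python
# Phase names and hour estimates copied from the AUDIT_PHASES table above.
PHASE_HOURS = {
    "identification": 4.0,
    "static_analysis": 8.0,
    "dynamic_testing": 12.0,
    "organizational": 6.0,
    "reporting": 4.0,
}

def plan_engagement(hours: dict[str, float], day_length: float = 8.0) -> dict:
    """Turn per-phase hour estimates into a simple time budget."""
    total = sum(hours.values())
    return {
        "total_hours": total,
        "total_days": round(total / day_length, 1),
        "share": {phase: round(h / total * 100) for phase, h in hours.items()},
    }

plan = plan_engagement(PHASE_HOURS)
print(f"{plan['total_hours']}h over ~{plan['total_days']} days")
```

Dynamic testing dominates the budget, which matches the phase model: confirming exploitability is the most labor-intensive step.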
Git History Analysis
The first challenge is identifying which code was generated by AI. Several signals help:
import subprocess
import re
import json
from pathlib import Path
from datetime import datetime
class AICodeIdentifier:
"""Identify AI-generated code in a repository."""
# Patterns that indicate AI-generated commits
AI_COMMIT_PATTERNS = [
r"(?i)generated\s+by\s+(copilot|cursor|claude|aider|gpt|ai)",
r"(?i)co-authored-by:.*\b(copilot|claude|aider)\b",
r"(?i)aider:", # Aider commit message prefix
r"(?i)\[ai\]|\[generated\]|\[copilot\]",
]
# Code patterns characteristic of AI generation
AI_CODE_PATTERNS = [
r"# Generated by",
r"// Auto-generated",
r"# TODO: Add error handling", # Common AI placeholder
r"# TODO: Add tests", # Common AI placeholder
r"pass\s+# placeholder", # AI placeholder pattern
]
def __init__(self, repo_path: str):
self.repo_path = repo_path
    def identify_ai_commits(self, since: str = "6 months ago") -> list[dict]:
        """Identify commits likely generated by AI tools.

        Uses NUL-separated records so the commit body, where
        Co-authored-by trailers live, can be searched as well.
        """
        result = subprocess.run(
            ["git", "log", "-z", f"--since={since}",
             "--format=%H|%an|%ae|%aI|%s%n%b"],
            capture_output=True, text=True, cwd=self.repo_path,
        )
        ai_commits = []
        for record in result.stdout.split("\0"):
            if not record.strip():
                continue
            header, _, body = record.partition("\n")
            parts = header.split("|", 4)
            if len(parts) < 5:
                continue
            commit_hash, author, email, date, subject = parts
            haystack = f"{author}\n{subject}\n{body}"
            for pattern in self.AI_COMMIT_PATTERNS:
                if re.search(pattern, haystack):
                    ai_commits.append({
                        "hash": commit_hash,
                        "author": author,
                        "email": email,
                        "subject": subject,
                        "date": date,
                        "detection_method": "commit_pattern",
                    })
                    break
        return ai_commits
def identify_ai_code_patterns(self) -> list[dict]:
"""Scan codebase for patterns characteristic of AI-generated code."""
findings = []
code_extensions = {".py", ".js", ".ts", ".jsx", ".tsx", ".java", ".go"}
for filepath in Path(self.repo_path).rglob("*"):
if filepath.suffix not in code_extensions:
continue
if any(skip in str(filepath) for skip in ["node_modules", ".git", "venv"]):
continue
try:
content = filepath.read_text()
for i, line in enumerate(content.split("\n"), 1):
for pattern in self.AI_CODE_PATTERNS:
if re.search(pattern, line):
findings.append({
"file": str(filepath.relative_to(self.repo_path)),
"line": i,
"pattern": pattern,
"content": line.strip()[:100],
})
            except (UnicodeDecodeError, OSError):  # skip unreadable or binary files
pass
return findings
    def analyze_coding_velocity(self, since: str = "3 months ago") -> dict:
        """Analyze commit velocity for signs of AI-assisted development.

        Heuristic: commits that add many lines while deleting few often
        correspond to accepted AI-generated blocks.
        """
        result = subprocess.run(
            ["git", "log", f"--since={since}", "--format=%aI|%an", "--shortstat"],
            capture_output=True, text=True, cwd=self.repo_path,
        )
        total_commits = 0
        large_addition_commits = 0
        for line in result.stdout.splitlines():
            if "|" in line:  # commit header line: "<date>|<author>"
                total_commits += 1
            elif "insertion" in line:
                # e.g. " 3 files changed, 412 insertions(+), 5 deletions(-)"
                stats = {}
                for part in line.strip().split(", "):
                    count = int(part.split()[0])
                    if "insertion" in part:
                        stats["insertions"] = count
                    elif "deletion" in part:
                        stats["deletions"] = count
                if stats.get("insertions", 0) > 300 and stats.get("deletions", 0) < 10:
                    large_addition_commits += 1
        return {
            "total_commits": total_commits,
            "large_addition_commits": large_addition_commits,
            "note": "Sudden velocity increases may indicate AI tool adoption",
        }
Developer Interview Guide
# Structured interview questions for AI code generation audit
DEVELOPER_INTERVIEW = {
"tool_usage": [
"Which AI coding tools do you use? (Copilot, Cursor, Claude Code, Aider, other)",
"How frequently do you accept AI suggestions without modification?",
"Do you use AI for generating security-sensitive code (auth, crypto, input validation)?",
"What model do you typically use? (GPT-4, Claude, Codex, local models)",
"Do you use agent/autonomous mode, or only completion/suggestion mode?",
],
"code_review": [
"How do you review AI-generated code before accepting it?",
"Do you distinguish between human-written and AI-generated code in reviews?",
"Have you ever caught a security issue in AI-generated code?",
"Are AI-generated changes reviewed differently in pull requests?",
],
"configuration": [
"Is there a .cursorrules, CLAUDE.md, or .aiderignore in your projects?",
"Who maintains these configuration files?",
"Are there project-level security instructions for AI tools?",
"Do you use .cursorignore or similar to exclude sensitive files?",
],
"incidents": [
"Have you experienced any security issues related to AI-generated code?",
"Has AI-generated code ever introduced a bug that was hard to diagnose?",
"Have you seen AI suggest dependencies that didn't exist?",
],
}
Phase 2: Static Analysis
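Before running the scanners, interview answers from Phase 1 can be reduced to a per-team priority score so the heaviest analysis lands on the riskiest code first. A minimal sketch; the answer fields and weights are illustrative assumptions, not part of any standard:

```python
# Hypothetical interview summary -> scan priority. Field names and weights
# are illustrative; adapt them to the interview guide above.
def scan_priority(answers: dict) -> int:
    """Score 0-100: higher means this team's code gets scanned first."""
    score = 0
    if answers.get("uses_agent_mode"):        # autonomous edits see the least review
        score += 40
    if answers.get("ai_for_auth_or_crypto"):  # AI used for security-sensitive code
        score += 30
    if answers.get("accepts_unmodified"):     # suggestions merged as-is
        score += 20
    if not answers.get("reviews_ai_diffs"):   # no AI-aware review step
        score += 10
    return min(score, 100)

print(scan_priority({"uses_agent_mode": True, "ai_for_auth_or_crypto": True}))
```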
AI-Specific Semgrep Configuration
# Semgrep rules targeting AI-generated code patterns
AI_CODE_SEMGREP_RULES = """
rules:
# SQL Injection patterns common in AI-generated code
- id: ai-sql-fstring
patterns:
- pattern: |
$CURSOR.execute(f"...{$VAR}...")
message: >
SQL injection via f-string. AI coding assistants commonly generate
this pattern. Use parameterized queries.
languages: [python]
severity: ERROR
metadata:
cwe: CWE-89
category: ai-generated
# Hardcoded secrets (AI often generates placeholder secrets)
- id: ai-hardcoded-secret
patterns:
- pattern: |
$KEY = "sk-..."
- pattern: |
api_key = "..."
- pattern: |
password = "..."
message: >
Hardcoded secret detected. AI coding assistants often generate
placeholder credentials that developers forget to replace.
languages: [python]
severity: ERROR
metadata:
cwe: CWE-798
category: ai-generated
# Missing input validation (AI often skips validation)
- id: ai-missing-input-validation
patterns:
- pattern: |
@app.route("...", methods=["POST"])
def $FUNC():
$DATA = request.json
...
$DB.insert($DATA)
- pattern-not: |
@app.route("...", methods=["POST"])
def $FUNC():
$DATA = request.json
...
validate(...)
...
$DB.insert($DATA)
message: >
POST endpoint inserts request data without validation.
AI-generated endpoints often skip input validation.
languages: [python]
severity: WARNING
metadata:
cwe: CWE-20
category: ai-generated
# Insecure deserialization
- id: ai-pickle-load
pattern: pickle.load(...)
message: >
pickle.load() on untrusted data allows arbitrary code execution.
AI assistants frequently suggest pickle for serialization.
languages: [python]
severity: ERROR
metadata:
cwe: CWE-502
category: ai-generated
# Missing error handling
- id: ai-bare-except
pattern: |
except:
pass
message: >
Bare except with pass suppresses all errors including security-relevant
ones. AI tools generate this as a placeholder.
languages: [python]
severity: WARNING
metadata:
cwe: CWE-755
category: ai-generated
"""Running the Static Analysis Suite
#!/bin/bash
# Comprehensive static analysis for AI-generated code audit
set -euo pipefail
PROJECT_DIR="${1:-.}"
REPORT_DIR="${2:-/tmp/ai-audit-report}"
mkdir -p "$REPORT_DIR"
echo "=== AI-Generated Code Security Audit - Static Analysis ==="
echo "Project: $PROJECT_DIR"
echo "Reports: $REPORT_DIR"
echo ""
# Step 1: Semgrep with AI-specific rules
echo "--- Step 1: Semgrep Analysis ---"
if command -v semgrep &>/dev/null; then
semgrep --config "p/owasp-top-ten" \
--config "p/python-sql-injection" \
--config "p/xss" \
--config "p/python-security-audit" \
"$PROJECT_DIR" \
--json \
--output "$REPORT_DIR/semgrep-results.json" \
--exclude "node_modules" --exclude ".git" --exclude "venv" \
2>/dev/null || true
# Summarize
python3 -c "
import json
with open('$REPORT_DIR/semgrep-results.json') as f:
data = json.load(f)
results = data.get('results', [])
print(f'Total findings: {len(results)}')
by_severity = {}
for r in results:
sev = r.get('extra', {}).get('severity', 'unknown')
by_severity[sev] = by_severity.get(sev, 0) + 1
for sev, count in sorted(by_severity.items()):
print(f' {sev}: {count}')
"
else
echo "Semgrep not installed. Install: pip install semgrep"
fi
# Step 2: Bandit (Python security linter)
echo ""
echo "--- Step 2: Bandit Analysis ---"
if command -v bandit &>/dev/null; then
bandit -r "$PROJECT_DIR" \
-f json \
-o "$REPORT_DIR/bandit-results.json" \
--exclude ".git,node_modules,venv,__pycache__" \
-ll \
2>/dev/null || true
python3 -c "
import json
with open('$REPORT_DIR/bandit-results.json') as f:
data = json.load(f)
results = data.get('results', [])
print(f'Total findings: {len(results)}')
by_severity = {}
for r in results:
sev = r.get('issue_severity', 'unknown')
by_severity[sev] = by_severity.get(sev, 0) + 1
for sev, count in sorted(by_severity.items()):
print(f' {sev}: {count}')
"
else
echo "Bandit not installed. Install: pip install bandit"
fi
# Step 3: Pattern-based checks for AI-specific issues
echo ""
echo "--- Step 3: AI-Specific Pattern Checks ---"
# Check for hardcoded credentials (common in AI-generated code).
# Note: grep -c combined with -r prints per-file counts, and under
# `set -o pipefail` a no-match grep would abort the script, so pipe
# match lines through wc -l for a single total instead.
echo "Hardcoded credentials:"
grep -rn "api_key\s*=\s*['\"]" "$PROJECT_DIR" --include="*.py" \
    --exclude-dir=".git" --exclude-dir="node_modules" 2>/dev/null | wc -l || true

# Check for eval/exec usage
echo "eval/exec usage:"
grep -rn "\beval\b\|\bexec\b" "$PROJECT_DIR" --include="*.py" \
    --exclude-dir=".git" --exclude-dir="node_modules" 2>/dev/null | wc -l || true

# Check for missing error handling patterns
echo "Bare except clauses:"
grep -rn "except:" "$PROJECT_DIR" --include="*.py" \
    --exclude-dir=".git" --exclude-dir="node_modules" 2>/dev/null | wc -l || true
echo ""
echo "=== Static Analysis Complete ==="
echo "Results saved to: $REPORT_DIR"Phase 3: Dynamic Testing
Targeted Testing Strategy
Dynamic testing for AI-generated code focuses on the vulnerability classes AI is most likely to introduce:
import requests
from urllib.parse import quote
class AICodeDynamicTester:
"""Dynamic testing focused on common AI-generated vulnerability patterns."""
def __init__(self, base_url: str):
self.base_url = base_url.rstrip("/")
self.findings: list[dict] = []
def test_sql_injection_endpoints(
self, endpoints: list[dict]
) -> list[dict]:
"""Test endpoints for SQL injection vulnerabilities."""
sqli_payloads = [
"' OR '1'='1",
"' OR '1'='1' --",
"1; DROP TABLE users --",
"' UNION SELECT NULL, username, password FROM users --",
"1' AND SLEEP(5) --",
]
findings = []
for endpoint in endpoints:
url = f"{self.base_url}{endpoint['path']}"
for param in endpoint.get("params", []):
for payload in sqli_payloads:
try:
params = {param: payload}
response = requests.get(url, params=params, timeout=10)
# Check for SQL error messages in response
sql_errors = [
"syntax error",
"mysql_fetch",
"sqlite3.OperationalError",
"psycopg2.errors",
"ORA-",
"SQLSTATE",
]
for error in sql_errors:
if error.lower() in response.text.lower():
findings.append({
"type": "sql_injection",
"endpoint": endpoint["path"],
"parameter": param,
"payload": payload,
"evidence": error,
"severity": "critical",
})
break
# Check for time-based blind SQLi
if "SLEEP" in payload and response.elapsed.total_seconds() > 4:
findings.append({
"type": "blind_sql_injection",
"endpoint": endpoint["path"],
"parameter": param,
"payload": payload,
"evidence": f"Response time: {response.elapsed.total_seconds():.1f}s",
"severity": "critical",
})
except requests.exceptions.Timeout:
if "SLEEP" in payload:
findings.append({
"type": "blind_sql_injection",
"endpoint": endpoint["path"],
"parameter": param,
"payload": payload,
"evidence": "Request timed out (possible sleep injection)",
"severity": "critical",
})
except requests.exceptions.RequestException:
pass
return findings
def test_xss_endpoints(self, endpoints: list[dict]) -> list[dict]:
"""Test endpoints for reflected XSS vulnerabilities."""
xss_payloads = [
'<script>alert("XSS")</script>',
'<img src=x onerror=alert(1)>',
'" onmouseover="alert(1)"',
"javascript:alert(1)",
'<svg/onload=alert(1)>',
]
findings = []
for endpoint in endpoints:
url = f"{self.base_url}{endpoint['path']}"
for param in endpoint.get("params", []):
for payload in xss_payloads:
try:
params = {param: payload}
response = requests.get(url, params=params, timeout=10)
# Check if payload is reflected without encoding
if payload in response.text:
findings.append({
"type": "reflected_xss",
"endpoint": endpoint["path"],
"parameter": param,
"payload": payload,
"evidence": "Payload reflected unencoded in response",
"severity": "high",
})
except requests.exceptions.RequestException:
pass
return findings
def test_authentication_bypass(
self, protected_endpoints: list[str]
) -> list[dict]:
"""Test for authentication bypass in AI-generated auth code."""
findings = []
bypass_techniques = [
{"name": "No auth header", "headers": {}},
{"name": "Empty bearer", "headers": {"Authorization": "Bearer "}},
{"name": "Invalid token", "headers": {"Authorization": "Bearer invalid"}},
{"name": "Admin role claim", "headers": {"X-User-Role": "admin"}},
]
for endpoint in protected_endpoints:
url = f"{self.base_url}{endpoint}"
for technique in bypass_techniques:
try:
response = requests.get(
url, headers=technique["headers"], timeout=10
)
if response.status_code == 200:
findings.append({
"type": "auth_bypass",
"endpoint": endpoint,
"technique": technique["name"],
"evidence": f"Got 200 OK with {technique['name']}",
"severity": "critical",
})
except requests.exceptions.RequestException:
pass
        return findings
Phase 4: Organizational Assessment
Maturity Model
from dataclasses import dataclass
@dataclass
class MaturityDimension:
dimension: str
level_1: str # Initial
level_2: str # Developing
level_3: str # Defined
level_4: str # Managed
level_5: str # Optimizing
AI_CODE_SECURITY_MATURITY = [
MaturityDimension(
dimension="Tool Governance",
level_1="No policy on AI tool usage",
level_2="Informal guidance on approved tools",
level_3="Formal policy with approved tool list",
level_4="Policy enforced via technical controls",
level_5="Continuous assessment of new tools",
),
MaturityDimension(
dimension="Code Review",
level_1="No distinction between AI and human code",
level_2="Awareness that AI code needs extra review",
level_3="Documented review checklist for AI code",
level_4="Automated checks in CI/CD for AI patterns",
level_5="Metrics-driven continuous improvement",
),
MaturityDimension(
dimension="Developer Training",
level_1="No training on AI code security",
level_2="Ad hoc awareness communications",
level_3="Formal training program for AI tool users",
level_4="Regular training with assessments",
level_5="Threat modeling exercises with AI scenarios",
),
MaturityDimension(
dimension="Static Analysis",
level_1="No static analysis",
level_2="Generic SAST tools without AI-specific rules",
level_3="SAST with AI-specific rule sets",
level_4="Custom rules for organization's AI patterns",
level_5="ML-enhanced detection of AI code issues",
),
MaturityDimension(
dimension="Incident Response",
level_1="No AI-specific incident procedures",
level_2="AI code issues handled ad hoc",
level_3="Documented playbook for AI code incidents",
level_4="Practiced playbook with tabletop exercises",
level_5="Automated detection and response",
),
]
Phase 5: Reporting
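The maturity assessment from Phase 4 feeds the report as a per-dimension score. A small sketch that reduces assessed levels to the headline numbers the report needs (the example scores are illustrative; summarize_maturity is a hypothetical helper):

```python
def summarize_maturity(scores: dict[str, int]) -> dict:
    """Summarize per-dimension maturity levels (1-5) for the report."""
    average = round(sum(scores.values()) / len(scores), 1)
    weakest = min(scores, key=scores.get)
    return {"average": average, "weakest_dimension": weakest}

# Illustrative scores for the five dimensions defined in Phase 4.
print(summarize_maturity({
    "Tool Governance": 2,
    "Code Review": 3,
    "Developer Training": 1,
    "Static Analysis": 3,
    "Incident Response": 2,
}))
```

The weakest dimension is a natural anchor for the remediation roadmap's medium-term items.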
Report Template
from datetime import datetime

def generate_audit_report(
project_name: str,
ai_commits: list,
static_findings: list,
dynamic_findings: list,
maturity_scores: dict,
) -> str:
"""Generate the final audit report."""
critical = sum(1 for f in static_findings + dynamic_findings if f.get("severity") == "critical")
high = sum(1 for f in static_findings + dynamic_findings if f.get("severity") == "high")
report = f"""
# AI-Generated Code Security Audit Report
## Project: {project_name}
## Date: {datetime.utcnow().strftime('%Y-%m-%d')}
---
## Executive Summary
This audit assessed the security of AI-generated code in the {project_name} project.
The assessment identified {len(ai_commits)} commits attributed to AI coding tools,
{len(static_findings)} static analysis findings, and {len(dynamic_findings)} confirmed
vulnerabilities through dynamic testing.
**Critical findings: {critical} | High findings: {high}**
## Key Findings
### 1. AI Code Generation Scope
- {len(ai_commits)} commits identified as AI-generated
- AI coding tools in use: [list from identification phase]
- Estimated percentage of codebase AI-generated: [X]%
### 2. Vulnerability Summary
| Severity | Static | Dynamic | Total |
|---|---|---|---|
| Critical | {sum(1 for f in static_findings if f.get('severity') == 'critical')} | {sum(1 for f in dynamic_findings if f.get('severity') == 'critical')} | {critical} |
| High | {sum(1 for f in static_findings if f.get('severity') == 'high')} | {sum(1 for f in dynamic_findings if f.get('severity') == 'high')} | {high} |
### 3. Top Vulnerability Categories
[Summarize by CWE]
### 4. Organizational Maturity
[Summarize maturity assessment]
## Recommendations
1. **Immediate**: Remediate all critical and high findings
2. **Short-term**: Deploy AI-specific Semgrep rules in CI/CD
3. **Medium-term**: Establish AI code generation governance policy
4. **Long-term**: Build organizational maturity to Level 3+
## Detailed Findings
[Individual findings with evidence, impact, and remediation]
"""
    return report
References
- OWASP Code Review Guide — https://owasp.org/www-project-code-review-guide/
- Semgrep Documentation — https://semgrep.dev/docs/
- CodeQL Documentation — https://codeql.github.com/docs/
- "Do Users Write More Insecure Code with AI Assistants?" — Perry et al., 2023 — https://arxiv.org/abs/2211.03622
- OWASP Top 10 for LLM Applications 2025 — https://genai.owasp.org/llmrisk/
- NIST AI Risk Management Framework — https://www.nist.gov/artificial-intelligence/ai-risk-management-framework
- CWE Top 25 Most Dangerous Software Weaknesses — https://cwe.mitre.org/top25/