SQL Injection via LLM Code Generation
How LLMs generate SQL injection vulnerabilities through string formatting, improper parameterization, and ORM misuse, with detection and prevention strategies.
Overview
SQL injection remains one of the most common vulnerability classes in web applications (CWE-89), and AI code generation tools make the problem worse. Research from Stanford and other institutions has shown that developers using AI coding assistants produce more security vulnerabilities than those coding without AI assistance, with SQL injection being one of the most frequently introduced flaw types.
The root cause is straightforward: LLMs learn from the vast corpus of code on the internet, and a significant portion of that code uses string formatting for SQL queries rather than parameterized queries. The model reproduces the patterns it has seen most frequently, and insecure patterns are often more common in tutorials, Stack Overflow answers, and introductory code examples that dominate training data.
This article examines the specific SQL injection patterns that LLMs generate, why they generate them, and how to detect and prevent these vulnerabilities in AI-assisted development workflows.
Why LLMs Generate SQL Injection
The Training Data Problem
LLMs learn code patterns from their training corpus. Consider the relative prevalence of these two approaches in publicly available code:
# Pattern 1: String formatting (INSECURE but COMMON in training data)
# This pattern appears frequently in tutorials, blog posts, and Stack Overflow
query = f"SELECT * FROM users WHERE username = '{username}'"
cursor.execute(query)
# Pattern 2: Parameterized query (SECURE but less common in tutorials)
query = "SELECT * FROM users WHERE username = %s"
cursor.execute(query, (username,))

The insecure pattern appears more frequently in training data for several reasons:
- Tutorials prioritize readability and simplicity over security
- Quick examples and prototypes skip parameterization
- Stack Overflow answers often use the simplest approach
- Older code (before modern security awareness) uses string formatting
The LLM does not understand security. It predicts the most likely next token based on patterns. When a developer asks "write a function to look up a user by name," the model generates the pattern it has seen most often.
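To make the difference concrete, here is a minimal, self-contained sqlite3 sketch (the table, rows, and attacker input are hypothetical) showing how the f-string version lets input rewrite the query's structure while the parameterized version treats the same input as inert data:

```python
import sqlite3

# Hypothetical demo schema: one user row, plus an attacker-controlled input
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

attacker_input = "nobody' OR '1'='1"

# Insecure: the input becomes part of the SQL, so the OR clause matches every row
insecure = conn.execute(
    f"SELECT * FROM users WHERE name = '{attacker_input}'"
).fetchall()

# Secure: the input is bound as data; no user is literally named "nobody' OR '1'='1"
secure = conn.execute(
    "SELECT * FROM users WHERE name = ?", (attacker_input,)
).fetchall()

print(len(insecure))  # 1 - injection leaked alice's row
print(len(secure))    # 0 - parameterized query matched nothing
```

Both functions are "functionally correct" for benign input, which is exactly why the insecure version survives review.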
Empirical Evidence
The 2023 Stanford study "Do Users Write More Insecure Code with AI Assistants?" found that participants using AI coding assistants were more likely to produce insecure code, including SQL injection. The researchers observed that AI suggestions often provided functionally correct but security-flawed implementations that developers accepted without modification.
# Common LLM-generated SQL injection patterns categorized by severity
SQL_INJECTION_PATTERNS = {
    "f_string_formatting": {
        "severity": "critical",
        "frequency": "very_common",
        "example": '''
def get_user(username: str):
    query = f"SELECT * FROM users WHERE name = '{username}'"
    return db.execute(query).fetchone()
''',
        "cwe": "CWE-89",
        "fix": '''
def get_user(username: str):
    query = "SELECT * FROM users WHERE name = ?"
    return db.execute(query, (username,)).fetchone()
''',
    },
    "format_method": {
        "severity": "critical",
        "frequency": "common",
        "example": '''
def search_products(category: str, min_price: float):
    query = "SELECT * FROM products WHERE category = '{}' AND price >= {}".format(
        category, min_price
    )
    return db.execute(query).fetchall()
''',
        "cwe": "CWE-89",
        "fix": '''
def search_products(category: str, min_price: float):
    query = "SELECT * FROM products WHERE category = ? AND price >= ?"
    return db.execute(query, (category, min_price)).fetchall()
''',
    },
    "percent_formatting": {
        "severity": "critical",
        "frequency": "common",
        "example": '''
def find_orders(customer_id: str, status: str):
    query = "SELECT * FROM orders WHERE customer_id = '%s' AND status = '%s'" % (
        customer_id, status
    )
    return db.execute(query).fetchall()
''',
        "cwe": "CWE-89",
        "fix": '''
def find_orders(customer_id: str, status: str):
    query = "SELECT * FROM orders WHERE customer_id = %s AND status = %s"
    return db.execute(query, (customer_id, status)).fetchall()
''',
    },
    "string_concatenation": {
        "severity": "critical",
        "frequency": "common",
        "example": '''
def delete_record(table: str, record_id: str):
    query = "DELETE FROM " + table + " WHERE id = " + record_id
    db.execute(query)
''',
        "cwe": "CWE-89",
        "fix": '''
def delete_record(table: str, record_id: int):
    # Table names cannot be parameterized - use allowlist
    allowed_tables = {"users", "orders", "products"}
    if table not in allowed_tables:
        raise ValueError(f"Invalid table: {table}")
    query = f"DELETE FROM {table} WHERE id = ?"
    db.execute(query, (record_id,))
''',
    },
    "orm_raw_query_misuse": {
        "severity": "high",
        "frequency": "moderate",
        "example": '''
from django.db import connection

def custom_report(start_date: str, end_date: str):
    with connection.cursor() as cursor:
        cursor.execute(
            f"SELECT * FROM reports WHERE date BETWEEN '{start_date}' AND '{end_date}'"
        )
        return cursor.fetchall()
''',
        "cwe": "CWE-89",
        "fix": '''
from django.db import connection

def custom_report(start_date: str, end_date: str):
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT * FROM reports WHERE date BETWEEN %s AND %s",
            [start_date, end_date],
        )
        return cursor.fetchall()
''',
    },
}

Advanced LLM SQL Injection Patterns
Dynamic Table and Column Names
One of the trickiest patterns LLMs generate involves dynamic table or column names. Parameterized queries cannot handle identifiers (table names, column names), so the LLM falls back to string formatting. Without an allowlist, this creates injection vulnerabilities:
# Pattern LLMs commonly generate (INSECURE)
def get_column_value(table: str, column: str, record_id: int):
    """LLMs often generate this pattern for dynamic queries."""
    query = f"SELECT {column} FROM {table} WHERE id = {record_id}"
    return db.execute(query).fetchone()

# Secure alternative with allowlisting
ALLOWED_TABLES = {"users", "products", "orders", "categories"}
ALLOWED_COLUMNS = {
    "users": {"id", "name", "email", "created_at"},
    "products": {"id", "name", "price", "category_id"},
    "orders": {"id", "user_id", "total", "status", "created_at"},
    "categories": {"id", "name", "description"},
}

def get_column_value_safe(table: str, column: str, record_id: int):
    """Secure version with allowlist validation."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"Invalid table: {table}")
    if column not in ALLOWED_COLUMNS.get(table, set()):
        raise ValueError(f"Invalid column: {column} for table {table}")
    # Safe because table and column are validated against allowlist
    query = f"SELECT {column} FROM {table} WHERE id = ?"
    return db.execute(query, (record_id,)).fetchone()

ORM Bypass Patterns
LLMs sometimes generate code that bypasses ORM protections by dropping to raw SQL, often because the developer asks for "better performance" or "a complex query":
# SQLAlchemy ORM bypass patterns LLMs generate
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

engine = create_engine("sqlite:///app.db")

# INSECURE: LLM-generated raw SQL bypass
def search_users_insecure(search_term: str):
    """LLMs generate this when asked for 'fast search' or 'complex query'."""
    with Session(engine) as session:
        # This bypasses SQLAlchemy's parameterization
        result = session.execute(
            text(f"SELECT * FROM users WHERE name LIKE '%{search_term}%'")
        )
        return result.fetchall()

# SECURE: Proper parameterized approach
def search_users_secure(search_term: str):
    """Correct parameterized version."""
    with Session(engine) as session:
        result = session.execute(
            text("SELECT * FROM users WHERE name LIKE :pattern"),
            {"pattern": f"%{search_term}%"},
        )
        return result.fetchall()

# SECURE: Using ORM properly (preferred)
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from sqlalchemy import String, select

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String(100))

def search_users_orm(search_term: str):
    """Correct ORM-based approach - no raw SQL needed."""
    with Session(engine) as session:
        stmt = select(User).where(User.name.contains(search_term))
        return session.scalars(stmt).all()

LIKE Clause Injection
LLMs frequently mishandle the LIKE clause, which requires special attention because the % and _ characters have special meaning:
# INSECURE: LLM-generated LIKE query (very common pattern)
def search_products(query: str):
    sql = f"SELECT * FROM products WHERE name LIKE '%{query}%'"
    return db.execute(sql).fetchall()

# Still flawed: parameterized, but LIKE wildcards not escaped
def search_products_partial_fix(query: str):
    sql = "SELECT * FROM products WHERE name LIKE %s"
    return db.execute(sql, (f"%{query}%",)).fetchall()
# User can still inject LIKE wildcards: "%" matches everything

# SECURE: proper LIKE with wildcard escaping
def search_products_secure(query: str):
    # Escape LIKE wildcards in user input
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    sql = "SELECT * FROM products WHERE name LIKE %s ESCAPE '\\'"
    return db.execute(sql, (f"%{escaped}%",)).fetchall()

Detection with Static Analysis
Semgrep Rules
Semgrep is effective at catching LLM-generated SQL injection patterns:
# Semgrep rules for detecting LLM-generated SQL injection
SEMGREP_SQL_INJECTION_RULES = """
rules:
  - id: python-sql-fstring-injection
    patterns:
      - pattern: |
          $CURSOR.execute(f"...{$VAR}...")
    message: >
      SQL query uses f-string with variable interpolation.
      Use parameterized queries instead. This pattern is commonly
      generated by AI coding assistants.
    languages: [python]
    severity: ERROR
    metadata:
      cwe: CWE-89
      owasp: A03:2021 - Injection
  - id: python-sql-format-injection
    patterns:
      - pattern: |
          $CURSOR.execute("...".format(...))
    message: >
      SQL query uses .format() string interpolation.
      Use parameterized queries instead.
    languages: [python]
    severity: ERROR
    metadata:
      cwe: CWE-89
  - id: python-sql-concat-injection
    patterns:
      - pattern: |
          $CURSOR.execute("..." + $VAR + "...")
    message: >
      SQL query uses string concatenation.
      Use parameterized queries instead.
    languages: [python]
    severity: ERROR
    metadata:
      cwe: CWE-89
  - id: sqlalchemy-raw-fstring
    patterns:
      - pattern: |
          text(f"...{$VAR}...")
    message: >
      SQLAlchemy text() with f-string bypasses ORM parameterization.
      Use text("... :param ...") with named parameters.
    languages: [python]
    severity: ERROR
    metadata:
      cwe: CWE-89
"""

CodeQL Queries
# CodeQL query for detecting SQL injection in AI-generated Python code
CODEQL_SQL_INJECTION = """
/**
 * @name SQL injection from AI-generated code patterns
 * @description Detects common SQL injection patterns that AI coding
 *              assistants frequently generate.
 * @kind path-problem
 * @problem.severity error
 * @precision high
 * @id py/ai-generated-sql-injection
 * @tags security
 *       cwe-89
 */

import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.dataflow.new.TaintTracking
import semmle.python.Concepts
import semmle.python.dataflow.new.RemoteFlowSources

module SqlInjectionConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source instanceof RemoteFlowSource
  }

  predicate isSink(DataFlow::Node sink) {
    exists(SqlExecution exec | sink = exec.getSql())
  }
}

module SqlInjectionFlow = TaintTracking::Global<SqlInjectionConfig>;

import SqlInjectionFlow::PathGraph

from SqlInjectionFlow::PathNode source, SqlInjectionFlow::PathNode sink
where SqlInjectionFlow::flowPath(source, sink)
select sink.getNode(), source, sink,
  "SQL injection vulnerability: user input from $@ flows to SQL query.",
  source.getNode(), "user input"
"""

CI/CD Integration
Automated SQL Injection Detection Pipeline
#!/bin/bash
# CI script to detect SQL injection in AI-generated code
# Run this in your CI/CD pipeline on every PR
set -euo pipefail

echo "=== SQL Injection Detection for AI-Generated Code ==="
PROJECT_DIR="${1:-.}"

# Step 1: Run Semgrep with SQL injection rules
echo ""
echo "--- Semgrep SQL Injection Scan ---"
if command -v semgrep &>/dev/null; then
    semgrep --config "p/python-sql-injection" \
        --config "p/owasp-top-ten" \
        "$PROJECT_DIR" \
        --json \
        --output /tmp/semgrep-sql-results.json \
        || true
    # Count findings
    FINDINGS=$(python3 -c "
import json
with open('/tmp/semgrep-sql-results.json') as f:
    data = json.load(f)
results = data.get('results', [])
sql_findings = [r for r in results if 'sql' in r.get('check_id', '').lower() or 'injection' in r.get('check_id', '').lower()]
print(len(sql_findings))
")
    echo "SQL injection findings: $FINDINGS"
else
    echo "Semgrep not installed. Install with: pip install semgrep"
fi

# Step 2: Check for common patterns with grep
echo ""
echo "--- Pattern-Based Detection ---"
# Note: grep -c prints 0 itself when nothing matches, so use || true here.
# Appending || echo "0" would emit a second line and break the arithmetic below.
echo "Checking for f-string SQL patterns..."
FSTRING_COUNT=$(grep -rn 'execute(f"' "$PROJECT_DIR" --include="*.py" | grep -ic "select\|insert\|update\|delete\|where" || true)
echo "f-string SQL patterns found: $FSTRING_COUNT"

echo "Checking for .format() SQL patterns..."
FORMAT_COUNT=$(grep -rn '\.format(' "$PROJECT_DIR" --include="*.py" | grep -ic "select\|insert\|update\|delete\|where" || true)
echo ".format() SQL patterns found: $FORMAT_COUNT"

# Step 3: Summary
echo ""
echo "=== Summary ==="
TOTAL=$((FSTRING_COUNT + FORMAT_COUNT))
if [ "$TOTAL" -gt 0 ]; then
    echo "FAIL: Found $TOTAL potential SQL injection patterns"
    echo "Review findings and convert to parameterized queries"
    exit 1
else
    echo "PASS: No obvious SQL injection patterns detected"
fi

Python Pre-Commit Hook
#!/usr/bin/env python3
"""Pre-commit hook to detect SQL injection in AI-generated code."""
import ast
import sys
from pathlib import Path

class SQLInjectionDetector(ast.NodeVisitor):
    """AST-based detector for SQL injection patterns."""

    def __init__(self, filename: str):
        self.filename = filename
        self.findings: list[dict] = []

    def visit_Call(self, node: ast.Call):
        """Check for cursor.execute() calls with unsafe arguments."""
        # Check if this is a .execute() call
        if isinstance(node.func, ast.Attribute) and node.func.attr == "execute":
            if node.args:
                first_arg = node.args[0]
                self._check_sql_argument(first_arg, node.lineno)
        self.generic_visit(node)

    def _check_sql_argument(self, node: ast.expr, lineno: int):
        """Check if a SQL argument uses unsafe string formatting."""
        # f-string detection
        if isinstance(node, ast.JoinedStr):
            # Check if it looks like SQL
            for value in node.values:
                if isinstance(value, ast.Constant) and isinstance(value.value, str):
                    sql_keywords = ["select", "insert", "update", "delete", "where", "from"]
                    if any(kw in value.value.lower() for kw in sql_keywords):
                        self.findings.append({
                            "file": self.filename,
                            "line": lineno,
                            "type": "f-string SQL",
                            "severity": "ERROR",
                        })
                        return
        # .format() detection
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Attribute) and node.func.attr == "format":
                if isinstance(node.func.value, ast.Constant):
                    sql_keywords = ["select", "insert", "update", "delete", "where"]
                    if any(kw in str(node.func.value.value).lower() for kw in sql_keywords):
                        self.findings.append({
                            "file": self.filename,
                            "line": lineno,
                            "type": ".format() SQL",
                            "severity": "ERROR",
                        })
        # String concatenation with + operator
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            self.findings.append({
                "file": self.filename,
                "line": lineno,
                "type": "concatenation SQL (check manually)",
                "severity": "WARNING",
            })

def main():
    exit_code = 0
    for filepath in sys.argv[1:]:
        if not filepath.endswith(".py"):
            continue
        try:
            source = Path(filepath).read_text()
            tree = ast.parse(source)
            detector = SQLInjectionDetector(filepath)
            detector.visit(tree)
            for finding in detector.findings:
                print(
                    f"{finding['severity']}: {finding['file']}:{finding['line']} "
                    f"- {finding['type']} detected in .execute() call"
                )
                if finding["severity"] == "ERROR":
                    exit_code = 1
        except SyntaxError:
            pass
    sys.exit(exit_code)

if __name__ == "__main__":
    main()

Developer Education
Quick Reference: Safe SQL Patterns
| Framework | Insecure (LLM often generates) | Secure |
|---|---|---|
| sqlite3 | execute(f"... {var} ...") | execute("... ? ...", (var,)) |
| psycopg2 | execute(f"... {var} ...") | execute("... %s ...", (var,)) |
| MySQL Connector | execute(f"... {var} ...") | execute("... %s ...", (var,)) |
| SQLAlchemy Core | text(f"... {var} ...") | text("... :param ..."), {"param": var} |
| Django ORM | raw(f"... {var} ...") | raw("... %s ...", [var]) |
| Flask-SQLAlchemy | db.engine.execute(f"...") | Use ORM methods or parameterized text() |
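The LIKE caveat from earlier can be verified end-to-end with sqlite3, which uses ? placeholders rather than the psycopg2-style %s shown above; the escape_like helper and product rows here are hypothetical illustrations:

```python
import sqlite3

def escape_like(term: str) -> str:
    # Escape the backslash first, then the LIKE wildcards % and _
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?)",
    [("widget",), ("gadget",), ("100% cotton shirt",)],
)

def search(term: str, escape: bool) -> list:
    pattern = f"%{escape_like(term) if escape else term}%"
    return conn.execute(
        "SELECT name FROM products WHERE name LIKE ? ESCAPE '\\'",
        (pattern,),
    ).fetchall()

# Unescaped "%" is a wildcard and matches every product
print(len(search("%", escape=False)))  # 3
# Escaped "%" matches only names containing a literal percent sign
print(len(search("%", escape=True)))   # 1 - "100% cotton shirt"
```

The same escaping rule applies to every driver in the table; only the placeholder syntax changes.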
Prompt Engineering for Safer Code
When using AI coding assistants, developers can reduce SQL injection risk by being explicit in their prompts:
# Instead of: "write a function to search users by name"
# Use: "write a function to search users by name using parameterized queries"
# Instead of: "create a database query for the report"
# Use: "create a parameterized database query using SQLAlchemy ORM for the report"
# Add to project .cursorrules or CLAUDE.md:
PROJECT_SECURITY_RULES = """
## Database Access Rules
- NEVER use f-strings, .format(), or string concatenation in SQL queries
- ALWAYS use parameterized queries with placeholders
- For dynamic table/column names, ALWAYS use an allowlist
- Prefer ORM methods over raw SQL
- When raw SQL is necessary, use framework-specific parameterization
"""References
- CWE-89: Improper Neutralization of Special Elements used in an SQL Command — https://cwe.mitre.org/data/definitions/89.html
- "Do Users Write More Insecure Code with AI Assistants?" — Perry et al., Stanford University, 2023 — https://arxiv.org/abs/2211.03622
- OWASP Top 10 for LLM Applications 2025 — LLM05: Improper Output Handling — https://genai.owasp.org/llmrisk/
- Semgrep SQL Injection Rules — https://semgrep.dev/p/python-sql-injection
- OWASP SQL Injection Prevention Cheat Sheet — https://cheatsheetseries.owasp.org/cheatsheets/SQL_Injection_Prevention_Cheat_Sheet.html
- CodeQL Python SQL Injection Queries — https://codeql.github.com/codeql-query-help/python/py-sql-injection/