Security Gaps in AI-Generated Tests
Analyzing how AI-generated test suites systematically miss security-relevant test cases, creating dangerous coverage illusions.
Overview
Test generation is one of the most popular applications of AI code generation. It requires less creative design than feature implementation, the output quality is immediately measurable through coverage metrics, and the productivity gains are substantial — what took hours of manual test writing can be accomplished in minutes.
AI code generation tools are increasingly used to produce not just application code but the tests that validate it. GitHub Copilot, ChatGPT, Claude, and dedicated testing tools like CodiumAI and Diffblue Cover can generate test suites that achieve impressive code coverage numbers. On the surface, this looks like a win for quality — more tests, less effort, higher coverage percentages.
The problem is that AI-generated tests systematically miss security-relevant test cases. They test the happy path, they test obvious error conditions, and they test edge cases that appear in training data. What they consistently fail to test are adversarial inputs, race conditions, authorization boundary violations, cryptographic correctness properties, and the subtle interaction bugs that constitute the majority of exploitable security vulnerabilities.
This creates a particularly dangerous situation: developers see 85% code coverage from AI-generated tests and assume the code is well-tested. In reality, the 15% of untested code disproportionately includes security-critical paths, and even the tested code is only validated against benign inputs. The test suite provides confidence without corresponding security assurance.
This article documents the specific categories of security tests that AI tools systematically omit, demonstrates the gap between coverage metrics and security coverage, and provides frameworks for supplementing AI-generated tests with security-focused test cases.
The Coverage Illusion
Line Coverage vs. Security Coverage
Code coverage measures which lines of code are executed during testing. It does not measure whether those lines are tested with security-relevant inputs. Consider this authentication function and its AI-generated tests:
# Application code
import hashlib
import hmac
import time

SECRET_KEY = b"example-secret-key"  # placeholder so the snippet is self-contained

class AuthError(Exception):
    """Raised for any authentication failure."""

def authenticate(username: str, password: str, db) -> dict:
"""Authenticate a user and return a session token."""
if not username or not password:
raise ValueError("Username and password required")
user = db.get_user(username)
if user is None:
raise AuthError("Invalid credentials")
password_hash = hashlib.sha256(password.encode()).hexdigest()
if password_hash != user.password_hash:
raise AuthError("Invalid credentials")
token = hmac.new(
SECRET_KEY, f"{username}:{time.time()}".encode(), hashlib.sha256
).hexdigest()
db.create_session(username, token)
    return {"token": token, "user": username}

# AI-generated tests — achieve ~90% line coverage
import hashlib
import pytest
from unittest.mock import MagicMock
def test_authenticate_success():
db = MagicMock()
db.get_user.return_value = MagicMock(
password_hash=hashlib.sha256(b"correct_password").hexdigest()
)
result = authenticate("alice", "correct_password", db)
assert "token" in result
assert result["user"] == "alice"
def test_authenticate_empty_username():
db = MagicMock()
with pytest.raises(ValueError):
authenticate("", "password", db)
def test_authenticate_empty_password():
db = MagicMock()
with pytest.raises(ValueError):
authenticate("alice", "", db)
def test_authenticate_wrong_password():
db = MagicMock()
db.get_user.return_value = MagicMock(
password_hash=hashlib.sha256(b"correct_password").hexdigest()
)
with pytest.raises(AuthError):
authenticate("alice", "wrong_password", db)
def test_authenticate_unknown_user():
db = MagicMock()
db.get_user.return_value = None
with pytest.raises(AuthError):
        authenticate("unknown", "password", db)

These tests achieve approximately 90% line coverage and look comprehensive. But they miss every security-relevant property:
# Security tests the AI did NOT generate
def test_timing_attack_resistance():
"""Password comparison should be constant-time."""
db = MagicMock()
db.get_user.return_value = MagicMock(
password_hash=hashlib.sha256(b"correct").hexdigest()
)
# Measure timing difference between correct prefix and completely wrong
# This test would reveal that sha256 comparison is not constant-time
times_similar = []
times_different = []
for _ in range(1000):
start = time.perf_counter_ns()
try:
authenticate("alice", "correctXXXXXX", db)
except AuthError:
pass
times_similar.append(time.perf_counter_ns() - start)
start = time.perf_counter_ns()
try:
authenticate("alice", "ZZZZZZZZZZZZ", db)
except AuthError:
pass
times_different.append(time.perf_counter_ns() - start)
    # Statistical check for a timing difference; a rigorous version would
    # use a proper significance test, but a large gap in means is a red flag
    avg_similar = sum(times_similar) / len(times_similar)
    avg_different = sum(times_different) / len(times_different)
    gap = abs(avg_similar - avg_different) / max(avg_similar, avg_different)
    assert gap < 0.10, f"Possible timing side channel: {gap:.1%} mean gap"
    # Remediation: compare hashes with hmac.compare_digest(), which is constant-time
def test_username_enumeration():
"""Error messages should not reveal whether a username exists."""
db = MagicMock()
db.get_user.side_effect = lambda u: (
MagicMock(password_hash="wrong") if u == "exists" else None
)
# Both should produce identical errors (message and timing)
with pytest.raises(AuthError) as exc1:
authenticate("exists", "wrong", db)
with pytest.raises(AuthError) as exc2:
authenticate("nonexistent", "wrong", db)
assert str(exc1.value) == str(exc2.value)
def test_password_hashing_algorithm():
"""Password should use a proper KDF, not bare SHA-256."""
# SHA-256 is not a suitable password hashing algorithm
# This test documents the vulnerability
# Proper implementation should use bcrypt, scrypt, or argon2
assert False, "authenticate() uses SHA-256 instead of a proper KDF"
def test_token_predictability():
"""Session tokens should not be predictable from username + timestamp."""
db = MagicMock()
db.get_user.return_value = MagicMock(
password_hash=hashlib.sha256(b"password").hexdigest()
)
# Generate multiple tokens in quick succession
tokens = set()
for _ in range(100):
result = authenticate("alice", "password", db)
tokens.add(result["token"])
# All tokens should be unique (rules out poor randomness)
assert len(tokens) == 100
def test_sql_injection_in_username():
"""Username should be safe against SQL injection."""
db = MagicMock()
db.get_user.return_value = None
# These should not cause database errors or unexpected behavior
malicious_inputs = [
"' OR '1'='1", "admin'--", "'; DROP TABLE users; --",
"admin\x00", "admin' UNION SELECT * FROM secrets --",
]
for payload in malicious_inputs:
with pytest.raises((AuthError, ValueError)):
authenticate(payload, "password", db)
# Verify the payload was passed safely to the database
        db.get_user.assert_called_with(payload)

Why AI Tests Miss Security Properties
The systematic omission of security tests has identifiable root causes:
Training data bias: The vast majority of test code in open-source repositories tests functional correctness, not security properties. AI models learn to reproduce the statistical distribution of their training data, which means they generate functional tests with high probability and security tests with low probability.
Specification gap: AI tools generate tests based on the code's apparent behavior, not its security requirements. A function that hashes passwords with SHA-256 "works correctly" from a functional perspective — the hash is computed and compared. The security deficiency (using SHA-256 instead of a proper KDF) requires domain knowledge that is not present in the code itself.
Happy-path optimization: LLMs are trained on human-written test suites that overwhelmingly test expected inputs and behaviors. Adversarial thinking — "what if an attacker sends a 10MB username?" — is underrepresented in training data.
Mock-heavy patterns: AI-generated tests heavily use mocking, which prevents security issues from manifesting. A mocked database will never reveal SQL injection. A mocked HTTP client will never reveal SSRF. The tests pass in isolation while the real integration is vulnerable.
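The mocking point is easy to demonstrate. In this sketch (the vulnerable `get_user` helper and the `users` table are hypothetical, not from the article's codebase), a mocked database would accept any input silently, while the same payload against a real in-memory SQLite database leaks rows the attacker should never see:

```python
import sqlite3

def get_user(conn, username):
    # Hypothetical vulnerable lookup: string-formatted SQL
    cur = conn.execute(f"SELECT name FROM users WHERE name = '{username}'")
    return cur.fetchall()

# Real in-memory database with a single user
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

# A MagicMock database would swallow this payload without complaint; the
# real database evaluates the injected predicate and returns every row.
rows = get_user(conn, "x' OR '1'='1")
assert rows == [("alice",)]  # data returned without knowing a valid username
```

Running security-relevant tests against an in-memory database rather than a mock costs little and makes this entire bug class observable.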
Systematic Security Test Gaps
Authorization and Access Control Testing
AI-generated tests almost never test authorization boundaries — whether a user can access resources that belong to another user:
# AI-generated: tests that the endpoint works for an authorized user
def test_get_user_profile():
response = client.get("/api/users/123/profile",
headers={"Authorization": f"Bearer {user_123_token}"})
assert response.status_code == 200
assert response.json()["id"] == 123
# Security tests the AI missed:
def test_horizontal_privilege_escalation():
"""User 123 should not access user 456's profile."""
response = client.get("/api/users/456/profile",
headers={"Authorization": f"Bearer {user_123_token}"})
assert response.status_code == 403
def test_vertical_privilege_escalation():
"""Regular user should not access admin endpoints."""
response = client.get("/api/admin/users",
headers={"Authorization": f"Bearer {regular_user_token}"})
assert response.status_code == 403
def test_missing_auth_header():
"""Requests without auth should be rejected, not treated as anonymous."""
response = client.get("/api/users/123/profile")
assert response.status_code == 401
def test_expired_token():
"""Expired tokens should be rejected."""
response = client.get("/api/users/123/profile",
headers={"Authorization": f"Bearer {expired_token}"})
assert response.status_code == 401
def test_token_from_deleted_user():
"""Tokens from deleted users should be invalid."""
delete_user(123)
response = client.get("/api/users/123/profile",
headers={"Authorization": f"Bearer {user_123_token}"})
    assert response.status_code == 401

Race Condition Testing
Concurrency vulnerabilities are almost entirely absent from AI-generated tests:
import asyncio
import aiohttp
async def test_double_spend_race_condition():
"""
Test that concurrent withdrawal requests cannot overdraw an account.
Start with balance of 100, send two simultaneous withdrawals of 100.
Only one should succeed.
"""
# Set up account with balance of 100
await set_balance(user_id=1, amount=100)
async with aiohttp.ClientSession() as session:
# Send two withdrawal requests simultaneously
tasks = [
session.post(
f"{BASE_URL}/api/withdraw",
json={"user_id": 1, "amount": 100},
headers={"Authorization": f"Bearer {token}"}
)
for _ in range(2)
]
responses = await asyncio.gather(*tasks)
# Count successes
successes = sum(1 for r in responses if r.status == 200)
assert successes <= 1, f"Double spend: {successes} withdrawals succeeded"
# Verify final balance is not negative
balance = await get_balance(user_id=1)
assert balance >= 0, f"Negative balance after race condition: {balance}"
async def test_session_fixation_race():
"""
Test that login invalidates existing sessions atomically.
Concurrent login and API request should not allow the old
session to be used after re-authentication.
"""
old_token = await login("alice", "password")
async with aiohttp.ClientSession() as session:
# Simultaneously: re-login and use old token
tasks = [
session.post(f"{BASE_URL}/api/login",
json={"username": "alice", "password": "newpassword"}),
session.get(f"{BASE_URL}/api/sensitive-data",
headers={"Authorization": f"Bearer {old_token}"}),
]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        # After re-authentication completes, the old token must be rejected
        followup = await session.get(
            f"{BASE_URL}/api/sensitive-data",
            headers={"Authorization": f"Bearer {old_token}"},
        )
        assert followup.status == 401, "Old session usable after re-authentication"

Input Validation Boundary Testing
AI tools test obvious invalid inputs but miss adversarial edge cases:
# AI-generated: basic validation tests
def test_username_empty():
assert validate_username("") is False
def test_username_valid():
assert validate_username("alice") is True
# Security tests the AI missed:
class TestUsernameAdversarial:
def test_null_byte_injection(self):
"""Null bytes can truncate strings in C-backed libraries."""
assert validate_username("admin\x00evil") is False
def test_unicode_normalization_confusion(self):
"""Different Unicode representations of the same character."""
# These look identical but are different byte sequences
assert validate_username("admin") is True # Latin
assert validate_username("аdmin") is False # Cyrillic 'а'
def test_extremely_long_input(self):
"""ReDoS or buffer overflow with long input."""
long_name = "a" * 1_000_000
# Should reject quickly, not hang or crash
import time
start = time.time()
result = validate_username(long_name)
elapsed = time.time() - start
assert result is False
assert elapsed < 1.0, f"Potential ReDoS: took {elapsed}s"
def test_format_string_characters(self):
"""Format string injection via username."""
assert validate_username("%s%s%s%s%s") is False
assert validate_username("${7*7}") is False
assert validate_username("{{7*7}}") is False # Template injection
def test_path_traversal_in_username(self):
"""Username used in file paths should not allow traversal."""
assert validate_username("../../../etc/passwd") is False
assert validate_username("..\\..\\..\\windows\\system32") is False
def test_crlf_injection(self):
"""CRLF characters in username for header injection."""
assert validate_username("alice\r\nX-Admin: true") is False
    def test_utf8_overlong_encoding(self):
        """Overlong UTF-8 byte sequences that may bypass validation."""
        # b"\xc0\xaf" is an overlong encoding of '/'. Python's strict UTF-8
        # decoder rejects it, so a lenient decode yields U+FFFD replacement
        # characters; the validator must reject whatever survives.
        overlong_slash = b"\xc0\xaf".decode("utf-8", errors="replace")
        assert validate_username(f"admin{overlong_slash}..") is False

Cryptographic Correctness Testing
AI tools almost never generate tests that verify cryptographic properties:
import os
import pytest

def test_encryption_iv_uniqueness():
"""Each encryption operation must use a unique IV."""
key = os.urandom(32)
plaintext = b"test message"
ciphertexts = [encrypt(key, plaintext) for _ in range(100)]
# Extract IVs (first 16 bytes of each ciphertext)
ivs = [ct[:16] for ct in ciphertexts]
assert len(set(ivs)) == 100, "IV reuse detected"
def test_ciphertext_malleability():
"""Flipping a bit in ciphertext should cause decryption to fail,
not produce silently corrupted plaintext."""
key = os.urandom(32)
plaintext = b"transfer $100 to alice"
ciphertext = encrypt(key, plaintext)
# Flip a bit in the ciphertext
modified = bytearray(ciphertext)
modified[20] ^= 0x01
modified = bytes(modified)
with pytest.raises(DecryptionError):
decrypt(key, modified)
def test_encryption_not_ecb():
"""Encrypting identical blocks should produce different ciphertext."""
key = os.urandom(32)
# Two identical 16-byte blocks
plaintext = b"A" * 32
ciphertext = encrypt(key, plaintext)
# In ECB mode, the two 16-byte ciphertext blocks would be identical
block1 = ciphertext[16:32] # Skip IV
block2 = ciphertext[32:48]
    assert block1 != block2, "ECB mode detected — identical plaintext blocks produce identical ciphertext"

Measuring the Security Coverage Gap
The following framework quantifies the difference between code coverage and security coverage:
"""
Security coverage analyzer: compares AI-generated test coverage
against a security test requirements matrix.
"""
from dataclasses import dataclass, field
from enum import Enum
class SecurityTestCategory(Enum):
AUTH_BOUNDARY = "Authorization boundary testing"
INPUT_ADVERSARIAL = "Adversarial input testing"
RACE_CONDITION = "Concurrency/race condition testing"
CRYPTO_PROPERTY = "Cryptographic property testing"
ERROR_HANDLING = "Security-relevant error handling"
SESSION_MANAGEMENT = "Session management testing"
INJECTION = "Injection attack testing"
INFORMATION_DISCLOSURE = "Information disclosure testing"
@dataclass
class SecurityCoverageReport:
total_tests: int = 0
security_tests: int = 0
category_coverage: dict = field(default_factory=dict)
untested_security_paths: list = field(default_factory=list)
@property
def security_ratio(self) -> float:
if self.total_tests == 0:
return 0.0
return self.security_tests / self.total_tests
def summary(self) -> str:
lines = [
f"Total tests: {self.total_tests}",
f"Security-focused tests: {self.security_tests} "
f"({self.security_ratio:.1%})",
"",
"Category coverage:",
]
for cat, count in self.category_coverage.items():
lines.append(f" {cat}: {count} tests")
missing = set(c.value for c in SecurityTestCategory) - set(
self.category_coverage.keys()
)
if missing:
lines.append("")
lines.append("MISSING categories (zero tests):")
for cat in missing:
lines.append(f" - {cat}")
return "\n".join(lines)
def analyze_test_suite(test_files: list[str]) -> SecurityCoverageReport:
"""
Analyze a test suite for security coverage.
Heuristic-based: looks for patterns indicating security test intent.
"""
report = SecurityCoverageReport()
security_indicators = {
SecurityTestCategory.AUTH_BOUNDARY: [
r"unauthorized", r"forbidden", r"403", r"401",
r"privilege", r"escalat", r"access.control",
r"other.user", r"admin.*regular", r"horizontal",
],
SecurityTestCategory.INPUT_ADVERSARIAL: [
r"injection", r"malicious", r"adversarial",
r"\\x00", r"null.byte", r"overflow", r"redos",
r"unicode", r"normali[sz]", r"traversal",
],
SecurityTestCategory.RACE_CONDITION: [
r"race", r"concurrent", r"parallel", r"double.spend",
r"toctou", r"atomic", r"thread.safe",
],
SecurityTestCategory.CRYPTO_PROPERTY: [
r"iv.reuse", r"ecb", r"malleable", r"constant.time",
r"timing.attack", r"entropy", r"random",
],
SecurityTestCategory.INJECTION: [
r"sqli", r"xss", r"command.inject", r"template.inject",
r"ssti", r"ldap.inject", r"xpath",
],
}
import re
for test_file in test_files:
with open(test_file) as f:
content = f.read().lower()
# Count test functions
test_count = len(re.findall(r"def test_", content))
report.total_tests += test_count
for category, patterns in security_indicators.items():
matches = sum(
len(re.findall(p, content)) for p in patterns
)
if matches > 0:
cat_name = category.value
report.category_coverage[cat_name] = (
report.category_coverage.get(cat_name, 0) + matches
)
report.security_tests += min(matches, test_count)
    return report

Building a Security Test Supplement Strategy
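One way to act on the analyzer's numbers is a CI gate on the ratio of security-focused tests. This is a minimal sketch; the 20% threshold is an illustrative choice, not an established standard:

```python
def security_test_gate(total_tests: int, security_tests: int,
                       min_ratio: float = 0.2) -> bool:
    """Return True when the suite meets a minimum security-test ratio."""
    if total_tests == 0:
        return False  # an empty suite never passes
    return (security_tests / total_tests) >= min_ratio

# A 100-test suite with 5 security tests fails a 20% threshold
assert security_test_gate(100, 5) is False
assert security_test_gate(100, 25) is True
```

The same gate could instead consume per-category counts and fail on specific missing categories, which is usually more actionable than an aggregate ratio.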
Organizations should not rely on AI-generated tests for security assurance. Instead, use AI-generated tests as a foundation for functional coverage and add security tests through deliberate, adversarial-focused methodologies:
Property-Based Testing: Use tools like Hypothesis (Python) or fast-check (JavaScript) to generate adversarial inputs automatically:
from hypothesis import given, strategies as st, settings
@given(
username=st.text(min_size=0, max_size=10000),
password=st.text(min_size=0, max_size=10000),
)
@settings(max_examples=1000)
def test_authenticate_never_crashes(username, password):
"""Authentication should handle any input without crashing."""
db = create_test_db()
try:
authenticate(username, password, db)
except (ValueError, AuthError):
pass # Expected errors
    # Any other exception is a bug

Security Requirement Matrices: Map each security requirement to specific test cases and track coverage at the requirement level, not just the code level.
Adversarial Test Generation Prompts: When using AI to generate tests, explicitly request security-focused tests:
Generate security tests for this authentication function. Include:
1. Authorization boundary tests (can user A access user B's data?)
2. Injection tests (SQL injection, command injection via all string inputs)
3. Timing attack tests (is password comparison constant-time?)
4. Race condition tests (concurrent login, concurrent session invalidation)
5. Cryptographic property tests (IV uniqueness, algorithm strength)
6. Input boundary tests (null bytes, Unicode normalization, extreme lengths)
Do NOT generate any functional/happy-path tests.
This directed prompting improves security test generation significantly but does not eliminate the need for human review of the generated tests.
Organizational Strategies for Closing the Gap
Security Test Requirement Matrix
For each application component, maintain a matrix that maps security requirements to specific test cases. This matrix serves as the authoritative record of what must be tested, independent of what any AI tool generates:
"""
Security test requirement matrix: ensures all security-relevant
behaviors are tested regardless of AI-generated test coverage.
"""
SECURITY_TEST_MATRIX = {
"authentication": {
"component": "src/auth/",
"requirements": [
{
"id": "AUTH-001",
"description": "Password comparison is constant-time",
"test_file": "tests/security/test_timing.py",
"test_function": "test_password_comparison_constant_time",
"category": "timing_attack",
"last_verified": "2026-03-15",
},
{
"id": "AUTH-002",
"description": "Failed login does not reveal whether username exists",
"test_file": "tests/security/test_auth.py",
"test_function": "test_no_username_enumeration",
"category": "information_disclosure",
"last_verified": "2026-03-15",
},
{
"id": "AUTH-003",
"description": "Account lockout after N failed attempts",
"test_file": "tests/security/test_auth.py",
"test_function": "test_account_lockout",
"category": "brute_force",
"last_verified": "2026-03-15",
},
{
"id": "AUTH-004",
"description": "Session tokens are invalidated on password change",
"test_file": "tests/security/test_session.py",
"test_function": "test_session_invalidation_on_password_change",
"category": "session_management",
"last_verified": "2026-03-15",
},
{
"id": "AUTH-005",
"description": "Concurrent login sessions handled correctly",
"test_file": "tests/security/test_concurrency.py",
"test_function": "test_concurrent_login_sessions",
"category": "race_condition",
"last_verified": "2026-03-15",
},
],
},
"payment_processing": {
"component": "src/payments/",
"requirements": [
{
"id": "PAY-001",
"description": "Double-spend prevented under concurrent requests",
"test_file": "tests/security/test_payments.py",
"test_function": "test_double_spend_prevention",
"category": "race_condition",
"last_verified": "2026-03-15",
},
{
"id": "PAY-002",
"description": "Payment amount cannot be manipulated client-side",
"test_file": "tests/security/test_payments.py",
"test_function": "test_amount_integrity",
"category": "business_logic",
"last_verified": "2026-03-15",
},
{
"id": "PAY-003",
"description": "PII is not logged in payment processing flow",
"test_file": "tests/security/test_logging.py",
"test_function": "test_payment_log_redaction",
"category": "information_disclosure",
"last_verified": "2026-03-15",
},
],
},
}
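The last_verified field also supports a staleness audit alongside the existence audit below. The sketch uses an inline example matrix, an arbitrary 90-day window, and a pinned "today" so the example stays deterministic:

```python
from datetime import date, timedelta

def find_stale_requirements(matrix: dict, max_age_days: int = 90,
                            today: date = date(2026, 6, 1)) -> list[str]:
    """Return IDs of requirements not re-verified within max_age_days."""
    stale = []
    for config in matrix.values():
        for req in config["requirements"]:
            verified = date.fromisoformat(req["last_verified"])
            if (today - verified) > timedelta(days=max_age_days):
                stale.append(req["id"])
    return stale

# Inline example: one fresh entry and one stale entry
example = {
    "auth": {
        "requirements": [
            {"id": "AUTH-001", "last_verified": "2026-05-20"},
            {"id": "AUTH-002", "last_verified": "2026-01-10"},
        ]
    }
}
assert find_stale_requirements(example) == ["AUTH-002"]
```

A stale security test is a weaker signal than a missing one, but it still deserves a warning in the audit report.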
def audit_security_test_coverage(matrix: dict) -> dict:
"""
Audit that all security test requirements have corresponding
test implementations that actually exist and pass.
"""
    import importlib.util
    import os
report = {"covered": 0, "missing": 0, "stale": 0, "details": []}
for component, config in matrix.items():
for req in config["requirements"]:
test_exists = os.path.exists(req["test_file"])
if not test_exists:
report["missing"] += 1
report["details"].append({
"id": req["id"],
"status": "MISSING",
"reason": f"Test file {req['test_file']} not found",
})
continue
# Check if the specific test function exists
try:
spec = importlib.util.spec_from_file_location(
"test_module", req["test_file"]
)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
if hasattr(module, req["test_function"]):
report["covered"] += 1
report["details"].append({
"id": req["id"],
"status": "COVERED",
})
else:
report["missing"] += 1
report["details"].append({
"id": req["id"],
"status": "MISSING",
"reason": f"Function {req['test_function']} not found",
})
except Exception as e:
report["missing"] += 1
report["details"].append({
"id": req["id"],
"status": "ERROR",
"reason": str(e),
})
    return report

CI/CD Integration: Security Test Gate
Add a CI/CD gate that verifies security test coverage independently of overall code coverage:
# GitHub Actions: Security test coverage gate
name: Security Test Coverage
on: [pull_request]
jobs:
security-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run security-specific tests
run: |
pytest tests/security/ -v --tb=short \
--junitxml=security-test-results.xml
- name: Verify security test matrix coverage
run: |
python scripts/audit_security_tests.py \
--matrix security_test_matrix.json \
--fail-on-missing
- name: Check for security tests in changed components
run: |
# Get list of modified source files
CHANGED=$(git diff --name-only origin/main...HEAD -- 'src/')
# For each changed component, verify security tests exist
for file in $CHANGED; do
component=$(echo "$file" | cut -d'/' -f1-2)
security_tests="tests/security/test_$(basename $component).py"
if [ -f "$security_tests" ]; then
echo "Security tests found for $component"
else
echo "::warning::No security tests found for $component"
fi
          done

Fuzzing Integration for Input Validation
Where property-based testing with Hypothesis covers structured adversarial inputs, fuzzing covers the unstructured edge: malformed data, corrupted inputs, and byte-level manipulation that no human tester would think to try:
# Integration with Atheris (Python coverage-guided fuzzer)
import atheris
import sys
def fuzz_json_parser(data):
"""Fuzz the application's JSON parser for crash bugs."""
fdp = atheris.FuzzedDataProvider(data)
try:
input_str = fdp.ConsumeUnicodeNoSurrogates(
fdp.ConsumeIntInRange(0, 10000)
)
# Call the application's input parsing function
parse_api_request(input_str)
except (ValueError, KeyError, TypeError):
pass # Expected errors
except Exception as e:
# Unexpected exceptions are bugs
raise
def fuzz_auth_token(data):
"""Fuzz the token validation function."""
fdp = atheris.FuzzedDataProvider(data)
token = fdp.ConsumeBytes(fdp.ConsumeIntInRange(0, 1000))
try:
validate_auth_token(token)
except (AuthError, ValueError):
pass # Expected
# Any other exception = bug
if __name__ == "__main__":
atheris.Setup(sys.argv, fuzz_json_parser)
    atheris.Fuzz()

Key Takeaways
AI-generated test suites create a dangerous illusion of security coverage. High line-coverage numbers mask systematic gaps in authorization testing, adversarial input testing, race condition testing, and cryptographic property testing. Organizations must treat AI-generated tests as a functional baseline that requires deliberate security supplementation. The gap between "this code is covered" and "this code is secure" is where exploitable vulnerabilities live, and closing that gap requires adversarial thinking that current AI tools do not reliably provide.
For red teams, AI-generated test gaps are an opportunity: if you know that the organization relies on AI-generated tests, you know which vulnerability categories are likely under-tested. Target those categories first — authorization boundaries, race conditions, cryptographic properties — because they are precisely the gaps where the organization's false confidence is highest and the defenses are weakest.
The most impactful change an organization can make is to stop using code coverage as a proxy for security assurance and instead adopt a security test requirement matrix that tracks coverage at the requirement level. When a security test requirement is missing, the gap is visible regardless of what the overall line coverage number says. This shifts the conversation from "we have 90% coverage" to "we have zero tests for authorization boundaries in the payment module" — a much more actionable framing that drives the right investments.
References
- Lemieux, C., et al. (2023). "CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models." International Conference on Software Engineering (ICSE). Analysis of how LLM-generated tests achieve high coverage while missing deep behavioral properties.
- Kang, S., et al. (2024). "Large Language Models are Few-Shot Testers: Exploring LLM-based General Bug Reproduction." IEEE/ACM International Conference on Software Engineering. Evaluation of LLM test generation capabilities and their limitations for security-relevant bugs.
- OWASP (2025). "Testing Guide v5." https://owasp.org/www-project-web-security-testing-guide/ — Industry standard for security testing methodology that enumerates test categories typically missed by automated generation.