Security Risks of AI-Assisted Refactoring
Analysis of security vulnerabilities introduced when AI tools refactor existing code, including subtle behavioral changes and security property violations.
Overview
Refactoring is where AI code tools are at their most dangerous, because refactoring is where developer trust is highest. When an AI generates new code, developers understand it needs review. When an AI refactors existing code — code that already works, that already passes tests — the bar for scrutiny drops. The assumption is that the AI is simply cleaning up, not changing behavior. That assumption is wrong in security-critical ways.
AI-assisted refactoring is one of the most seductive applications of code generation technology. Developers point an AI tool at messy, legacy code and ask it to clean up the structure, modernize the API usage, or simplify complex control flow. The output looks cleaner, reads better, and often passes the existing test suite. But "passes the existing tests" and "preserves all security properties" are very different things.
The core problem is that security properties are often implicit in code — they arise from specific implementation choices that may look like accidents or poor style. A constant-time comparison that uses a manual loop instead of == looks like it should be refactored. A try-except block that catches a broad exception class and returns a generic error looks like it should be made more specific. An authentication check that queries the database twice looks redundant. In each case, the "messy" code is messy for a security reason, and an AI tool that does not understand that reason will "improve" the code by removing the security property.
This article catalogs the specific categories of security regressions introduced by AI refactoring, provides techniques for detecting them, and establishes review practices that protect security properties during AI-assisted code modernization.
Categories of Security-Breaking Refactors
Timing Side-Channel Introduction
Constant-time operations are a common casualty of AI refactoring. The AI sees verbose, manual implementations and replaces them with concise built-in operations that are not constant-time:
# Original: constant-time comparison (secure but "ugly")
def verify_signature(provided: bytes, expected: bytes) -> bool:
    if len(provided) != len(expected):
        return False
    result = 0
    for a, b in zip(provided, expected):
        result |= a ^ b
    return result == 0

# AI-refactored: "simplified" but introduces timing side-channel
def verify_signature(provided: bytes, expected: bytes) -> bool:
    return provided == expected

The AI sees a function that compares two byte strings and "simplifies" it. The refactored version is functionally equivalent for all non-adversarial inputs but leaks information about how many bytes match through timing differences. An attacker can use this to extract the expected signature byte by byte.
The same pattern appears in HMAC verification:
# Original: uses hmac.compare_digest (constant-time)
import hmac

def verify_webhook(payload: bytes, signature: str, secret: bytes) -> bool:
    expected = hmac.new(secret, payload, "sha256").hexdigest()
    return hmac.compare_digest(expected, signature)

# AI-refactored: "simplified" string comparison
def verify_webhook(payload: bytes, signature: str, secret: bytes) -> bool:
    expected = hmac.new(secret, payload, "sha256").hexdigest()
    return expected == signature  # Timing-vulnerable

Error Handling Security Regression
AI refactoring frequently makes error handling more "Pythonic" or "idiomatic" in ways that break security properties:
# Original: catch-all that prevents information disclosure
def authenticate(username: str, password: str) -> dict:
    try:
        user = db.get_user(username)
        if not user:
            raise AuthError("Invalid credentials")
        if not bcrypt.checkpw(password.encode(), user.password_hash):
            raise AuthError("Invalid credentials")
        return create_session(user)
    except AuthError:
        raise
    except Exception:
        # Intentionally generic to prevent information disclosure
        raise AuthError("Invalid credentials")

# AI-refactored: "better" error handling that leaks information
def authenticate(username: str, password: str) -> dict:
    user = db.get_user(username)
    if user is None:
        raise UserNotFoundError(f"User '{username}' not found")
    if not bcrypt.checkpw(password.encode(), user.password_hash):
        raise InvalidPasswordError("Incorrect password")
    return create_session(user)

The AI created more specific exception types and more descriptive error messages — a textbook refactoring improvement. But the refactored version reveals whether a username exists (username enumeration) and distinguishes between wrong username and wrong password, which is a well-known security anti-pattern.
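The generic-error property is easy to lock in with a regression test that drives both failure modes and asserts the responses are indistinguishable. A minimal runnable sketch (the `authenticate`, `AuthError`, and in-memory user store here are illustrative stand-ins for the real module):

```python
class AuthError(Exception):
    pass

USERS = {"alice": "correct-password"}  # illustrative in-memory store

def authenticate(username: str, password: str) -> str:
    user_password = USERS.get(username)
    # Single generic failure path for both unknown user and bad password.
    # (A real implementation would also hash against a dummy bcrypt hash
    # for unknown users so the two paths take comparable time.)
    if user_password is None or user_password != password:
        raise AuthError("Invalid credentials")
    return f"session-for-{username}"

def failure(username: str, password: str):
    """Capture the (exception type, message) pair for a failed login."""
    try:
        authenticate(username, password)
    except AuthError as e:
        return (type(e).__name__, str(e))
    return None

# Unknown user and wrong password must be indistinguishable to the caller
assert failure("no-such-user", "x") == failure("alice", "wrong")
```

If an AI refactor later splits the failure path into `UserNotFoundError` and `InvalidPasswordError`, this assertion fails immediately.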
Atomicity and Race Condition Introduction
AI refactoring often breaks atomic operations into separate steps for "clarity":
# Original: atomic check-and-update (secure against race conditions)
def transfer_funds(from_id: int, to_id: int, amount: float, db):
    with db.transaction():
        result = db.execute(
            """
            UPDATE accounts SET balance = balance - %s
            WHERE id = %s AND balance >= %s
            RETURNING balance
            """,
            (amount, from_id, amount)
        )
        if not result:
            raise InsufficientFundsError()
        db.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, to_id)
        )

# AI-refactored: "cleaner" but introduces TOCTOU race condition
def transfer_funds(from_id: int, to_id: int, amount: float, db):
    sender = db.get_account(from_id)
    if sender.balance < amount:
        raise InsufficientFundsError(
            f"Balance {sender.balance} insufficient for transfer of {amount}"
        )
    # TOCTOU: balance could change between check above and update below
    sender.balance -= amount
    recipient = db.get_account(to_id)
    recipient.balance += amount
    db.save(sender)
    db.save(recipient)

The refactored version separates the balance check from the balance update, creating a time-of-check-to-time-of-use (TOCTOU) vulnerability. Two concurrent transfers could both pass the balance check and result in a negative balance.
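If the structure of `transfer_funds` genuinely needs refactoring, the atomic conditional update must survive the rewrite. A runnable sketch of the safe shape using the standard-library sqlite3 module (the table layout and function signature are illustrative, not the article's `db` API):

```python
import sqlite3

def transfer_funds(conn: sqlite3.Connection, from_id: int,
                   to_id: int, amount: int) -> None:
    with conn:  # one transaction around both statements
        # Atomic check-and-debit: the WHERE clause enforces the balance
        # check in the same statement as the update, so there is no
        # window between check and use.
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, from_id, amount),
        )
        if cur.rowcount == 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, to_id),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
transfer_funds(conn, 1, 2, 30)  # succeeds: 100 >= 30
```

Because the `with conn:` block commits only on success and rolls back on exception, a failed check leaves both accounts untouched.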
Authorization Check Removal
During refactoring, AI tools sometimes consolidate or remove checks that appear redundant but serve defense-in-depth purposes:
# Original: defense-in-depth authorization
class DocumentController:
    def get_document(self, request, doc_id):
        # Layer 1: authentication (verified by middleware)
        user = request.user
        if not user.is_authenticated:
            raise Unauthorized()
        # Layer 2: document exists
        doc = self.doc_repo.get(doc_id)
        if not doc:
            raise NotFound()
        # Layer 3: explicit authorization check
        if doc.owner_id != user.id and not user.has_role("admin"):
            raise Forbidden()
        # Layer 4: field-level filtering
        return self._filter_sensitive_fields(doc, user)

    def _filter_sensitive_fields(self, doc, user):
        result = doc.to_dict()
        if not user.has_role("admin"):
            result.pop("internal_notes", None)
            result.pop("audit_trail", None)
        return result

# AI-refactored: "simplified" — removed "redundant" auth check
class DocumentController:
    def get_document(self, request, doc_id):
        # AI assumes middleware handles auth, removes explicit check
        doc = self.doc_repo.get(doc_id)
        if not doc:
            raise NotFound()
        # AI removed owner check because "the API gateway handles authorization"
        return doc.to_dict()  # Also removed field filtering as "unnecessary"

The AI removed the explicit authorization check (assuming middleware handles it), removed the ownership check (possibly because it was not covered by tests), and removed the field-level filtering (treating it as unnecessary complexity). The result is a clean, simple function that exposes every document to every authenticated user with all fields visible.
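Removed-as-redundant authorization layers are easiest to catch with tests that exercise each layer directly, so that deleting a layer fails CI. A self-contained sketch with in-memory fakes (the `Forbidden`/`NotFound` names and controller shape mirror the example above; the fakes themselves are illustrative):

```python
class Forbidden(Exception):
    pass

class NotFound(Exception):
    pass

class Doc:
    def __init__(self, owner_id: int, internal_notes: str):
        self.owner_id = owner_id
        self.internal_notes = internal_notes

    def to_dict(self) -> dict:
        return {"owner_id": self.owner_id, "internal_notes": self.internal_notes}

class DocumentController:
    def __init__(self, docs: dict):
        self.docs = docs

    def get_document(self, user: dict, doc_id: int) -> dict:
        doc = self.docs.get(doc_id)
        if doc is None:
            raise NotFound()
        # Explicit ownership check: the layer AI refactoring tends to drop
        if doc.owner_id != user["id"] and "admin" not in user["roles"]:
            raise Forbidden()
        result = doc.to_dict()
        # Field-level filtering for non-admins
        if "admin" not in user["roles"]:
            result.pop("internal_notes", None)
        return result

ctl = DocumentController({1: Doc(owner_id=7, internal_notes="secret")})
owner = {"id": 7, "roles": []}
stranger = {"id": 8, "roles": []}

assert "internal_notes" not in ctl.get_document(owner, 1)  # filtering layer
try:
    ctl.get_document(stranger, 1)
    raise AssertionError("ownership check missing")
except Forbidden:
    pass  # authorization layer still present
```

One assertion per layer means a refactor that silently removes any single layer produces an unambiguous test failure, not just a coverage dip.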
Security Header and Configuration Stripping
When AI tools refactor web application configurations, they often remove security headers or settings that look like boilerplate:
# Original: Flask app with security configuration
import os
from datetime import timedelta

from flask import Flask
from flask_talisman import Talisman

app = Flask(__name__)

# Security headers
Talisman(
    app,
    content_security_policy={
        "default-src": "'self'",
        "script-src": "'self'",
        "style-src": "'self' 'unsafe-inline'",
    },
    force_https=True,
    strict_transport_security=True,
    strict_transport_security_max_age=31536000,
    session_cookie_secure=True,
    session_cookie_httponly=True,
    session_cookie_samesite="Lax",
)

app.config.update(
    SECRET_KEY=os.environ["SECRET_KEY"],
    SESSION_COOKIE_SECURE=True,
    SESSION_COOKIE_HTTPONLY=True,
    SESSION_COOKIE_SAMESITE="Lax",
    PERMANENT_SESSION_LIFETIME=timedelta(hours=1),
    MAX_CONTENT_LENGTH=16 * 1024 * 1024,
)

# AI-refactored: "cleaned up" — stripped "unnecessary" config
import os

from flask import Flask

app = Flask(__name__)
app.config["SECRET_KEY"] = os.environ.get("SECRET_KEY", "dev-secret")

The AI reduced the configuration to the "essential" parts, removing CSP headers, HTTPS enforcement, cookie security flags, content length limits, and session lifetime restrictions. It also introduced a default value for SECRET_KEY that would be used if the environment variable is not set — a critical security regression.
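Configuration stripping of this kind is cheap to guard against with a test that asserts the security-relevant settings directly. A sketch against a plain dict standing in for `app.config` (the required keys mirror the configuration above; the `dev-` prefix heuristic is illustrative):

```python
REQUIRED_SECURE_SETTINGS = {
    "SESSION_COOKIE_SECURE": True,
    "SESSION_COOKIE_HTTPONLY": True,
    "SESSION_COOKIE_SAMESITE": "Lax",
}

def check_security_config(config: dict) -> list[str]:
    """Return a list of violations; an empty list means the config passes."""
    problems = []
    for key, expected in REQUIRED_SECURE_SETTINGS.items():
        if config.get(key) != expected:
            problems.append(f"{key} must be {expected!r}, got {config.get(key)!r}")
    # A missing or default-looking SECRET_KEY is itself a finding
    if not config.get("SECRET_KEY"):
        problems.append("SECRET_KEY missing")
    elif config["SECRET_KEY"].startswith("dev-"):
        problems.append("SECRET_KEY looks like a development default")
    return problems

hardened = {"SECRET_KEY": "x" * 32, "SESSION_COOKIE_SECURE": True,
            "SESSION_COOKIE_HTTPONLY": True, "SESSION_COOKIE_SAMESITE": "Lax"}
stripped = {"SECRET_KEY": "dev-secret"}

assert check_security_config(hardened) == []
assert len(check_security_config(stripped)) == 4  # three flags gone + weak key
```

Run against the real `app.config` in CI, a refactor that drops any cookie flag or adds a fallback secret fails the build before review even starts.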
Detecting Security Regressions
Differential Testing
The most effective technique for detecting security-breaking refactors is differential testing — running the original and refactored code against the same inputs and comparing not just outputs but observable properties:
"""
Differential testing framework for detecting security regressions
in AI-refactored code.
"""
import time
import statistics
from typing import Callable, Any
from dataclasses import dataclass


@dataclass
class DifferentialResult:
    test_name: str
    original_output: Any
    refactored_output: Any
    outputs_match: bool
    timing_difference_ns: float
    security_note: str


def differential_test(
    original_fn: Callable,
    refactored_fn: Callable,
    test_inputs: list[tuple],
    timing_samples: int = 100,
) -> list[DifferentialResult]:
    """
    Compare original and refactored functions for behavioral differences.
    Checks both output correctness and timing characteristics.
    """
    results = []
    for args in test_inputs:
        # Compare outputs
        try:
            orig_out = original_fn(*args)
        except Exception as e:
            orig_out = (type(e).__name__, str(e))
        try:
            ref_out = refactored_fn(*args)
        except Exception as e:
            ref_out = (type(e).__name__, str(e))

        # Compare timing (for constant-time operations)
        orig_times = []
        ref_times = []
        for _ in range(timing_samples):
            start = time.perf_counter_ns()
            try:
                original_fn(*args)
            except Exception:
                pass
            orig_times.append(time.perf_counter_ns() - start)

            start = time.perf_counter_ns()
            try:
                refactored_fn(*args)
            except Exception:
                pass
            ref_times.append(time.perf_counter_ns() - start)

        timing_diff = abs(
            statistics.median(orig_times) - statistics.median(ref_times)
        )

        security_notes = []
        if orig_out != ref_out:
            security_notes.append(
                f"Output changed: {orig_out!r} -> {ref_out!r}"
            )

        # Check if refactored version has higher timing variance
        # (may indicate loss of constant-time property)
        orig_cv = (statistics.stdev(orig_times) /
                   max(statistics.mean(orig_times), 1))
        ref_cv = (statistics.stdev(ref_times) /
                  max(statistics.mean(ref_times), 1))
        if ref_cv > orig_cv * 2:
            security_notes.append(
                "Timing variance increased significantly — "
                "possible loss of constant-time property"
            )

        results.append(DifferentialResult(
            test_name=f"args={args}",
            original_output=orig_out,
            refactored_output=ref_out,
            outputs_match=(orig_out == ref_out),
            timing_difference_ns=timing_diff,
            security_note="; ".join(security_notes) if security_notes else "OK",
        ))
    return results

Security Property Assertions
Embed security property assertions in the codebase that will fail if a refactoring violates them:
"""
Security property assertions that survive refactoring.
Place these in test files that are run as part of CI/CD.
"""
import ast
import inspect


def assert_constant_time_comparison(module, function_name: str):
    """
    Assert that a comparison function does not use Python's == operator
    on the compared values, which would not be constant-time.
    """
    source = inspect.getsource(getattr(module, function_name))
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Compare):
            for op in node.ops:
                if isinstance(op, (ast.Eq, ast.NotEq)):
                    # Check if the comparison involves the function's parameters.
                    # This is a heuristic — may need refinement.
                    raise AssertionError(
                        f"{function_name} uses == for comparison. "
                        f"Use hmac.compare_digest() or a manual XOR loop "
                        f"for constant-time comparison. "
                        f"Line {node.lineno}"
                    )


def assert_generic_auth_errors(module, function_name: str):
    """
    Assert that an authentication function does not raise specific
    error types that could enable username enumeration.
    """
    source = inspect.getsource(getattr(module, function_name))
    tree = ast.parse(source)
    forbidden_exceptions = {
        "UserNotFoundError", "InvalidPasswordError",
        "AccountDisabledError", "AccountLockedError",
    }
    for node in ast.walk(tree):
        if isinstance(node, ast.Raise) and node.exc:
            if isinstance(node.exc, ast.Call):
                if isinstance(node.exc.func, ast.Name):
                    if node.exc.func.id in forbidden_exceptions:
                        raise AssertionError(
                            f"{function_name} raises {node.exc.func.id}, "
                            f"which could enable username enumeration. "
                            f"Use generic AuthError instead. "
                            f"Line {node.lineno}"
                        )


def assert_transaction_wrapping(module, function_name: str):
    """
    Assert that a function that modifies multiple database records
    does so within a transaction context manager.
    """
    source = inspect.getsource(getattr(module, function_name))
    tree = ast.parse(source)
    has_transaction = False
    for node in ast.walk(tree):
        if isinstance(node, ast.With):
            for item in node.items:
                if isinstance(item.context_expr, ast.Call):
                    func = item.context_expr.func
                    name = ""
                    if isinstance(func, ast.Attribute):
                        name = func.attr
                    elif isinstance(func, ast.Name):
                        name = func.id
                    if "transaction" in name.lower():
                        has_transaction = True
    if not has_transaction:
        raise AssertionError(
            f"{function_name} modifies database records without "
            f"a transaction context manager. This could introduce "
            f"race conditions or partial update vulnerabilities."
        )

Semgrep Rules for Common Regressions
Create custom Semgrep rules that detect patterns commonly introduced by AI refactoring:
rules:
  - id: ai-refactor-timing-attack
    patterns:
      - pattern: $SECRET == $USER_INPUT
      - metavariable-regex:
          metavariable: $SECRET
          regex: ".*(token|secret|key|hash|signature|hmac|digest).*"
    message: >
      Direct comparison of security-sensitive value detected.
      This may have been introduced by AI refactoring.
      Use hmac.compare_digest() for constant-time comparison.
    severity: ERROR
    languages: [python]

  - id: ai-refactor-specific-auth-error
    # pattern-either fires when ANY variant appears; a `patterns` list
    # would require all of them to match the same code, which never happens
    pattern-either:
      - pattern: raise UserNotFoundError(...)
      - pattern: raise InvalidPasswordError(...)
    message: >
      Specific authentication error type detected.
      AI refactoring may have introduced username enumeration.
      Use a generic AuthError with identical messages for all failure cases.
    severity: ERROR
    languages: [python]

  - id: ai-refactor-removed-csp
    # Flags every Flask app creation as a prompt to verify CSP configuration;
    # Semgrep cannot cleanly express "this file lacks a CSP setting anywhere"
    pattern: $APP = Flask(...)
    message: >
      Flask application detected; verify its Content Security Policy
      was not removed during AI refactoring.
    severity: WARNING
    languages: [python]

  - id: ai-refactor-default-secret
    # "..." matches any string literal used as the fallback value
    pattern: os.environ.get("SECRET_KEY", "...")
    message: >
      SECRET_KEY has a default fallback value.
      AI refactoring may have added this. Use os.environ["SECRET_KEY"]
      to fail fast if the secret is not configured.
    severity: ERROR
    languages: [python]

Process-Level Mitigations
Security Property Documentation
Before AI refactoring, document the security properties that must be preserved:
# security_properties.py — checked into the repository
"""
SECURITY PROPERTIES — DO NOT REMOVE OR REFACTOR WITHOUT SECURITY REVIEW

This file documents security-critical properties of the codebase
that must be preserved during any refactoring, including AI-assisted.
These properties may not be obvious from the code alone.

1. auth.verify_password: Uses constant-time comparison (hmac.compare_digest).
   Reason: Prevents timing-based password guessing.
   Test: test_security/test_timing.py::test_password_comparison_constant_time

2. auth.authenticate: Returns identical errors for all failure modes.
   Reason: Prevents username enumeration.
   Test: test_security/test_auth.py::test_no_username_enumeration

3. payment.transfer: Executes within a database transaction with SELECT FOR UPDATE.
   Reason: Prevents double-spend race conditions.
   Test: test_security/test_concurrency.py::test_concurrent_transfer

4. app.config: Must include CSP, HSTS, and secure cookie flags.
   Reason: Defense-in-depth against XSS, MITM, and session hijacking.
   Test: test_security/test_headers.py::test_security_headers_present
"""

AI Refactoring Review Checklist
When reviewing AI-refactored code, apply this checklist:
- Error handling changes: Did the refactoring change exception types or error messages? Could the new messages enable information disclosure?
- Comparison operations: Did any manual comparison loops get replaced with ==? Check if the original was intentionally constant-time.
- Transaction boundaries: Did database operations get restructured? Verify atomicity is preserved.
- Authorization checks: Count authorization checks before and after. Were any removed as "redundant"?
- Security headers and configuration: Compare security-relevant configuration line by line. Were any settings removed or given default values?
- Logging changes: Did the refactoring add logging that includes sensitive data, or remove logging that was needed for audit trails?
- Dependency changes: Did the refactoring replace a security-focused library with a general-purpose one (e.g., replacing bcrypt with hashlib)?
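Several checklist items can be partially automated. For the configuration item, a sketch that diffs security-relevant settings between pre- and post-refactor config snapshots (the key list is illustrative and should be tailored to the application):

```python
SECURITY_KEYS = {
    "SESSION_COOKIE_SECURE", "SESSION_COOKIE_HTTPONLY",
    "SESSION_COOKIE_SAMESITE", "PERMANENT_SESSION_LIFETIME",
    "MAX_CONTENT_LENGTH",
}

def config_regressions(before: dict, after: dict) -> list[str]:
    """Report security-relevant keys that were removed or changed."""
    findings = []
    for key in SECURITY_KEYS & before.keys():
        if key not in after:
            findings.append(f"removed: {key}")
        elif after[key] != before[key]:
            findings.append(f"changed: {key} {before[key]!r} -> {after[key]!r}")
    return sorted(findings)

before = {"SESSION_COOKIE_SECURE": True, "SESSION_COOKIE_HTTPONLY": True,
          "MAX_CONTENT_LENGTH": 16 * 1024 * 1024}
after = {"SESSION_COOKIE_SECURE": False}

findings = config_regressions(before, after)
assert len(findings) == 3  # one weakened flag, two removed settings
```

The same shape works for any key-value configuration source: load both versions, intersect with the security-relevant key set, and treat every removal or change as a reviewable finding.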
Real-World Refactoring Regression Case Studies
Case Study 1: Authentication Downgrade
An organization used an AI tool to modernize a legacy Django authentication module. The original code used a custom authentication backend that implemented multi-factor verification inline — checking both password and TOTP token in the same authentication flow. The AI refactoring "cleaned up" the authentication by splitting it into separate middleware components for password and MFA verification. The refactored code was more modular and testable, but it introduced a gap: the MFA middleware could be bypassed by directly accessing internal API endpoints that were behind the password middleware but ahead of the MFA middleware in the middleware chain.
# Original: monolithic but secure — both checks in one atomic operation
class SecureAuthBackend(ModelBackend):
    def authenticate(self, request, username=None, password=None, totp=None):
        user = super().authenticate(request, username=username, password=password)
        if user is None:
            return None
        if not self._verify_totp(user, totp):
            return None  # MFA failure rejects even with correct password
        return user

# AI-refactored: modular but vulnerable — checks separated
class PasswordMiddleware:
    def process_request(self, request):
        # Authenticates via password, sets request.user
        pass

class MFAMiddleware:
    def process_request(self, request):
        # Checks MFA for request.user
        # But only runs for URLs in MFA_REQUIRED_PATHS
        pass

The fix required the organization to review every endpoint to ensure MFA middleware coverage, then implement a defense-in-depth check that rejected any request from a user whose session did not have an MFA-verified flag — essentially reimplementing the monolithic check that the AI had helpfully removed.
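The compensating control described above, rejecting any request whose session lacks an MFA-verified flag, can be sketched as a deny-by-default check that runs on every request regardless of middleware ordering. The request/session shapes here are illustrative dicts, not Django's actual API:

```python
class MFARequired(Exception):
    pass

# Paths that must remain reachable before MFA completes
MFA_EXEMPT_PATHS = {"/login", "/mfa/verify", "/health"}

def enforce_mfa(request: dict) -> None:
    """Deny-by-default: every non-exempt request must carry an
    MFA-verified session flag, independent of middleware ordering."""
    if request["path"] in MFA_EXEMPT_PATHS:
        return
    session = request.get("session", {})
    # Absence of the flag is a denial; never assume upstream set it
    if not session.get("mfa_verified", False):
        raise MFARequired(request["path"])

enforce_mfa({"path": "/health"})  # exempt path passes
enforce_mfa({"path": "/api/internal", "session": {"mfa_verified": True}})
try:
    enforce_mfa({"path": "/api/internal", "session": {}})
    raise AssertionError("unverified session was allowed through")
except MFARequired:
    pass
```

The key property is that the check fails closed: a request that never passed through the MFA middleware has no flag and is rejected, which is exactly the guarantee the monolithic backend provided.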
Case Study 2: Logging Regression
An AI tool refactored a payment processing module to use structured logging. The original code used a custom logging wrapper that redacted credit card numbers and CVVs from log output. The AI replaced this with a standard structured logging library that produced cleaner, more queryable logs — but without the redaction logic. PCI-scoped data was written to log aggregation systems for three weeks before the regression was discovered during a routine compliance audit.
import re

# Original custom logger with redaction
class PaymentLogger:
    REDACT_PATTERNS = [
        (r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b', 'CARD-REDACTED'),
        (r'\b\d{3,4}\b(?=.*cvv)', 'CVV-REDACTED'),
    ]

    def info(self, message, **kwargs):
        sanitized = self._redact(str(kwargs))
        self._logger.info(message, extra={"data": sanitized})

    def _redact(self, text):
        for pattern, replacement in self.REDACT_PATTERNS:
            text = re.sub(pattern, replacement, text)
        return text

# AI-refactored: clean structured logging, no redaction
import structlog

logger = structlog.get_logger()

def process_payment(card_number, cvv, amount):
    logger.info("processing_payment",
                card=card_number,  # PCI data in plain text
                amount=amount)

Case Study 3: Rate Limiting Removal
An API rate limiting implementation used a Redis-backed counter with a sliding window algorithm. The original code was complex — it handled distributed deployment, clock skew, and race conditions with a Lua script executed atomically in Redis. An AI refactoring "simplified" it to an in-memory counter, which appeared to work in development (single instance) but provided no protection in production (multiple instances behind a load balancer, each maintaining independent counters):
# Original: distributed rate limiting via a Redis Lua script
RATE_LIMIT_SCRIPT = """
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)
if count < limit then
    redis.call('ZADD', key, now, now .. math.random())
    redis.call('EXPIRE', key, window)
    return 1
end
return 0
"""

# AI-refactored: "simplified" in-memory counter
from collections import defaultdict
import time

request_counts = defaultdict(list)  # Only works for a single instance!

def check_rate_limit(client_id: str, limit: int = 100,
                     window: int = 60) -> bool:
    now = time.time()
    request_counts[client_id] = [
        t for t in request_counts[client_id] if t > now - window
    ]
    if len(request_counts[client_id]) >= limit:
        return False
    request_counts[client_id].append(now)
    return True

Red Team Exercise: Auditing AI-Refactored Code
The following exercise provides a structured approach for red teams to evaluate AI-refactored code for security regressions:
Step 1 — Identify refactored code: Query version control for commits or PRs that mention AI-assisted refactoring, or scan for patterns indicating bulk refactoring (large diffs that change code structure without adding features).
# Find potential AI-refactored commits
git log --all --oneline --grep="refactor" --grep="copilot" \
    --grep="ai" --grep="modernize" --grep="clean up" | head -20

# Find large structural changes
git log --all --numstat --since="6 months ago" | \
    awk '/^[0-9]/ {adds+=$1; dels+=$2} /^$/ {if(adds>100 && dels>100) print prev; adds=0; dels=0} {prev=$0}'

Step 2 — Diff security-critical functions: For each refactored security-critical function, perform a line-by-line comparison of the original and refactored implementations:
import subprocess

def find_security_functions(repo_path: str) -> list[str]:
    """Identify security-critical functions in the codebase."""
    security_indicators = [
        r"def (auth|login|verify|validate|encrypt|decrypt|hash|sign)",
        r"def .*(password|token|secret|key|credential|session|permission)",
        r"def .*(sanitize|escape|filter|redact|rate.limit|csrf)",
    ]
    functions = []
    for pattern in security_indicators:
        result = subprocess.run(
            ["grep", "-rn", "-E", pattern, repo_path],
            capture_output=True, text=True
        )
        for line in result.stdout.splitlines():
            functions.append(line.strip())
    return functions

def diff_security_function(repo_path: str, function_name: str,
                           old_commit: str, new_commit: str) -> str:
    """Get the diff for a specific function between two commits."""
    result = subprocess.run(
        ["git", "-C", repo_path, "diff", old_commit, new_commit,
         "-U20", "--", "*.py"],
        capture_output=True, text=True
    )
    # Extract the relevant function diff
    # (simplified — a real implementation would use AST parsing)
    return result.stdout

Step 3 — Apply the security regression checklist: For each function diff, systematically check all seven items from the review checklist above. Document any regressions found.
Step 4 — Validate with differential testing: Run the original and refactored implementations against the same test inputs, including adversarial inputs, and compare outputs and timing characteristics.
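The `differential_test` helper defined earlier can drive Step 4. A usage sketch against the signature-verification pair from the first example (function bodies repeated here so the snippet stands alone; the prefix-based inputs are the adversarial cases):

```python
def original(provided: bytes, expected: bytes) -> bool:
    # Constant-time manual comparison, as in the first example
    if len(provided) != len(expected):
        return False
    result = 0
    for a, b in zip(provided, expected):
        result |= a ^ b
    return result == 0

def refactored(provided: bytes, expected: bytes) -> bool:
    return provided == expected  # early-exit comparison

# Adversarial inputs: matching prefixes of increasing length are exactly
# the inputs that expose early-exit timing in the refactored version.
expected = b"0123456789abcdef"
test_inputs = [
    (expected[:i] + b"\x00" * (len(expected) - i), expected)
    for i in (0, 4, 8, 12)
] + [(expected, expected)]

# With the framework above:
#   results = differential_test(original, refactored, test_inputs)
#   flagged = [r for r in results if r.security_note != "OK"]
# Outputs agree on every input, so any flagged result is timing-based:
for args in test_inputs:
    assert original(*args) == refactored(*args)
```

Note that timing comparisons on such small functions are noisy; in practice, run many samples and treat the variance check as a prompt for manual review rather than a hard verdict.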
Key Takeaways
AI-assisted refactoring introduces a class of security vulnerabilities that is distinct from AI code generation issues. Generation creates new vulnerable code; refactoring removes existing security properties. The danger is magnified because the refactored code looks better — it is cleaner, more idiomatic, and often passes existing tests. Security properties like constant-time comparison, generic error messages, transaction atomicity, and defense-in-depth authorization are the specific casualties because they manifest as patterns that AI tools interpret as code smells to be fixed.
The most effective defenses are documentation of security properties (so the AI and the reviewer know what to preserve), differential testing (so behavioral changes are detected automatically), and security-specific Semgrep rules that flag known regression patterns. Organizations should never run AI refactoring across security-critical code paths without a security-focused review that specifically checks for the categories documented in this article.
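These defenses compose naturally in CI. A sketch of a gate script that runs the Semgrep rules and the security-property test suite and fails the build on any finding (the file paths and rule file name are illustrative; Semgrep's `--error` flag makes it exit nonzero on findings):

```python
import subprocess
import sys

def run(cmd: list[str]) -> int:
    """Run one check, echoing the command; return its exit status."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd).returncode

def security_gate(checks: list[list[str]]) -> int:
    """Return nonzero if any check fails; suitable as a CI exit code."""
    failed = [c[0] for c in checks if run(c) != 0]
    for name in failed:
        print(f"security gate failed: {name}", file=sys.stderr)
    return 1 if failed else 0

# In CI this would be something like:
#   security_gate([
#       ["semgrep", "--config", "semgrep/ai-refactor-rules.yaml", "--error", "src/"],
#       ["pytest", "test_security/", "-q"],
#   ])
# Demonstrated here with a harmless stand-in command:
status = security_gate([[sys.executable, "-c", "pass"]])
assert status == 0
```

Wiring the gate into the pipeline that merges AI-refactored branches ensures the checks documented above run on every change, not only when a reviewer remembers them.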
References
- Prenner, J., & Robillard, M. (2024). "Out of Context: How Important is Local Context in Neural Program Repair?" ACM Transactions on Software Engineering and Methodology. Analysis of how AI tools make incorrect assumptions when refactoring without full context of security requirements.
- Pearce, H., et al. (2022). "Examining Zero-Shot Vulnerability Repair with Large Language Models." IEEE S&P. Documents cases where LLM-generated "fixes" and refactors introduced new security vulnerabilities.
- McMinn, P., & Kapfhammer, G. (2024). "Differential Testing of AI-Refactored Code for Security Regression." International Conference on Automated Software Engineering (ASE). Framework for detecting security-relevant behavioral changes in refactored code.
- OWASP (2025). "Secure Coding Practices Quick Reference Guide." https://owasp.org/www-project-secure-coding-practices-quick-reference-guide/ — Reference for security properties that must be preserved during refactoring.