Secrets Rotation for Cloud AI Deployments

advanced · 18 min read · Updated 2026-03-21

Implementing automated secrets rotation strategies for API keys, model endpoint credentials, and service accounts used in cloud AI/LLM deployments across AWS, Azure, and GCP.

Tags: cloud, secrets, rotation, credentials, automation

Overview

Cloud AI deployments present unique secrets management challenges that traditional rotation strategies do not adequately address. A standard web application might have a database password and an API key to rotate. An enterprise LLM application can have dozens of credential types: model provider API keys (OpenAI, Anthropic, Cohere), cloud AI service credentials (Bedrock, Azure OpenAI, Vertex AI), vector database authentication tokens, embedding service keys, guardrail service credentials, monitoring API tokens, and service accounts for fine-tuning pipelines. Each credential type has different rotation constraints, blast radius considerations, and downtime implications.

The problem is compounded by AI-specific operational patterns. Fine-tuning jobs run for hours or days and hold credentials for the entire duration. RAG pipelines maintain persistent connections to vector databases. Agent orchestration frameworks cache credentials across multiple tool calls. Model serving endpoints may cache authentication tokens in warm inference containers. A naive rotation strategy that simply swaps credentials on a schedule will cause cascading failures across these interdependent services.

This article presents a secrets rotation framework designed specifically for cloud AI deployments. The framework handles the full lifecycle: credential inventory, rotation scheduling, zero-downtime swap procedures, validation, rollback, and monitoring. The examples use AWS Secrets Manager via boto3, and the same structure can be adapted to the equivalent services on other cloud providers.

AI-Specific Credential Taxonomy

Before implementing rotation, you must inventory every credential type in your AI deployment and understand its rotation constraints.

# secrets_rotation/taxonomy.py
"""Credential taxonomy and rotation constraints for AI deployments."""
 
from __future__ import annotations
 
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
 
class CredentialType(Enum):
    """Types of credentials used in cloud AI deployments."""
    MODEL_PROVIDER_API_KEY = "model_provider_api_key"
    CLOUD_AI_SERVICE_KEY = "cloud_ai_service_key"
    CLOUD_SERVICE_ACCOUNT = "cloud_service_account"
    VECTOR_DB_TOKEN = "vector_db_token"
    EMBEDDING_SERVICE_KEY = "embedding_service_key"
    GUARDRAIL_SERVICE_KEY = "guardrail_service_key"
    MONITORING_API_KEY = "monitoring_api_key"
    OAUTH_CLIENT_SECRET = "oauth_client_secret"
    FINE_TUNING_SERVICE_ACCOUNT = "fine_tuning_service_account"
    MCP_SERVER_TOKEN = "mcp_server_token"
 
class RotationConstraint(Enum):
    """Constraints that affect how a credential can be rotated."""
    SUPPORTS_DUAL_KEY = "supports_dual_key"        # Provider supports two active keys simultaneously
    IMMEDIATE_REVOCATION = "immediate_revocation"    # Old key stops working immediately after rotation
    GRACE_PERIOD = "grace_period"                    # Old key works for a configurable period after rotation
    LONG_RUNNING_JOBS = "long_running_jobs"          # Credential may be held by long-running processes
    CACHED_IN_CONTAINERS = "cached_in_containers"    # Credential may be cached in warm serving containers
    MULTI_SERVICE_DEPENDENCY = "multi_service"       # Multiple services use this credential
    REQUIRES_RESTART = "requires_restart"            # Service must restart to pick up new credential
 
@dataclass
class AICredential:
    """A credential used in an AI deployment with rotation metadata."""
    credential_id: str
    name: str
    credential_type: CredentialType
    secrets_manager_arn: str          # Where the credential is stored
    rotation_interval_days: int       # How often to rotate
    last_rotated: Optional[str] = None
    rotation_constraints: list[RotationConstraint] = field(default_factory=list)
    dependent_services: list[str] = field(default_factory=list)
    blast_radius: str = "low"         # low, medium, high, critical
    rotation_tested: bool = False
    notes: str = ""
 
# Credential catalog for a typical enterprise AI deployment
CREDENTIAL_CATALOG: list[AICredential] = [
    AICredential(
        credential_id="cred-openai-api-key",
        name="OpenAI API Key (Production)",
        credential_type=CredentialType.MODEL_PROVIDER_API_KEY,
        secrets_manager_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:ai/openai-api-key",
        rotation_interval_days=90,
        rotation_constraints=[
            RotationConstraint.SUPPORTS_DUAL_KEY,  # OpenAI allows multiple active keys
            RotationConstraint.CACHED_IN_CONTAINERS,
            RotationConstraint.MULTI_SERVICE_DEPENDENCY,
        ],
        dependent_services=["chatbot-api", "rag-pipeline", "eval-service"],
        blast_radius="critical",
        notes="OpenAI allows creating new keys without revoking old ones. Use dual-key rotation.",
    ),
    AICredential(
        credential_id="cred-pinecone-token",
        name="Pinecone Vector DB API Key",
        credential_type=CredentialType.VECTOR_DB_TOKEN,
        secrets_manager_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:ai/pinecone-key",
        rotation_interval_days=90,
        rotation_constraints=[
            RotationConstraint.IMMEDIATE_REVOCATION,
            RotationConstraint.CACHED_IN_CONTAINERS,
        ],
        dependent_services=["rag-pipeline", "embedding-indexer"],
        blast_radius="high",
        notes="Pinecone key rotation invalidates the old key immediately. Requires careful coordination.",
    ),
    AICredential(
        credential_id="cred-bedrock-service-role",
        name="AWS Bedrock Service Role",
        credential_type=CredentialType.CLOUD_SERVICE_ACCOUNT,
        secrets_manager_arn="",  # IAM role — no secret stored
        rotation_interval_days=365,
        rotation_constraints=[
            RotationConstraint.LONG_RUNNING_JOBS,
        ],
        dependent_services=["bedrock-inference", "bedrock-fine-tuning"],
        blast_radius="critical",
        notes="IAM role uses temporary credentials via STS. Rotate access key if using static keys.",
    ),
    AICredential(
        credential_id="cred-finetuning-sa",
        name="Fine-tuning Pipeline Service Account",
        credential_type=CredentialType.FINE_TUNING_SERVICE_ACCOUNT,
        secrets_manager_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:ai/finetune-sa",
        rotation_interval_days=180,
        rotation_constraints=[
            RotationConstraint.LONG_RUNNING_JOBS,
            RotationConstraint.REQUIRES_RESTART,
        ],
        dependent_services=["fine-tuning-pipeline"],
        blast_radius="medium",
        notes="Fine-tuning jobs hold credentials for 2-48 hours. Schedule rotation during job gaps.",
    ),
]

Rotation Strategy Implementation

The Dual-Key Rotation Pattern

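For credentials that support multiple active keys simultaneously (OpenAI, most cloud providers), use the dual-key pattern, which enables zero-downtime rotation: the new key is created and distributed while the old key is still valid, and the old key is revoked only after consumers have switched.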

# secrets_rotation/dual_key_rotator.py
"""Dual-key rotation for AI service credentials with zero downtime."""
 
from __future__ import annotations
 
import json
import logging
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
 
import boto3
 
logger = logging.getLogger(__name__)
 
@dataclass
class RotationResult:
    """Result of a credential rotation operation."""
    credential_id: str
    success: bool
    old_key_revoked: bool
    new_key_validated: bool
    rollback_performed: bool
    duration_seconds: float
    error: Optional[str] = None
    timestamp: str = ""
 
    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.utcnow().isoformat()
 
class CredentialProvider(ABC):
    """Abstract interface for credential providers that support rotation."""
 
    @abstractmethod
    def create_new_key(self) -> dict:
        """Create a new API key/credential. Returns key metadata."""
        ...
 
    @abstractmethod
    def validate_key(self, key: str) -> bool:
        """Validate that a key is functional by making a test API call."""
        ...
 
    @abstractmethod
    def revoke_key(self, key_id: str) -> bool:
        """Revoke an old API key."""
        ...
 
    @abstractmethod
    def list_active_keys(self) -> list[dict]:
        """List all active keys for this credential."""
        ...
 
class OpenAIKeyProvider(CredentialProvider):
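    """Manages OpenAI API key rotation.

    The endpoint paths and response fields used below are illustrative;
    verify them against the current OpenAI API reference before use.
    """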
 
    def __init__(self, admin_key: str, organization_id: str):
        self.admin_key = admin_key
        self.org_id = organization_id
 
    def create_new_key(self) -> dict:
        """Create a new OpenAI API key via the admin API."""
        import httpx
 
        response = httpx.post(
            "https://api.openai.com/v1/organization/api_keys",
            headers={
                "Authorization": f"Bearer {self.admin_key}",
                "OpenAI-Organization": self.org_id,
            },
            json={
                "name": f"auto-rotated-{datetime.utcnow().strftime('%Y%m%d-%H%M%S')}",
            },
        )
        response.raise_for_status()
        data = response.json()
        return {"key_id": data["id"], "key_value": data["key"]}
 
    def validate_key(self, key: str) -> bool:
        """Validate an OpenAI key by listing models."""
        import httpx
 
        try:
            response = httpx.get(
                "https://api.openai.com/v1/models",
                headers={"Authorization": f"Bearer {key}"},
                timeout=10,
            )
            return response.status_code == 200
        except Exception:
            return False
 
    def revoke_key(self, key_id: str) -> bool:
        """Revoke an OpenAI API key."""
        import httpx
 
        try:
            response = httpx.delete(
                f"https://api.openai.com/v1/organization/api_keys/{key_id}",
                headers={
                    "Authorization": f"Bearer {self.admin_key}",
                    "OpenAI-Organization": self.org_id,
                },
            )
            return response.status_code in (200, 204)
        except Exception:
            return False
 
    def list_active_keys(self) -> list[dict]:
        """List all active OpenAI API keys."""
        import httpx
 
        response = httpx.get(
            "https://api.openai.com/v1/organization/api_keys",
            headers={
                "Authorization": f"Bearer {self.admin_key}",
                "OpenAI-Organization": self.org_id,
            },
        )
        response.raise_for_status()
        return response.json().get("data", [])
 
class DualKeyRotator:
    """
    Implements zero-downtime dual-key rotation for AI service credentials.
 
    The rotation follows a six-phase process:
    1. READ: Load the current credential from secrets manager
    2. CREATE: Generate a new credential
    3. VALIDATE: Verify the new credential works
    4. UPDATE: Store the new credential in secrets manager
    5. PROPAGATE: Wait for all consumers to pick up the new credential
    6. REVOKE: Delete the old credential
    """
 
    def __init__(
        self,
        secrets_client: boto3.client,
        credential_provider: CredentialProvider,
        propagation_wait_seconds: int = 300,
    ):
        self.secrets_client = secrets_client
        self.provider = credential_provider
        self.propagation_wait = propagation_wait_seconds
 
    def rotate(
        self,
        secret_arn: str,
        credential_id: str,
        dry_run: bool = False,
    ) -> RotationResult:
        """
        Execute the full dual-key rotation process.
 
        Args:
            secret_arn: AWS Secrets Manager ARN for the credential.
            credential_id: Internal identifier for the credential.
            dry_run: If True, validate but do not actually rotate.
 
        Returns:
            RotationResult with success/failure status and details.
        """
        start_time = time.monotonic()
 
        try:
            # Phase 1: READ current secret
            logger.info(f"[{credential_id}] Phase 1: Reading current secret")
            current_secret = self._read_secret(secret_arn)
            old_key_id = current_secret.get("key_id", "")
            old_key_value = current_secret.get("key_value", "")
 
            if dry_run:
                # Dry run: confirm the secret is readable and the current key
                # still works. No key is created, stored, or revoked.
                is_valid = self.provider.validate_key(old_key_value)
                elapsed = time.monotonic() - start_time
                return RotationResult(
                    credential_id=credential_id,
                    success=is_valid,
                    old_key_revoked=False,
                    new_key_validated=False,
                    rollback_performed=False,
                    duration_seconds=elapsed,
                    error=None if is_valid else "Current key failed validation",
                )
 
            # Phase 2: CREATE new key
            logger.info(f"[{credential_id}] Phase 2: Creating new key")
            new_key = self.provider.create_new_key()
 
            # Phase 3: VALIDATE new key
            logger.info(f"[{credential_id}] Phase 3: Validating new key")
            if not self.provider.validate_key(new_key["key_value"]):
                # New key doesn't work — abort and revoke it
                logger.error(f"[{credential_id}] New key validation failed, aborting")
                self.provider.revoke_key(new_key["key_id"])
                elapsed = time.monotonic() - start_time
                return RotationResult(
                    credential_id=credential_id,
                    success=False,
                    old_key_revoked=False,
                    new_key_validated=False,
                    rollback_performed=True,
                    duration_seconds=elapsed,
                    error="New key validation failed",
                )
 
            # Phase 4: UPDATE secrets manager with new key
            logger.info(f"[{credential_id}] Phase 4: Updating secrets manager")
            self._update_secret(secret_arn, {
                "key_id": new_key["key_id"],
                "key_value": new_key["key_value"],
                "rotated_at": datetime.utcnow().isoformat(),
                "previous_key_id": old_key_id,
            })
 
            # Phase 5: PROPAGATE — wait for consumers to pick up new key
            logger.info(
                f"[{credential_id}] Phase 5: Waiting {self.propagation_wait}s for propagation"
            )
            time.sleep(self.propagation_wait)
 
            # Phase 6: REVOKE old key
            logger.info(f"[{credential_id}] Phase 6: Revoking old key")
            old_revoked = False
            if old_key_id:
                old_revoked = self.provider.revoke_key(old_key_id)
                if not old_revoked:
                    logger.warning(
                        f"[{credential_id}] Failed to revoke old key {old_key_id}. "
                        "Manual cleanup required."
                    )
 
            elapsed = time.monotonic() - start_time
            return RotationResult(
                credential_id=credential_id,
                success=True,
                old_key_revoked=old_revoked,
                new_key_validated=True,
                rollback_performed=False,
                duration_seconds=elapsed,
            )
 
        except Exception as e:
            elapsed = time.monotonic() - start_time
            logger.exception(f"[{credential_id}] Rotation failed: {e}")
            return RotationResult(
                credential_id=credential_id,
                success=False,
                old_key_revoked=False,
                new_key_validated=False,
                rollback_performed=False,
                duration_seconds=elapsed,
                error=str(e),
            )
 
    def _read_secret(self, secret_arn: str) -> dict:
        """Read the current secret value from Secrets Manager."""
        response = self.secrets_client.get_secret_value(SecretId=secret_arn)
        return json.loads(response["SecretString"])
 
    def _update_secret(self, secret_arn: str, new_value: dict) -> None:
        """Update the secret value in Secrets Manager."""
        self.secrets_client.put_secret_value(
            SecretId=secret_arn,
            SecretString=json.dumps(new_value),
        )

Handling Long-Running AI Jobs

Fine-tuning jobs, batch evaluation pipelines, and continuous embedding indexers hold credentials for extended periods. Rotating credentials while these jobs are running requires careful coordination.

# secrets_rotation/job_aware_rotator.py
"""Job-aware rotation that coordinates with long-running AI workloads."""
 
from __future__ import annotations
 
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
 
logger = logging.getLogger(__name__)
 
@dataclass
class RunningJob:
    """A long-running AI job that holds credentials."""
    job_id: str
    job_type: str              # fine_tuning, batch_inference, embedding_index
    credential_ids: list[str]  # Which credentials this job uses
    started_at: datetime
    estimated_completion: Optional[datetime]
    can_be_interrupted: bool
    checkpoint_available: bool  # Can the job resume from a checkpoint?
 
class JobAwareRotationScheduler:
    """Schedules credential rotation around long-running AI jobs."""
 
    def __init__(self, max_rotation_delay_hours: int = 48):
        self.max_delay = timedelta(hours=max_rotation_delay_hours)
        self._running_jobs: list[RunningJob] = []
 
    def register_job(self, job: RunningJob):
        """Register a long-running job that holds credentials."""
        self._running_jobs.append(job)
        logger.info(
            f"Registered job {job.job_id} holding credentials: {job.credential_ids}"
        )
 
    def unregister_job(self, job_id: str):
        """Remove a completed job from the tracker."""
        self._running_jobs = [j for j in self._running_jobs if j.job_id != job_id]
 
    def can_rotate_now(self, credential_id: str) -> dict:
        """
        Check if a credential can be safely rotated right now.
 
        Returns a decision dict with rotation guidance.
        """
        blocking_jobs = [
            j for j in self._running_jobs
            if credential_id in j.credential_ids
        ]
 
        if not blocking_jobs:
            return {
                "can_rotate": True,
                "blocking_jobs": [],
                "recommended_action": "Proceed with immediate rotation",
            }
 
        # Check if all blocking jobs can be interrupted safely
        all_interruptible = all(j.can_be_interrupted for j in blocking_jobs)
        all_checkpointed = all(j.checkpoint_available for j in blocking_jobs)
 
        # Find the latest estimated completion among blocking jobs. Rotation
        # can only be deferred safely if every blocking job has an estimate,
        # because all of them must finish before the old credential is revoked.
        completions = [
            j.estimated_completion for j in blocking_jobs
            if j.estimated_completion
        ]
        latest_completion = (
            max(completions) if len(completions) == len(blocking_jobs) else None
        )

        now = datetime.utcnow()

        if all_interruptible and all_checkpointed:
            return {
                "can_rotate": True,
                "blocking_jobs": [j.job_id for j in blocking_jobs],
                "recommended_action": (
                    "Jobs are interruptible with checkpoints. "
                    "Interrupt jobs, rotate credentials, resume from checkpoint."
                ),
                "pre_rotation_steps": [
                    f"Checkpoint job {j.job_id}" for j in blocking_jobs
                ],
            }

        if latest_completion and (latest_completion - now) < self.max_delay:
            return {
                "can_rotate": False,
                "blocking_jobs": [j.job_id for j in blocking_jobs],
                "recommended_action": (
                    f"Delay rotation until {latest_completion.isoformat()}. "
                    "All blocking jobs will complete within the allowed delay window."
                ),
                "retry_after": latest_completion.isoformat(),
            }
 
        # Jobs are running too long — must force rotation for security
        return {
            "can_rotate": True,
            "blocking_jobs": [j.job_id for j in blocking_jobs],
            "recommended_action": (
                "FORCE ROTATION: Blocking jobs exceed maximum rotation delay. "
                "Security policy requires rotation. Accept that jobs will fail."
            ),
            "force_reason": (
                f"Blocking jobs (earliest started "
                f"{min(j.started_at for j in blocking_jobs).isoformat()}) "
                f"are not expected to finish within the {self.max_delay} "
                f"maximum rotation delay."
            ),
            "post_rotation_steps": [
                f"Restart job {j.job_id} with new credentials"
                for j in blocking_jobs
            ],
        }

Monitoring and Alerting

# secrets_rotation/monitoring.py
"""Monitoring and alerting for AI credential rotation health."""
 
from __future__ import annotations
 
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional
 
@dataclass
class CredentialHealthCheck:
    """Health status of a managed credential."""
    credential_id: str
    name: str
    is_valid: bool
    days_since_rotation: int
    rotation_overdue: bool
    max_age_days: int
    anomalous_usage: bool
    usage_count_24h: int
    usage_count_avg_24h: float
    health_status: str       # healthy, warning, or critical
 
    @property
    def urgency_score(self) -> float:
        """Calculate urgency for rotation prioritization (0-100)."""
        score = 0
        if self.rotation_overdue:
            overdue_days = self.days_since_rotation - self.max_age_days
            score += min(50, overdue_days * 5)
        if self.anomalous_usage:
            score += 30
        if not self.is_valid:
            score = 100  # Invalid credential is maximum urgency
        return min(100, score)
 
class RotationHealthMonitor:
    """Monitors the health of all managed AI credentials."""
 
    def __init__(self, alert_callback=None):
        self.alert_callback = alert_callback
        self._health_history: list[dict] = []
 
    def check_all(
        self,
        credentials: list[dict],
        current_time: Optional[datetime] = None,
    ) -> list[CredentialHealthCheck]:
        """
        Run health checks on all managed credentials.
 
        Args:
            credentials: List of credential metadata dicts with keys:
                credential_id, name, last_rotated, rotation_interval_days,
                is_valid, usage_count_24h, usage_count_avg_24h.
            current_time: Override for testing. Defaults to now.
 
        Returns:
            List of health check results, sorted by urgency.
        """
        now = current_time or datetime.utcnow()
        results = []
 
        for cred in credentials:
            last_rotated = cred.get("last_rotated")
            if isinstance(last_rotated, str):
                last_rotated = datetime.fromisoformat(last_rotated)
 
            days_since = (now - last_rotated).days if last_rotated else 999
            max_age = cred.get("rotation_interval_days", 90)
            overdue = days_since > max_age
 
            # Anomaly detection: usage more than 3x the average and above 100 calls
            usage_24h = cred.get("usage_count_24h", 0)
            usage_avg = cred.get("usage_count_avg_24h", 1)
            anomalous = usage_24h > (usage_avg * 3) and usage_24h > 100
 
            # Determine overall health
            is_valid = cred.get("is_valid", True)
            if not is_valid:
                status = "critical"
            elif overdue and days_since > max_age * 1.5:
                status = "critical"
            elif overdue or anomalous:
                status = "warning"
            else:
                status = "healthy"
 
            check = CredentialHealthCheck(
                credential_id=cred["credential_id"],
                name=cred["name"],
                is_valid=is_valid,
                days_since_rotation=days_since,
                rotation_overdue=overdue,
                max_age_days=max_age,
                anomalous_usage=anomalous,
                usage_count_24h=usage_24h,
                usage_count_avg_24h=usage_avg,
                health_status=status,
            )
            results.append(check)
 
            # Fire alerts for non-healthy credentials
            if status in ("warning", "critical") and self.alert_callback:
                self.alert_callback(check)
 
        # Sort by urgency (most urgent first)
        results.sort(key=lambda c: c.urgency_score, reverse=True)
        return results
 
    def generate_dashboard_data(
        self, checks: list[CredentialHealthCheck]
    ) -> dict:
        """Generate dashboard-ready summary data."""
        total = len(checks)
        healthy = sum(1 for c in checks if c.health_status == "healthy")
        warning = sum(1 for c in checks if c.health_status == "warning")
        critical = sum(1 for c in checks if c.health_status == "critical")
 
        return {
            "total_credentials": total,
            "healthy": healthy,
            "warning": warning,
            "critical": critical,
            "health_rate": healthy / total if total > 0 else 0,
            "overdue_rotations": sum(1 for c in checks if c.rotation_overdue),
            "anomalous_usage": sum(1 for c in checks if c.anomalous_usage),
            "top_urgency": [
                {
                    "credential_id": c.credential_id,
                    "name": c.name,
                    "status": c.health_status,
                    "urgency_score": c.urgency_score,
                    "days_since_rotation": c.days_since_rotation,
                }
                for c in checks[:5]  # Top 5 most urgent
            ],
        }

Rotation Schedules by Credential Type

Credential Type	Recommended Interval	Rotation Pattern	Special Considerations
Model provider API keys (OpenAI, Anthropic)	90 days	Dual-key	Supports multiple active keys; rotate without downtime
Cloud AI service keys (Bedrock, Azure OpenAI)	Use IAM roles instead	Role-based	Prefer temporary credentials via STS/managed identities
Vector database tokens (Pinecone, Weaviate)	90 days	Blue-green	May require client restart; schedule during maintenance
Fine-tuning service accounts	180 days	Job-aware	Schedule between fine-tuning runs
MCP server tokens	30 days	Dual-key	Short-lived tokens preferred; rotate frequently
Monitoring API keys	180 days	Simple swap	Low blast radius; can tolerate brief monitoring gaps
OAuth client secrets	365 days	Dual-client	Register new client, migrate, decommission old client
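
As a rough illustration, the intervals in the table can be encoded as a lookup so that a scheduler computes due dates from one place. The names `ROTATION_POLICY` and `next_rotation_due`, and the pattern labels, are hypothetical; the keys reuse the `CredentialType` values from the taxonomy above. Cloud AI service keys are omitted because the table recommends IAM roles rather than a fixed interval.

```python
from datetime import date, timedelta

# Hypothetical policy table: credential type -> (interval in days, pattern).
# The values mirror the schedule table in this article.
ROTATION_POLICY = {
    "model_provider_api_key": (90, "dual-key"),
    "vector_db_token": (90, "blue-green"),
    "fine_tuning_service_account": (180, "job-aware"),
    "mcp_server_token": (30, "dual-key"),
    "monitoring_api_key": (180, "simple-swap"),
    "oauth_client_secret": (365, "dual-client"),
}


def next_rotation_due(credential_type: str, last_rotated: date) -> date:
    """Return the date by which the credential should next be rotated."""
    interval_days, _pattern = ROTATION_POLICY[credential_type]
    return last_rotated + timedelta(days=interval_days)
```

Keeping the policy in data rather than in scheduler logic makes it easier to review and to tighten an interval for a single credential type.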

AI-Specific Rotation Challenges

Credential Caching in Model Serving Containers

Cloud AI inference containers — whether ECS tasks, Kubernetes pods, or serverless functions — often cache credentials at startup for performance reasons. When a credential is rotated in the secrets manager, the running container continues using the old credential until it is restarted or explicitly refreshed. This creates a window where the old credential must remain valid.

To handle this correctly, implement a credential refresh mechanism in your inference layer that periodically checks the secrets manager for updated values. Set the refresh interval shorter than the propagation wait in your rotation procedure. For example, if your rotation procedure waits 5 minutes for propagation, configure containers to refresh credentials every 3 minutes. This ensures all containers pick up the new credential before the old one is revoked.

For serverless AI inference (AWS Lambda, Azure Functions), a cold start picks up the new credential only if the function reads it from the secrets manager during initialization; a value copied into an environment variable at deploy time stays fixed until the configuration is updated. Warm instances may also cache credentials for the lifetime of the execution environment, which can persist for hours. Force credential refresh by implementing an in-process TTL on cached credentials.
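
A minimal sketch of such an in-process TTL cache follows. The `TTLSecretCache` class and its `fetch` callable are assumptions for illustration; in practice `fetch` would wrap a secrets manager read. The 180-second default matches the 3-minute refresh example above.

```python
import time
from typing import Callable, Optional


class TTLSecretCache:
    """Caches a secret in process and re-fetches it once the TTL expires."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 180.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch      # hypothetical callable that reads the secrets manager
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the TTL can be tested without sleeping
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached secret, refreshing it if it is older than the TTL."""
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value

    def invalidate(self) -> None:
        """Drop the cached value, for example after an authentication failure."""
        self._value = None
```

Calling `invalidate()` when the provider rejects a request lets a container recover immediately from a revoked key instead of waiting for the TTL to elapse.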

Multi-Region Deployments

Organizations that deploy AI services across multiple cloud regions face additional complexity. A credential rotated in the primary region must propagate to all secondary regions before the old credential is revoked. Secret replication features (for example, AWS Secrets Manager multi-Region replication or Google Cloud Secret Manager replication policies) handle the storage propagation where they are available, but the application-level refresh window must also account for cross-region replication latency.

For critical AI credentials, implement a multi-region rotation workflow:

Create the new credential in the primary region
Wait for cross-region replication to complete (verify with a read from each secondary region)
Wait for application containers in all regions to refresh
Revoke the old credential

The total propagation window for multi-region deployments is typically 10-15 minutes, compared to 5 minutes for single-region deployments.
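
The replication check in the workflow above might look like the following sketch. It assumes a hypothetical `readers` mapping from region name to a callable that returns the key ID currently stored in that region; how each reader talks to the regional secrets store is left out.

```python
import time
from typing import Callable


def wait_for_replication(
    readers: dict[str, Callable[[], str]],
    expected_key_id: str,
    timeout_seconds: float = 900.0,
    poll_seconds: float = 15.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Poll each region until it returns the expected key ID.

    Returns the regions that were still stale when the timeout expired.
    An empty list means every region has the new credential.
    """
    pending = set(readers)
    deadline = clock() + timeout_seconds
    while pending:
        for region in sorted(pending):
            try:
                if readers[region]() == expected_key_id:
                    pending.discard(region)
            except Exception:
                # Treat read errors as "not replicated yet" and retry.
                pass
        if not pending or clock() >= deadline:
            break
        sleep(poll_seconds)
    return sorted(pending)
```

If the returned list is non-empty, the rotation should stop before revocation, since at least one region would still hand out the old credential.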

Agent and MCP Server Credentials

AI agent architectures that use the Model Context Protocol (MCP) or similar tool-use frameworks introduce credential chains. The agent itself has a credential to access the LLM, each MCP server has its own authentication token, and each MCP server may hold credentials for downstream services (databases, APIs, file systems). Rotating the agent's LLM credential is straightforward, but rotating an MCP server token may require restarting the agent's connection to that server, which interrupts any in-progress tool calls.

For MCP server credentials, implement a graceful rotation pattern:

Create a new MCP server instance with the new credentials
Configure the agent to connect to both old and new instances
Drain in-progress requests from the old instance
Remove the old instance from the agent's server configuration
Revoke the old credentials and decommission the old instance

This blue-green pattern for MCP servers avoids interrupting agent workflows during rotation.
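
The drain step can be sketched as a small router that sends new tool calls to non-draining instances and removes an instance only after its in-flight calls finish. `ServerInstance` and `BlueGreenRouter` are hypothetical names and are not part of any MCP SDK; a real agent would hook `acquire` and `release` into its tool-call lifecycle.

```python
from dataclasses import dataclass


@dataclass
class ServerInstance:
    name: str
    in_flight: int = 0      # tool calls currently running on this instance
    draining: bool = False  # True once no new calls should be routed here


class BlueGreenRouter:
    """Routes new calls to active instances and lets draining ones finish."""

    def __init__(self) -> None:
        self._instances: dict[str, ServerInstance] = {}

    def add(self, name: str) -> None:
        self._instances[name] = ServerInstance(name)

    def start_drain(self, name: str) -> None:
        self._instances[name].draining = True

    def acquire(self) -> str:
        """Pick an active instance for a new tool call."""
        for inst in self._instances.values():
            if not inst.draining:
                inst.in_flight += 1
                return inst.name
        raise RuntimeError("no active MCP server instance")

    def release(self, name: str) -> None:
        """Mark a tool call on the named instance as finished."""
        self._instances[name].in_flight -= 1

    def remove_if_drained(self, name: str) -> bool:
        """Remove a draining instance once it has no in-flight calls."""
        inst = self._instances[name]
        if inst.draining and inst.in_flight == 0:
            del self._instances[name]
            return True
        return False
```

Only after `remove_if_drained` returns `True` for the old instance is it safe to revoke the old credentials.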

Emergency Rotation Procedures

When a credential is suspected of being compromised, the standard rotation procedure is too slow. Emergency rotation must prioritize revocation speed over zero-downtime guarantees. The emergency rotation procedure:

Immediately create a new credential
Immediately revoke the compromised credential (accept downtime)
Update the secrets manager with the new credential
Force-restart all dependent services to pick up the new credential
Monitor for failures and address them reactively

This procedure prioritizes security over availability. The acceptable downtime window for emergency rotation should be documented in advance and agreed upon with application teams. For most AI applications, a 2-5 minute disruption during an emergency credential rotation is preferable to continued exposure of a compromised credential.
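
The ordering of the emergency steps can be sketched as follows. The `emergency_rotate` function, the `KeyProvider` protocol, and the `store_secret` and `restart_service` callables are assumptions for illustration; the protocol mirrors the `create_new_key` and `revoke_key` methods of the `CredentialProvider` interface shown earlier.

```python
from typing import Callable, Protocol


class KeyProvider(Protocol):
    def create_new_key(self) -> dict: ...
    def revoke_key(self, key_id: str) -> bool: ...


def emergency_rotate(
    provider: KeyProvider,
    compromised_key_id: str,
    store_secret: Callable[[dict], None],
    restart_service: Callable[[str], None],
    dependent_services: list[str],
) -> dict:
    """Revoke first, then restore service. Downtime is accepted by design."""
    new_key = provider.create_new_key()
    # Revoke before propagation: speed of revocation beats availability here.
    revoked = provider.revoke_key(compromised_key_id)
    store_secret(new_key)

    restart_failures = []
    for service in dependent_services:
        try:
            restart_service(service)
        except Exception as exc:
            # Keep going so one failed restart does not block the others.
            restart_failures.append((service, str(exc)))

    return {
        "new_key_id": new_key["key_id"],
        "compromised_key_revoked": revoked,
        "restart_failures": restart_failures,
    }
```

The returned `restart_failures` list gives the on-call engineer the services that still need manual attention.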

Testing Your Rotation Procedures

Rotation procedures that are not regularly tested will fail when they are needed most. Implement the following testing cadence:

Monthly dry runs: Execute the rotation procedure with dry_run=True to verify that all automation steps complete without errors.
Quarterly live rotations: Perform actual credential rotations in a staging environment that mirrors production. Verify zero-downtime behavior and measure propagation timing.
Annual emergency drill: Simulate a credential compromise scenario and execute the emergency rotation procedure end-to-end. Measure time-to-revocation and time-to-recovery.
Chaos testing: Inject failures into the rotation process (simulated secrets manager outage, network partition during propagation) to verify that rollback procedures work correctly.

Common Failure Modes

Understanding how rotation procedures fail helps design more resilient automation:

Partial rotation: The new key is created and stored but the old key is never revoked, leaving two active keys. This is the most benign failure mode but creates credential sprawl over time. Implement a reconciliation job that identifies credentials with multiple active keys and alerts the security team.
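
A reconciliation job of this kind reduces to a comparison between what the provider reports and what the secrets manager records. The sketch below assumes both views have already been collected into plain dictionaries; `find_unrecorded_active_keys` and its parameter names are hypothetical.

```python
from __future__ import annotations


def find_unrecorded_active_keys(
    active_key_ids: dict[str, list[str]],
    recorded_key_id: dict[str, str],
) -> dict[str, list[str]]:
    """Map each credential to active key IDs the secrets manager does not record.

    active_key_ids:  credential_id -> key IDs the provider reports as active
    recorded_key_id: credential_id -> key ID currently stored in the secrets manager
    """
    sprawl: dict[str, list[str]] = {}
    for credential_id, key_ids in active_key_ids.items():
        current = recorded_key_id.get(credential_id)
        # Any active key other than the recorded one is a leftover
        extras = sorted(k for k in key_ids if k != current)
        if extras:
            sprawl[credential_id] = extras
    return sprawl
```

Two active keys are expected while a rotation is inside its propagation wait, so a job like this should skip credentials that are currently being rotated before alerting.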

Premature revocation: The old key is revoked before all consumers pick up the new key, causing service outages. This typically happens when the propagation wait is too short or when a container has a longer credential cache TTL than expected. The fix is to increase the propagation wait and implement pre-revocation validation that checks all known consumers are using the new key.
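
Pre-revocation validation can be expressed as a gate that must pass before the old key is revoked. This sketch assumes each consumer can report the ID of the key it is currently using (for example through a health endpoint, which is an assumption and not something the rotation code in this article provides); the function names are hypothetical.

```python
from __future__ import annotations


def consumers_on_old_key(
    reported_key_ids: dict[str, str | None],
    new_key_id: str,
) -> list[str]:
    """Return consumers that have not confirmed the new key.

    reported_key_ids maps a consumer name to the key ID it reports using;
    None means the consumer did not answer, which is treated as not ready.
    """
    return sorted(
        name for name, key_id in reported_key_ids.items()
        if key_id != new_key_id
    )


def safe_to_revoke(reported_key_ids: dict[str, str | None], new_key_id: str) -> bool:
    """Revocation is safe only when every known consumer reports the new key."""
    # An empty consumer list is treated as unsafe: nothing has been confirmed
    return bool(reported_key_ids) and not consumers_on_old_key(
        reported_key_ids, new_key_id
    )
```

Treating a silent consumer as "still on the old key" errs on the side of delaying revocation, which costs time but not availability.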

Rotation loop: Automated rotation triggers an alert, which triggers a response, which triggers another rotation. This can happen when the monitoring system flags the expected post-rotation usage pattern change as anomalous. Configure alert suppression windows around scheduled rotations to prevent this cycle.
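
A suppression window can be modeled as a time interval attached to a credential. The sketch below is one possible shape, with hypothetical names throughout and an assumed 30-minute settle period after the propagation wait. It suppresses only usage-anomaly alerts, so an alert about an invalid key still fires during a rotation.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SuppressionWindow:
    """Interval during which usage-anomaly alerts for one credential are muted."""
    credential_id: str
    start: datetime
    end: datetime


def window_for_rotation(
    credential_id: str,
    rotation_start: datetime,
    propagation_wait_seconds: int = 300,
    settle_minutes: int = 30,  # assumed time for usage patterns to normalize
) -> SuppressionWindow:
    """Cover the propagation wait plus a settle period after it."""
    end = rotation_start + timedelta(
        seconds=propagation_wait_seconds, minutes=settle_minutes
    )
    return SuppressionWindow(credential_id, rotation_start, end)


def should_suppress(
    alert_kind: str,
    credential_id: str,
    alert_time: datetime,
    windows: list[SuppressionWindow],
) -> bool:
    """Mute only usage anomalies that fall inside a window for the same credential."""
    if alert_kind != "anomalous_usage":
        return False
    return any(
        w.credential_id == credential_id and w.start <= alert_time <= w.end
        for w in windows
    )
```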

Secrets manager outage during rotation: If the secrets manager becomes unavailable between creating the new key and storing it, the new key exists at the provider but is not recorded anywhere. The automation should verify that the secrets manager is reachable before creating the key, and write the new key details to a fallback location (encrypted local file, separate secrets store) as soon as the provider returns them, so that the key can be either recovered or revoked from this state.

References

AWS. "Rotate AWS Secrets Manager Secrets," https://docs.aws.amazon.com/secretsmanager/latest/userguide/rotating-secrets.html
Microsoft. "Key Vault Secrets Rotation," https://learn.microsoft.com/en-us/azure/key-vault/secrets/tutorial-rotation
NIST SP 800-57, "Recommendation for Key Management," https://csrc.nist.gov/publications/detail/sp/800-57-part-1/rev-5/final
OpenAI. "API Key Best Practices," https://platform.openai.com/docs/guides/safety-best-practices


AI-Specific Credential Taxonomy

Before implementing rotation, you must inventory every credential type in your AI deployment and understand its rotation constraints.

# secrets_rotation/taxonomy.py
"""Credential taxonomy and rotation constraints for AI deployments."""
 
from __future__ import annotations
 
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
 
class CredentialType(Enum):
    """Types of credentials used in cloud AI deployments."""
    MODEL_PROVIDER_API_KEY = "model_provider_api_key"
    CLOUD_AI_SERVICE_KEY = "cloud_ai_service_key"
    CLOUD_SERVICE_ACCOUNT = "cloud_service_account"
    VECTOR_DB_TOKEN = "vector_db_token"
    EMBEDDING_SERVICE_KEY = "embedding_service_key"
    GUARDRAIL_SERVICE_KEY = "guardrail_service_key"
    MONITORING_API_KEY = "monitoring_api_key"
    OAUTH_CLIENT_SECRET = "oauth_client_secret"
    FINE_TUNING_SERVICE_ACCOUNT = "fine_tuning_service_account"
    MCP_SERVER_TOKEN = "mcp_server_token"
 
class RotationConstraint(Enum):
    """Constraints that affect how a credential can be rotated."""
    SUPPORTS_DUAL_KEY = "supports_dual_key"        # Provider supports two active keys simultaneously
    IMMEDIATE_REVOCATION = "immediate_revocation"    # Old key stops working immediately after rotation
    GRACE_PERIOD = "grace_period"                    # Old key works for a configurable period after rotation
    LONG_RUNNING_JOBS = "long_running_jobs"          # Credential may be held by long-running processes
    CACHED_IN_CONTAINERS = "cached_in_containers"    # Credential may be cached in warm serving containers
    MULTI_SERVICE_DEPENDENCY = "multi_service"       # Multiple services use this credential
    REQUIRES_RESTART = "requires_restart"            # Service must restart to pick up new credential
 
@dataclass
class AICredential:
    """A credential used in an AI deployment with rotation metadata."""
    credential_id: str
    name: str
    credential_type: CredentialType
    secrets_manager_arn: str          # Where the credential is stored
    rotation_interval_days: int       # How often to rotate
    last_rotated: Optional[str] = None
    rotation_constraints: list[RotationConstraint] = field(default_factory=list)
    dependent_services: list[str] = field(default_factory=list)
    blast_radius: str = "low"         # low, medium, high, critical
    rotation_tested: bool = False
    notes: str = ""
 
# Credential catalog for a typical enterprise AI deployment
CREDENTIAL_CATALOG: list[AICredential] = [
    AICredential(
        credential_id="cred-openai-api-key",
        name="OpenAI API Key (Production)",
        credential_type=CredentialType.MODEL_PROVIDER_API_KEY,
        secrets_manager_arn="arn:aws:secretsmanager:us-east-1:123456789:secret:ai/openai-api-key",
        rotation_interval_days=90,
        rotation_constraints=[
            RotationConstraint.SUPPORTS_DUAL_KEY,  # OpenAI allows multiple active keys
            RotationConstraint.CACHED_IN_CONTAINERS,
            RotationConstraint.MULTI_SERVICE_DEPENDENCY,
        ],
        dependent_services=["chatbot-api", "rag-pipeline", "eval-service"],
        blast_radius="critical",
        notes="OpenAI allows creating new keys without revoking old ones. Use dual-key rotation.",
    ),
    AICredential(
        credential_id="cred-pinecone-token",
        name="Pinecone Vector DB API Key",
        credential_type=CredentialType.VECTOR_DB_TOKEN,
        secrets_manager_arn="arn:aws:secretsmanager:us-east-1:123456789:secret:ai/pinecone-key",
        rotation_interval_days=90,
        rotation_constraints=[
            RotationConstraint.IMMEDIATE_REVOCATION,
            RotationConstraint.CACHED_IN_CONTAINERS,
        ],
        dependent_services=["rag-pipeline", "embedding-indexer"],
        blast_radius="high",
        notes="Pinecone key rotation invalidates the old key immediately. Requires careful coordination.",
    ),
    AICredential(
        credential_id="cred-bedrock-service-role",
        name="AWS Bedrock Service Role",
        credential_type=CredentialType.CLOUD_SERVICE_ACCOUNT,
        secrets_manager_arn="",  # IAM role — no secret stored
        rotation_interval_days=365,
        rotation_constraints=[
            RotationConstraint.LONG_RUNNING_JOBS,
        ],
        dependent_services=["bedrock-inference", "bedrock-fine-tuning"],
        blast_radius="critical",
        notes="IAM role uses temporary credentials via STS. Rotate access key if using static keys.",
    ),
    AICredential(
        credential_id="cred-finetuning-sa",
        name="Fine-tuning Pipeline Service Account",
        credential_type=CredentialType.FINE_TUNING_SERVICE_ACCOUNT,
        secrets_manager_arn="arn:aws:secretsmanager:us-east-1:123456789:secret:ai/finetune-sa",
        rotation_interval_days=180,
        rotation_constraints=[
            RotationConstraint.LONG_RUNNING_JOBS,
            RotationConstraint.REQUIRES_RESTART,
        ],
        dependent_services=["fine-tuning-pipeline"],
        blast_radius="medium",
        notes="Fine-tuning jobs hold credentials for 2-48 hours. Schedule rotation during job gaps.",
    ),
]
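
Once the catalog exists, a scheduler can ask which entries are due. The following is a minimal sketch, assuming `last_rotated` holds a naive ISO-8601 string and that entries with an empty `secrets_manager_arn` are role-based and have no stored secret to rotate; `credentials_due` is a hypothetical helper, not part of the catalog module.

```python
from __future__ import annotations

from datetime import datetime

# Lower rank sorts first, so the highest blast radius is handled first
_BLAST_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def credentials_due(catalog, now: datetime) -> list:
    """Return catalog entries whose rotation interval has elapsed.

    Each entry is expected to expose secrets_manager_arn, last_rotated,
    rotation_interval_days, and blast_radius attributes.
    """
    due = []
    for cred in catalog:
        if not cred.secrets_manager_arn:
            continue  # assumed role-based entry with no stored secret
        if cred.last_rotated is None:
            due.append(cred)  # never rotated: treat as due
            continue
        age_days = (now - datetime.fromisoformat(cred.last_rotated)).days
        if age_days >= cred.rotation_interval_days:
            due.append(cred)
    # Unknown blast radius values sort after the known ones
    return sorted(
        due, key=lambda c: _BLAST_ORDER.get(c.blast_radius, len(_BLAST_ORDER))
    )
```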

Rotation Strategy Implementation

The Dual-Key Rotation Pattern

For credentials that support multiple active keys simultaneously (OpenAI, most cloud providers), use the dual-key pattern, which keeps the old key valid until every consumer has switched to the new one and so enables zero-downtime rotation.

# secrets_rotation/dual_key_rotator.py
"""Dual-key rotation for AI service credentials with zero downtime."""
 
from __future__ import annotations
 
import json
import logging
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
 
import boto3
 
logger = logging.getLogger(__name__)
 
@dataclass
class RotationResult:
    """Result of a credential rotation operation."""
    credential_id: str
    success: bool
    old_key_revoked: bool
    new_key_validated: bool
    rollback_performed: bool
    duration_seconds: float
    error: Optional[str] = None
    timestamp: str = ""
 
    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.utcnow().isoformat()
 
class CredentialProvider(ABC):
    """Abstract interface for credential providers that support rotation."""
 
    @abstractmethod
    def create_new_key(self) -> dict:
        """Create a new API key/credential. Returns key metadata."""
        ...
 
    @abstractmethod
    def validate_key(self, key: str) -> bool:
        """Validate that a key is functional by making a test API call."""
        ...
 
    @abstractmethod
    def revoke_key(self, key_id: str) -> bool:
        """Revoke an old API key."""
        ...
 
    @abstractmethod
    def list_active_keys(self) -> list[dict]:
        """List all active keys for this credential."""
        ...
 
class OpenAIKeyProvider(CredentialProvider):
    """Manages OpenAI API key rotation."""
 
    def __init__(self, admin_key: str, organization_id: str):
        self.admin_key = admin_key
        self.org_id = organization_id
 
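    def create_new_key(self) -> dict:
        """Create a new OpenAI API key via the admin API.

        The endpoint path and response fields used here are illustrative;
        check them against the current OpenAI Administration API reference.
        """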
        import httpx
 
        response = httpx.post(
            "https://api.openai.com/v1/organization/api_keys",
            headers={
                "Authorization": f"Bearer {self.admin_key}",
                "OpenAI-Organization": self.org_id,
            },
            json={
                "name": f"auto-rotated-{datetime.utcnow().strftime('%Y%m%d-%H%M%S')}",
            },
        )
        response.raise_for_status()
        data = response.json()
        return {"key_id": data["id"], "key_value": data["key"]}
 
    def validate_key(self, key: str) -> bool:
        """Validate an OpenAI key by listing models."""
        import httpx
 
        try:
            response = httpx.get(
                "https://api.openai.com/v1/models",
                headers={"Authorization": f"Bearer {key}"},
                timeout=10,
            )
            return response.status_code == 200
        except Exception:
            return False
 
    def revoke_key(self, key_id: str) -> bool:
        """Revoke an OpenAI API key."""
        import httpx
 
        try:
            response = httpx.delete(
                f"https://api.openai.com/v1/organization/api_keys/{key_id}",
                headers={
                    "Authorization": f"Bearer {self.admin_key}",
                    "OpenAI-Organization": self.org_id,
                },
            )
            return response.status_code in (200, 204)
        except Exception:
            return False
 
    def list_active_keys(self) -> list[dict]:
        """List all active OpenAI API keys."""
        import httpx
 
        response = httpx.get(
            "https://api.openai.com/v1/organization/api_keys",
            headers={
                "Authorization": f"Bearer {self.admin_key}",
                "OpenAI-Organization": self.org_id,
            },
        )
        response.raise_for_status()
        return response.json().get("data", [])
 
class DualKeyRotator:
    """
    Implements zero-downtime dual-key rotation for AI service credentials.
 
    The rotation follows a six-phase process:
    1. READ: Load the current credential from secrets manager
    2. CREATE: Generate a new credential
    3. VALIDATE: Verify the new credential works
    4. UPDATE: Store the new credential in secrets manager
    5. PROPAGATE: Wait for all consumers to pick up the new credential
    6. REVOKE: Delete the old credential
    """
 
    def __init__(
        self,
        secrets_client: boto3.client,
        credential_provider: CredentialProvider,
        propagation_wait_seconds: int = 300,
    ):
        self.secrets_client = secrets_client
        self.provider = credential_provider
        self.propagation_wait = propagation_wait_seconds
 
    def rotate(
        self,
        secret_arn: str,
        credential_id: str,
        dry_run: bool = False,
    ) -> RotationResult:
        """
        Execute the full dual-key rotation process.
 
        Args:
            secret_arn: AWS Secrets Manager ARN for the credential.
            credential_id: Internal identifier for the credential.
            dry_run: If True, validate but do not actually rotate.
 
        Returns:
            RotationResult with success/failure status and details.
        """
        start_time = time.monotonic()
 
        try:
            # Phase 1: READ current secret
            logger.info(f"[{credential_id}] Phase 1: Reading current secret")
            current_secret = self._read_secret(secret_arn)
            old_key_id = current_secret.get("key_id", "")
            old_key_value = current_secret.get("key_value", "")
 
            if dry_run:
                # Dry run: validate the current key only; nothing is created or revoked
                is_valid = self.provider.validate_key(old_key_value)
                elapsed = time.monotonic() - start_time
                return RotationResult(
                    credential_id=credential_id,
                    success=is_valid,
                    old_key_revoked=False,
                    new_key_validated=False,  # no new key exists in a dry run
                    rollback_performed=False,
                    duration_seconds=elapsed,
                    error=None if is_valid else "Current key failed validation",
                )
 
            # Phase 2: CREATE new key
            logger.info(f"[{credential_id}] Phase 2: Creating new key")
            new_key = self.provider.create_new_key()
 
            # Phase 3: VALIDATE new key
            logger.info(f"[{credential_id}] Phase 3: Validating new key")
            if not self.provider.validate_key(new_key["key_value"]):
                # New key doesn't work — abort and revoke it
                logger.error(f"[{credential_id}] New key validation failed, aborting")
                self.provider.revoke_key(new_key["key_id"])
                elapsed = time.monotonic() - start_time
                return RotationResult(
                    credential_id=credential_id,
                    success=False,
                    old_key_revoked=False,
                    new_key_validated=False,
                    rollback_performed=True,
                    duration_seconds=elapsed,
                    error="New key validation failed",
                )
 
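            # Phase 4: UPDATE secrets manager with new key
            logger.info(f"[{credential_id}] Phase 4: Updating secrets manager")
            try:
                self._update_secret(secret_arn, {
                    "key_id": new_key["key_id"],
                    "key_value": new_key["key_value"],
                    "rotated_at": datetime.utcnow().isoformat(),
                    "previous_key_id": old_key_id,
                })
            except Exception as e:
                # The new key is not recorded anywhere, so revoke it rather
                # than leave an untracked active key at the provider
                logger.exception(f"[{credential_id}] Secret update failed, revoking new key")
                rolled_back = self.provider.revoke_key(new_key["key_id"])
                elapsed = time.monotonic() - start_time
                return RotationResult(
                    credential_id=credential_id,
                    success=False,
                    old_key_revoked=False,
                    new_key_validated=True,
                    rollback_performed=rolled_back,
                    duration_seconds=elapsed,
                    error=f"Secret update failed: {e}",
                )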
 
            # Phase 5: PROPAGATE — wait for consumers to pick up new key
            logger.info(
                f"[{credential_id}] Phase 5: Waiting {self.propagation_wait}s for propagation"
            )
            time.sleep(self.propagation_wait)
 
            # Phase 6: REVOKE old key
            logger.info(f"[{credential_id}] Phase 6: Revoking old key")
            old_revoked = False
            if old_key_id:
                old_revoked = self.provider.revoke_key(old_key_id)
                if not old_revoked:
                    logger.warning(
                        f"[{credential_id}] Failed to revoke old key {old_key_id}. "
                        "Manual cleanup required."
                    )
 
            elapsed = time.monotonic() - start_time
            return RotationResult(
                credential_id=credential_id,
                success=True,
                old_key_revoked=old_revoked,
                new_key_validated=True,
                rollback_performed=False,
                duration_seconds=elapsed,
            )
 
        except Exception as e:
            elapsed = time.monotonic() - start_time
            logger.exception(f"[{credential_id}] Rotation failed: {e}")
            return RotationResult(
                credential_id=credential_id,
                success=False,
                old_key_revoked=False,
                new_key_validated=False,
                rollback_performed=False,
                duration_seconds=elapsed,
                error=str(e),
            )
 
    def _read_secret(self, secret_arn: str) -> dict:
        """Read the current secret value from Secrets Manager."""
        response = self.secrets_client.get_secret_value(SecretId=secret_arn)
        return json.loads(response["SecretString"])
 
    def _update_secret(self, secret_arn: str, new_value: dict) -> None:
        """Update the secret value in Secrets Manager."""
        self.secrets_client.put_secret_value(
            SecretId=secret_arn,
            SecretString=json.dumps(new_value),
        )
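
The propagation wait in Phase 5 only works if every consumer re-reads the secret within that time. A minimal sketch of a consumer-side cache follows, assuming the consumer is handed a `fetch` callable that returns the current secret dict; `CachedSecret` and its parameters are hypothetical and not part of the rotator above.

```python
from __future__ import annotations

import time
from typing import Callable, Optional


class CachedSecret:
    """Consumer-side secret cache that re-fetches after a fixed TTL."""

    def __init__(
        self,
        fetch: Callable[[], dict],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._fetched_at = 0.0

    def get(self) -> dict:
        """Return the cached secret, re-fetching once the TTL has elapsed."""
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value

    def invalidate(self) -> None:
        """Drop the cached value, for example after an authentication failure."""
        self._value = None
```

The design constraint is that `ttl_seconds` stays well below the rotator's `propagation_wait_seconds`; a consumer whose cache outlives the wait is the premature-revocation case.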

Handling Long-Running AI Jobs

Fine-tuning jobs, batch evaluation pipelines, and continuous embedding indexers hold credentials for extended periods. Rotating credentials while these jobs are running requires careful coordination.

# secrets_rotation/job_aware_rotator.py
"""Job-aware rotation that coordinates with long-running AI workloads."""
 
from __future__ import annotations
 
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
 
logger = logging.getLogger(__name__)
 
@dataclass
class RunningJob:
    """A long-running AI job that holds credentials."""
    job_id: str
    job_type: str              # fine_tuning, batch_inference, embedding_index
    credential_ids: list[str]  # Which credentials this job uses
    started_at: datetime
    estimated_completion: Optional[datetime]
    can_be_interrupted: bool
    checkpoint_available: bool  # Can the job resume from a checkpoint?
 
class JobAwareRotationScheduler:
    """Schedules credential rotation around long-running AI jobs."""
 
    def __init__(self, max_rotation_delay_hours: int = 48):
        self.max_delay = timedelta(hours=max_rotation_delay_hours)
        self._running_jobs: list[RunningJob] = []
 
    def register_job(self, job: RunningJob):
        """Register a long-running job that holds credentials."""
        self._running_jobs.append(job)
        logger.info(
            f"Registered job {job.job_id} holding credentials: {job.credential_ids}"
        )
 
    def unregister_job(self, job_id: str):
        """Remove a completed job from the tracker."""
        self._running_jobs = [j for j in self._running_jobs if j.job_id != job_id]
 
    def can_rotate_now(self, credential_id: str) -> dict:
        """
        Check if a credential can be safely rotated right now.
 
        Returns a decision dict with rotation guidance.
        """
        blocking_jobs = [
            j for j in self._running_jobs
            if credential_id in j.credential_ids
        ]
 
        if not blocking_jobs:
            return {
                "can_rotate": True,
                "blocking_jobs": [],
                "recommended_action": "Proceed with immediate rotation",
            }
 
        # Check if all blocking jobs can be interrupted safely
        all_interruptible = all(j.can_be_interrupted for j in blocking_jobs)
        all_checkpointed = all(j.checkpoint_available for j in blocking_jobs)
 
        # Rotation must wait for every blocking job, so use the latest
        # estimated completion; a job without an estimate cannot be waited on
        completions = [
            j.estimated_completion for j in blocking_jobs
            if j.estimated_completion
        ]
        all_estimated = len(completions) == len(blocking_jobs)
        latest_completion = max(completions) if all_estimated else None

        now = datetime.utcnow()

        if all_interruptible and all_checkpointed:
            return {
                "can_rotate": True,
                "blocking_jobs": [j.job_id for j in blocking_jobs],
                "recommended_action": (
                    "Jobs are interruptible with checkpoints. "
                    "Interrupt jobs, rotate credentials, resume from checkpoint."
                ),
                "pre_rotation_steps": [
                    f"Checkpoint job {j.job_id}" for j in blocking_jobs
                ],
            }

        if latest_completion and (latest_completion - now) < self.max_delay:
            return {
                "can_rotate": False,
                "blocking_jobs": [j.job_id for j in blocking_jobs],
                "recommended_action": (
                    f"Delay rotation until {latest_completion.isoformat()}. "
                    f"Blocking jobs will complete within the allowed delay window."
                ),
                "retry_after": latest_completion.isoformat(),
            }
 
        # Jobs are running too long — must force rotation for security
        return {
            "can_rotate": True,
            "blocking_jobs": [j.job_id for j in blocking_jobs],
            "recommended_action": (
                "FORCE ROTATION: Blocking jobs exceed maximum rotation delay. "
                "Security policy requires rotation. Accept that jobs will fail."
            ),
            "force_reason": (
                f"Jobs running since "
                f"{min(j.started_at for j in blocking_jobs).isoformat()} "
                f"have no estimated completion within the {self.max_delay} "
                f"maximum delay policy."
            ),
            "post_rotation_steps": [
                f"Restart job {j.job_id} with new credentials"
                for j in blocking_jobs
            ],
        }

Monitoring and Alerting

# secrets_rotation/monitoring.py
"""Monitoring and alerting for AI credential rotation health."""
 
from __future__ import annotations
 
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional
 
@dataclass
class CredentialHealthCheck:
    """Health status of a managed credential."""
    credential_id: str
    name: str
    is_valid: bool
    days_since_rotation: int
    rotation_overdue: bool
    max_age_days: int
    anomalous_usage: bool
    usage_count_24h: int
    usage_count_avg_24h: float
    health_status: str       # healthy, warning, or critical
 
    @property
    def urgency_score(self) -> float:
        """Calculate urgency for rotation prioritization (0-100)."""
        score = 0
        if self.rotation_overdue:
            overdue_days = self.days_since_rotation - self.max_age_days
            score += min(50, overdue_days * 5)
        if self.anomalous_usage:
            score += 30
        if not self.is_valid:
            score = 100  # Invalid credential is maximum urgency
        return min(100, score)
 
class RotationHealthMonitor:
    """Monitors the health of all managed AI credentials."""
 
    def __init__(self, alert_callback=None):
        self.alert_callback = alert_callback
        self._health_history: list[dict] = []
 
    def check_all(
        self,
        credentials: list[dict],
        current_time: Optional[datetime] = None,
    ) -> list[CredentialHealthCheck]:
        """
        Run health checks on all managed credentials.
 
        Args:
            credentials: List of credential metadata dicts with keys:
                credential_id, name, last_rotated, rotation_interval_days,
                is_valid, usage_count_24h, usage_count_avg_24h.
            current_time: Override for testing. Defaults to now.
 
        Returns:
            List of health check results, sorted by urgency.
        """
        now = current_time or datetime.utcnow()
        results = []
 
        for cred in credentials:
            last_rotated = cred.get("last_rotated")
            if isinstance(last_rotated, str):
                last_rotated = datetime.fromisoformat(last_rotated)
 
            days_since = (now - last_rotated).days if last_rotated else 999
            max_age = cred.get("rotation_interval_days", 90)
            overdue = days_since > max_age
 
            # Anomaly detection: usage more than 3x the average
            usage_24h = cred.get("usage_count_24h", 0)
            usage_avg = cred.get("usage_count_avg_24h", 1)
            anomalous = usage_24h > (usage_avg * 3) and usage_24h > 100
 
            # Determine overall health
            is_valid = cred.get("is_valid", True)
            if not is_valid:
                status = "critical"
            elif overdue and days_since > max_age * 1.5:
                status = "critical"
            elif overdue or anomalous:
                status = "warning"
            else:
                status = "healthy"
 
            check = CredentialHealthCheck(
                credential_id=cred["credential_id"],
                name=cred["name"],
                is_valid=is_valid,
                days_since_rotation=days_since,
                rotation_overdue=overdue,
                max_age_days=max_age,
                anomalous_usage=anomalous,
                usage_count_24h=usage_24h,
                usage_count_avg_24h=usage_avg,
                health_status=status,
            )
            results.append(check)
 
            # Fire alerts for non-healthy credentials
            if status in ("warning", "critical") and self.alert_callback:
                self.alert_callback(check)
 
        # Sort by urgency (most urgent first)
        results.sort(key=lambda c: c.urgency_score, reverse=True)
        return results
 
    def generate_dashboard_data(
        self, checks: list[CredentialHealthCheck]
    ) -> dict:
        """Generate dashboard-ready summary data."""
        total = len(checks)
        healthy = sum(1 for c in checks if c.health_status == "healthy")
        warning = sum(1 for c in checks if c.health_status == "warning")
        critical = sum(1 for c in checks if c.health_status == "critical")
 
        return {
            "total_credentials": total,
            "healthy": healthy,
            "warning": warning,
            "critical": critical,
            "health_rate": healthy / total if total > 0 else 0,
            "overdue_rotations": sum(1 for c in checks if c.rotation_overdue),
            "anomalous_usage": sum(1 for c in checks if c.anomalous_usage),
            "top_urgency": [
                {
                    "credential_id": c.credential_id,
                    "name": c.name,
                    "status": c.health_status,
                    "urgency_score": c.urgency_score,
                    "days_since_rotation": c.days_since_rotation,
                }
                for c in checks[:5]  # Top 5 most urgent
            ],
        }

Rotation Schedules by Credential Type

Credential Type	Recommended Interval	Rotation Pattern	Special Considerations
Model provider API keys (OpenAI, Anthropic)	90 days	Dual-key	Supports multiple active keys; rotate without downtime
Cloud AI service keys (Bedrock, Azure OpenAI)	Use IAM roles instead	Role-based	Prefer temporary credentials via STS/managed identities
Vector database tokens (Pinecone, Weaviate)	90 days	Blue-green	May require client restart; schedule during maintenance
Fine-tuning service accounts	180 days	Job-aware	Schedule between fine-tuning runs
MCP server tokens	30 days	Dual-key	Short-lived tokens preferred; rotate frequently
Monitoring API keys	180 days	Simple swap	Low blast radius; can tolerate brief monitoring gaps
OAuth client secrets	365 days	Dual-client	Register new client, migrate, decommission old client

AI-Specific Rotation Challenges

Credential Caching in Model Serving Containers

Multi-Region Deployments

For critical AI credentials, implement a multi-region rotation workflow:

Create the new credential in the primary region
Wait for cross-region replication to complete (verify with a read from each secondary region)
Wait for application containers in all regions to refresh
Revoke the old credential

The total propagation window for multi-region deployments is typically 10-15 minutes, compared to 5 minutes for single-region deployments.
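
The replication check in step 2 can be sketched as a polling loop over per-region readers. This is a sketch under assumptions: each reader is a callable returning the key ID visible in that region (in AWS it could wrap `get_secret_value` on a regional Secrets Manager client), and the function names, the 15-minute default timeout, and the 15-second poll interval are illustrative.

```python
from __future__ import annotations

import time
from typing import Callable


def lagging_regions(
    readers: dict[str, Callable[[], str]],
    expected_key_id: str,
) -> list[str]:
    """Return regions whose stored key ID does not yet match the new key."""
    lagging = []
    for region, read_key_id in readers.items():
        try:
            if read_key_id() != expected_key_id:
                lagging.append(region)
        except Exception:
            # An unreadable region counts as not replicated
            lagging.append(region)
    return sorted(lagging)


def wait_for_replication(
    readers: dict[str, Callable[[], str]],
    expected_key_id: str,
    timeout_seconds: float = 900.0,
    poll_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> list[str]:
    """Poll until every region shows the new key or the timeout passes.

    Returns the regions still lagging; an empty list means replication is done.
    """
    deadline = clock() + timeout_seconds
    while True:
        lagging = lagging_regions(readers, expected_key_id)
        if not lagging or clock() >= deadline:
            return lagging
        sleep(poll_seconds)
```

A non-empty return value should block revocation of the old credential, since at least one region would still be serving it.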

Agent and MCP Server Credentials

For MCP server credentials, implement a graceful rotation pattern:

Create a new MCP server instance with the new credentials
Configure the agent to connect to both old and new instances
Drain in-progress requests from the old instance
Remove the old instance from the agent's server configuration
Revoke the old credentials and decommission the old instance

This blue-green pattern for MCP servers avoids interrupting agent workflows during rotation.
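
The drain and decommission steps can be sketched with a small routing model. The names below (`McpInstance`, `route_new_request`, `ready_to_decommission`) are hypothetical and do not correspond to any MCP SDK; the sketch only illustrates the ordering constraint that an old instance stops receiving new requests before its token is revoked.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class McpInstance:
    """One MCP server instance as seen by the agent's routing layer."""
    name: str
    token_id: str
    in_flight: int = 0       # requests currently being served
    draining: bool = False   # True once no new requests should be routed here


def route_new_request(instances: list[McpInstance]) -> McpInstance:
    """Send new requests only to instances that are not draining."""
    candidates = [i for i in instances if not i.draining]
    if not candidates:
        raise RuntimeError("no MCP instance is accepting requests")
    # Prefer the least-loaded instance among those still accepting work
    target = min(candidates, key=lambda i: i.in_flight)
    target.in_flight += 1
    return target


def ready_to_decommission(instance: McpInstance) -> bool:
    """An instance can be removed once it is draining and idle."""
    return instance.draining and instance.in_flight == 0
```

Only after `ready_to_decommission` returns true for the old instance should its credentials be revoked.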
