Secrets Rotation for Cloud AI Deployments
Implementing automated secrets rotation strategies for API keys, model endpoint credentials, and service accounts used in cloud AI/LLM deployments across AWS, Azure, and GCP.
Overview
Cloud AI deployments present unique secrets management challenges that traditional rotation strategies do not adequately address. A standard web application might have a database password and an API key to rotate. An enterprise LLM application can have dozens of credential types: model provider API keys (OpenAI, Anthropic, Cohere), cloud AI service credentials (Bedrock, Azure OpenAI, Vertex AI), vector database authentication tokens, embedding service keys, guardrail service credentials, monitoring API tokens, and service accounts for fine-tuning pipelines. Each credential type has different rotation constraints, blast radius considerations, and downtime implications.
The problem is compounded by AI-specific operational patterns. Fine-tuning jobs run for hours or days and hold credentials for the entire duration. RAG pipelines maintain persistent connections to vector databases. Agent orchestration frameworks cache credentials across multiple tool calls. Model serving endpoints may cache authentication tokens in warm inference containers. A naive rotation strategy that simply swaps credentials on a schedule will cause cascading failures across these interdependent services.
This article presents a secrets rotation framework designed specifically for cloud AI deployments. The framework handles the full lifecycle: credential inventory, rotation scheduling, zero-downtime swap procedures, validation, rollback, and monitoring. The examples use AWS Secrets Manager, but the same pattern can be adapted to the secrets managers of other cloud providers.
AI-Specific Credential Taxonomy
Before implementing rotation, you must inventory every credential type in your AI deployment and understand its rotation constraints.
# secrets_rotation/taxonomy.py
"""Credential taxonomy and rotation constraints for AI deployments."""
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class CredentialType(Enum):
    """Types of credentials used in cloud AI deployments."""
    MODEL_PROVIDER_API_KEY = "model_provider_api_key"
    CLOUD_AI_SERVICE_KEY = "cloud_ai_service_key"
    CLOUD_SERVICE_ACCOUNT = "cloud_service_account"
    VECTOR_DB_TOKEN = "vector_db_token"
    EMBEDDING_SERVICE_KEY = "embedding_service_key"
    GUARDRAIL_SERVICE_KEY = "guardrail_service_key"
    MONITORING_API_KEY = "monitoring_api_key"
    OAUTH_CLIENT_SECRET = "oauth_client_secret"
    FINE_TUNING_SERVICE_ACCOUNT = "fine_tuning_service_account"
    MCP_SERVER_TOKEN = "mcp_server_token"


class RotationConstraint(Enum):
    """Constraints that affect how a credential can be rotated."""
    SUPPORTS_DUAL_KEY = "supports_dual_key"  # Provider supports two active keys simultaneously
    IMMEDIATE_REVOCATION = "immediate_revocation"  # Old key stops working immediately after rotation
    GRACE_PERIOD = "grace_period"  # Old key works for a configurable period after rotation
    LONG_RUNNING_JOBS = "long_running_jobs"  # Credential may be held by long-running processes
    CACHED_IN_CONTAINERS = "cached_in_containers"  # Credential may be cached in warm serving containers
    MULTI_SERVICE_DEPENDENCY = "multi_service"  # Multiple services use this credential
    REQUIRES_RESTART = "requires_restart"  # Service must restart to pick up new credential


@dataclass
class AICredential:
    """A credential used in an AI deployment with rotation metadata."""
    credential_id: str
    name: str
    credential_type: CredentialType
    secrets_manager_arn: str  # Where the credential is stored
    rotation_interval_days: int  # How often to rotate
    last_rotated: Optional[str] = None
    rotation_constraints: list[RotationConstraint] = field(default_factory=list)
    dependent_services: list[str] = field(default_factory=list)
    blast_radius: str = "low"  # low, medium, high, critical
    rotation_tested: bool = False
    notes: str = ""

# Credential catalog for a typical enterprise AI deployment
CREDENTIAL_CATALOG: list[AICredential] = [
    AICredential(
        credential_id="cred-openai-api-key",
        name="OpenAI API Key (Production)",
        credential_type=CredentialType.MODEL_PROVIDER_API_KEY,
        secrets_manager_arn="arn:aws:secretsmanager:us-east-1:123456789:secret:ai/openai-api-key",
        rotation_interval_days=90,
        rotation_constraints=[
            RotationConstraint.SUPPORTS_DUAL_KEY,  # OpenAI allows multiple active keys
            RotationConstraint.CACHED_IN_CONTAINERS,
            RotationConstraint.MULTI_SERVICE_DEPENDENCY,
        ],
        dependent_services=["chatbot-api", "rag-pipeline", "eval-service"],
        blast_radius="critical",
        notes="OpenAI allows creating new keys without revoking old ones. Use dual-key rotation.",
    ),
    AICredential(
        credential_id="cred-pinecone-token",
        name="Pinecone Vector DB API Key",
        credential_type=CredentialType.VECTOR_DB_TOKEN,
        secrets_manager_arn="arn:aws:secretsmanager:us-east-1:123456789:secret:ai/pinecone-key",
        rotation_interval_days=90,
        rotation_constraints=[
            RotationConstraint.IMMEDIATE_REVOCATION,
            RotationConstraint.CACHED_IN_CONTAINERS,
        ],
        dependent_services=["rag-pipeline", "embedding-indexer"],
        blast_radius="high",
        notes="Pinecone key rotation invalidates the old key immediately. Requires careful coordination.",
    ),
    AICredential(
        credential_id="cred-bedrock-service-role",
        name="AWS Bedrock Service Role",
        credential_type=CredentialType.CLOUD_SERVICE_ACCOUNT,
        secrets_manager_arn="",  # IAM role: no secret stored
        rotation_interval_days=365,
        rotation_constraints=[
            RotationConstraint.LONG_RUNNING_JOBS,
        ],
        dependent_services=["bedrock-inference", "bedrock-fine-tuning"],
        blast_radius="critical",
        notes="IAM role uses temporary credentials via STS. Rotate access key if using static keys.",
    ),
    AICredential(
        credential_id="cred-finetuning-sa",
        name="Fine-tuning Pipeline Service Account",
        credential_type=CredentialType.FINE_TUNING_SERVICE_ACCOUNT,
        secrets_manager_arn="arn:aws:secretsmanager:us-east-1:123456789:secret:ai/finetune-sa",
        rotation_interval_days=180,
        rotation_constraints=[
            RotationConstraint.LONG_RUNNING_JOBS,
            RotationConstraint.REQUIRES_RESTART,
        ],
        dependent_services=["fine-tuning-pipeline"],
        blast_radius="medium",
        notes="Fine-tuning jobs hold credentials for 2-48 hours. Schedule rotation during job gaps.",
    ),
]

Rotation Strategy Implementation
The Dual-Key Rotation Pattern
For credentials that support multiple active keys simultaneously (OpenAI, most cloud providers), use the dual-key pattern that enables zero-downtime rotation.
# secrets_rotation/dual_key_rotator.py
"""Dual-key rotation for AI service credentials with zero downtime."""
from __future__ import annotations

import json
import logging
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

import boto3

logger = logging.getLogger(__name__)


@dataclass
class RotationResult:
    """Result of a credential rotation operation."""
    credential_id: str
    success: bool
    old_key_revoked: bool
    new_key_validated: bool
    rollback_performed: bool
    duration_seconds: float
    error: Optional[str] = None
    timestamp: str = ""

    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.utcnow().isoformat()


class CredentialProvider(ABC):
    """Abstract interface for credential providers that support rotation."""

    @abstractmethod
    def create_new_key(self) -> dict:
        """Create a new API key/credential. Returns key metadata."""
        ...

    @abstractmethod
    def validate_key(self, key: str) -> bool:
        """Validate that a key is functional by making a test API call."""
        ...

    @abstractmethod
    def revoke_key(self, key_id: str) -> bool:
        """Revoke an old API key."""
        ...

    @abstractmethod
    def list_active_keys(self) -> list[dict]:
        """List all active keys for this credential."""
        ...

class OpenAIKeyProvider(CredentialProvider):
    """Manages OpenAI API key rotation.

    The endpoint paths below are the ones used in this article. Confirm them
    against the current OpenAI Admin API reference before relying on them.
    """

    def __init__(self, admin_key: str, organization_id: str):
        self.admin_key = admin_key
        self.org_id = organization_id

    def create_new_key(self) -> dict:
        """Create a new OpenAI API key via the admin API."""
        import httpx
        response = httpx.post(
            "https://api.openai.com/v1/organization/api_keys",
            headers={
                "Authorization": f"Bearer {self.admin_key}",
                "OpenAI-Organization": self.org_id,
            },
            json={
                "name": f"auto-rotated-{datetime.utcnow().strftime('%Y%m%d-%H%M%S')}",
            },
        )
        response.raise_for_status()
        data = response.json()
        return {"key_id": data["id"], "key_value": data["key"]}

    def validate_key(self, key: str) -> bool:
        """Validate an OpenAI key by listing models."""
        import httpx
        try:
            response = httpx.get(
                "https://api.openai.com/v1/models",
                headers={"Authorization": f"Bearer {key}"},
                timeout=10,
            )
            return response.status_code == 200
        except Exception:
            # Network errors are treated as "not validated" rather than raised
            return False

    def revoke_key(self, key_id: str) -> bool:
        """Revoke an OpenAI API key."""
        import httpx
        try:
            response = httpx.delete(
                f"https://api.openai.com/v1/organization/api_keys/{key_id}",
                headers={
                    "Authorization": f"Bearer {self.admin_key}",
                    "OpenAI-Organization": self.org_id,
                },
            )
            return response.status_code in (200, 204)
        except Exception:
            return False

    def list_active_keys(self) -> list[dict]:
        """List all active OpenAI API keys."""
        import httpx
        response = httpx.get(
            "https://api.openai.com/v1/organization/api_keys",
            headers={
                "Authorization": f"Bearer {self.admin_key}",
                "OpenAI-Organization": self.org_id,
            },
        )
        response.raise_for_status()
        return response.json().get("data", [])

class DualKeyRotator:
    """
    Implements zero-downtime dual-key rotation for AI service credentials.

    The rotation follows a six-phase process:
    1. READ: Load the current credential from secrets manager
    2. CREATE: Generate a new credential
    3. VALIDATE: Verify the new credential works
    4. UPDATE: Store the new credential in secrets manager
    5. PROPAGATE: Wait for all consumers to pick up the new credential
    6. REVOKE: Delete the old credential
    """

    def __init__(
        self,
        secrets_client: boto3.client,
        credential_provider: CredentialProvider,
        propagation_wait_seconds: int = 300,
    ):
        self.secrets_client = secrets_client
        self.provider = credential_provider
        self.propagation_wait = propagation_wait_seconds

    def rotate(
        self,
        secret_arn: str,
        credential_id: str,
        dry_run: bool = False,
    ) -> RotationResult:
        """
        Execute the full dual-key rotation process.

        Args:
            secret_arn: AWS Secrets Manager ARN for the credential.
            credential_id: Internal identifier for the credential.
            dry_run: If True, validate but do not actually rotate.

        Returns:
            RotationResult with success/failure status and details.
        """
        start_time = time.monotonic()
        new_key: Optional[dict] = None  # Set once the provider has issued a key
        new_key_stored = False  # Set once the key is written to Secrets Manager
        try:
            # Phase 1: READ current secret
            logger.info(f"[{credential_id}] Phase 1: Reading current secret")
            current_secret = self._read_secret(secret_arn)
            old_key_id = current_secret.get("key_id", "")
            old_key_value = current_secret.get("key_value", "")

            if dry_run:
                # In a dry run, only check that the current key still works.
                # new_key_validated reports the validity of the current key here.
                is_valid = self.provider.validate_key(old_key_value)
                elapsed = time.monotonic() - start_time
                return RotationResult(
                    credential_id=credential_id,
                    success=is_valid,
                    old_key_revoked=False,
                    new_key_validated=is_valid,
                    rollback_performed=False,
                    duration_seconds=elapsed,
                    error=None if is_valid else "Current key failed validation",
                )

            # Phase 2: CREATE new key
            logger.info(f"[{credential_id}] Phase 2: Creating new key")
            new_key = self.provider.create_new_key()

            # Phase 3: VALIDATE new key
            logger.info(f"[{credential_id}] Phase 3: Validating new key")
            if not self.provider.validate_key(new_key["key_value"]):
                # New key doesn't work: abort and revoke it
                logger.error(f"[{credential_id}] New key validation failed, aborting")
                self.provider.revoke_key(new_key["key_id"])
                elapsed = time.monotonic() - start_time
                return RotationResult(
                    credential_id=credential_id,
                    success=False,
                    old_key_revoked=False,
                    new_key_validated=False,
                    rollback_performed=True,
                    duration_seconds=elapsed,
                    error="New key validation failed",
                )

            # Phase 4: UPDATE secrets manager with new key
            logger.info(f"[{credential_id}] Phase 4: Updating secrets manager")
            self._update_secret(secret_arn, {
                "key_id": new_key["key_id"],
                "key_value": new_key["key_value"],
                "rotated_at": datetime.utcnow().isoformat(),
                "previous_key_id": old_key_id,
            })
            new_key_stored = True

            # Phase 5: PROPAGATE: wait for consumers to pick up new key
            logger.info(
                f"[{credential_id}] Phase 5: Waiting {self.propagation_wait}s for propagation"
            )
            time.sleep(self.propagation_wait)

            # Phase 6: REVOKE old key
            logger.info(f"[{credential_id}] Phase 6: Revoking old key")
            old_revoked = False
            if old_key_id:
                old_revoked = self.provider.revoke_key(old_key_id)
                if not old_revoked:
                    logger.warning(
                        f"[{credential_id}] Failed to revoke old key {old_key_id}. "
                        "Manual cleanup required."
                    )

            elapsed = time.monotonic() - start_time
            return RotationResult(
                credential_id=credential_id,
                success=True,
                old_key_revoked=old_revoked,
                new_key_validated=True,
                rollback_performed=False,
                duration_seconds=elapsed,
            )
        except Exception as e:
            elapsed = time.monotonic() - start_time
            logger.exception(f"[{credential_id}] Rotation failed: {e}")
            # If a new key was issued but never stored, revoke it so it is
            # not left orphaned at the provider.
            rolled_back = False
            if new_key is not None and not new_key_stored:
                try:
                    rolled_back = self.provider.revoke_key(new_key["key_id"])
                except Exception:
                    logger.exception(f"[{credential_id}] Could not revoke unstored new key")
            return RotationResult(
                credential_id=credential_id,
                success=False,
                old_key_revoked=False,
                new_key_validated=False,
                rollback_performed=rolled_back,
                duration_seconds=elapsed,
                error=str(e),
            )

    def _read_secret(self, secret_arn: str) -> dict:
        """Read the current secret value from Secrets Manager."""
        response = self.secrets_client.get_secret_value(SecretId=secret_arn)
        return json.loads(response["SecretString"])

    def _update_secret(self, secret_arn: str, new_value: dict) -> None:
        """Update the secret value in Secrets Manager."""
        self.secrets_client.put_secret_value(
            SecretId=secret_arn,
            SecretString=json.dumps(new_value),
        )

Handling Long-Running AI Jobs
Fine-tuning jobs, batch evaluation pipelines, and continuous embedding indexers hold credentials for extended periods. Rotating credentials while these jobs are running requires careful coordination.
# secrets_rotation/job_aware_rotator.py
"""Job-aware rotation that coordinates with long-running AI workloads."""
from __future__ import annotations

import logging
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class RunningJob:
    """A long-running AI job that holds credentials."""
    job_id: str
    job_type: str  # fine_tuning, batch_inference, embedding_index
    credential_ids: list[str]  # Which credentials this job uses
    started_at: datetime
    estimated_completion: Optional[datetime]
    can_be_interrupted: bool
    checkpoint_available: bool  # Can the job resume from a checkpoint?


class JobAwareRotationScheduler:
    """Schedules credential rotation around long-running AI jobs."""

    def __init__(self, max_rotation_delay_hours: int = 48):
        self.max_delay = timedelta(hours=max_rotation_delay_hours)
        self._running_jobs: list[RunningJob] = []

    def register_job(self, job: RunningJob):
        """Register a long-running job that holds credentials."""
        self._running_jobs.append(job)
        logger.info(
            f"Registered job {job.job_id} holding credentials: {job.credential_ids}"
        )

    def unregister_job(self, job_id: str):
        """Remove a completed job from the tracker."""
        self._running_jobs = [j for j in self._running_jobs if j.job_id != job_id]

    def can_rotate_now(self, credential_id: str) -> dict:
        """
        Check if a credential can be safely rotated right now.

        Returns a decision dict with rotation guidance.
        """
        blocking_jobs = [
            j for j in self._running_jobs
            if credential_id in j.credential_ids
        ]
        if not blocking_jobs:
            return {
                "can_rotate": True,
                "blocking_jobs": [],
                "recommended_action": "Proceed with immediate rotation",
            }

        # Check if all blocking jobs can be interrupted safely
        all_interruptible = all(j.can_be_interrupted for j in blocking_jobs)
        all_checkpointed = all(j.checkpoint_available for j in blocking_jobs)

        # Waiting only helps if every blocking job has a known completion time;
        # rotation is safe once the last of them has finished.
        completions = [j.estimated_completion for j in blocking_jobs]
        latest_completion = max(completions) if all(completions) else None
        now = datetime.utcnow()

        if all_interruptible and all_checkpointed:
            return {
                "can_rotate": True,
                "blocking_jobs": [j.job_id for j in blocking_jobs],
                "recommended_action": (
                    "Jobs are interruptible with checkpoints. "
                    "Interrupt jobs, rotate credentials, resume from checkpoint."
                ),
                "pre_rotation_steps": [
                    f"Checkpoint job {j.job_id}" for j in blocking_jobs
                ],
            }

        if latest_completion and (latest_completion - now) < self.max_delay:
            return {
                "can_rotate": False,
                "blocking_jobs": [j.job_id for j in blocking_jobs],
                "recommended_action": (
                    f"Delay rotation until {latest_completion.isoformat()}. "
                    f"Blocking jobs will complete within the allowed delay window."
                ),
                "retry_after": latest_completion.isoformat(),
            }

        # Jobs will not finish within the allowed delay: force rotation for security
        return {
            "can_rotate": True,
            "blocking_jobs": [j.job_id for j in blocking_jobs],
            "recommended_action": (
                "FORCE ROTATION: Blocking jobs exceed maximum rotation delay. "
                "Security policy requires rotation. Accept that jobs will fail."
            ),
            "force_reason": (
                f"Blocking jobs (running since "
                f"{min(j.started_at for j in blocking_jobs).isoformat()}) "
                f"are not expected to finish within the {self.max_delay} maximum delay."
            ),
            "post_rotation_steps": [
                f"Restart job {j.job_id} with new credentials"
                for j in blocking_jobs
            ],
        }

Monitoring and Alerting
# secrets_rotation/monitoring.py
"""Monitoring and alerting for AI credential rotation health."""
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CredentialHealthCheck:
    """Health status of a managed credential."""
    credential_id: str
    name: str
    is_valid: bool
    days_since_rotation: int
    rotation_overdue: bool
    max_age_days: int
    anomalous_usage: bool
    usage_count_24h: int
    usage_count_avg_24h: float
    health_status: str  # healthy, warning, critical, expired

    @property
    def urgency_score(self) -> float:
        """Calculate urgency for rotation prioritization (0-100)."""
        score = 0
        if self.rotation_overdue:
            overdue_days = self.days_since_rotation - self.max_age_days
            score += min(50, overdue_days * 5)
        if self.anomalous_usage:
            score += 30
        if not self.is_valid:
            score = 100  # Invalid credential is maximum urgency
        return min(100, score)


class RotationHealthMonitor:
    """Monitors the health of all managed AI credentials."""

    def __init__(self, alert_callback=None):
        self.alert_callback = alert_callback
        self._health_history: list[dict] = []

    def check_all(
        self,
        credentials: list[dict],
        current_time: Optional[datetime] = None,
    ) -> list[CredentialHealthCheck]:
        """
        Run health checks on all managed credentials.

        Args:
            credentials: List of credential metadata dicts with keys:
                credential_id, name, last_rotated, rotation_interval_days,
                is_valid, usage_count_24h, usage_count_avg_24h.
            current_time: Override for testing. Defaults to now.

        Returns:
            List of health check results, sorted by urgency.
        """
        now = current_time or datetime.utcnow()
        results = []
        for cred in credentials:
            last_rotated = cred.get("last_rotated")
            if isinstance(last_rotated, str):
                last_rotated = datetime.fromisoformat(last_rotated)
            days_since = (now - last_rotated).days if last_rotated else 999
            max_age = cred.get("rotation_interval_days", 90)
            overdue = days_since > max_age

            # Anomaly detection: usage more than 3x the average
            usage_24h = cred.get("usage_count_24h", 0)
            usage_avg = cred.get("usage_count_avg_24h", 1)
            anomalous = usage_24h > (usage_avg * 3) and usage_24h > 100

            # Determine overall health
            is_valid = cred.get("is_valid", True)
            if not is_valid:
                status = "critical"
            elif overdue and days_since > max_age * 1.5:
                status = "critical"
            elif overdue or anomalous:
                status = "warning"
            else:
                status = "healthy"

            check = CredentialHealthCheck(
                credential_id=cred["credential_id"],
                name=cred["name"],
                is_valid=is_valid,
                days_since_rotation=days_since,
                rotation_overdue=overdue,
                max_age_days=max_age,
                anomalous_usage=anomalous,
                usage_count_24h=usage_24h,
                usage_count_avg_24h=usage_avg,
                health_status=status,
            )
            results.append(check)

            # Fire alerts for non-healthy credentials
            if status in ("warning", "critical") and self.alert_callback:
                self.alert_callback(check)

        # Sort by urgency (most urgent first)
        results.sort(key=lambda c: c.urgency_score, reverse=True)
        return results

    def generate_dashboard_data(
        self, checks: list[CredentialHealthCheck]
    ) -> dict:
        """Generate dashboard-ready summary data."""
        total = len(checks)
        healthy = sum(1 for c in checks if c.health_status == "healthy")
        warning = sum(1 for c in checks if c.health_status == "warning")
        critical = sum(1 for c in checks if c.health_status == "critical")
        return {
            "total_credentials": total,
            "healthy": healthy,
            "warning": warning,
            "critical": critical,
            "health_rate": healthy / total if total > 0 else 0,
            "overdue_rotations": sum(1 for c in checks if c.rotation_overdue),
            "anomalous_usage": sum(1 for c in checks if c.anomalous_usage),
            "top_urgency": [
                {
                    "credential_id": c.credential_id,
                    "name": c.name,
                    "status": c.health_status,
                    "urgency_score": c.urgency_score,
                    "days_since_rotation": c.days_since_rotation,
                }
                for c in checks[:5]  # Top 5 most urgent
            ],
        }

Rotation Schedules by Credential Type
| Credential Type | Recommended Interval | Rotation Pattern | Special Considerations |
|---|---|---|---|
| Model provider API keys (OpenAI, Anthropic) | 90 days | Dual-key | Supports multiple active keys; rotate without downtime |
| Cloud AI service keys (Bedrock, Azure OpenAI) | Use IAM roles instead | Role-based | Prefer temporary credentials via STS/managed identities |
| Vector database tokens (Pinecone, Weaviate) | 90 days | Blue-green | May require client restart; schedule during maintenance |
| Fine-tuning service accounts | 180 days | Job-aware | Schedule between fine-tuning runs |
| MCP server tokens | 30 days | Dual-key | Short-lived tokens preferred; rotate frequently |
| Monitoring API keys | 180 days | Simple swap | Low blast radius; can tolerate brief monitoring gaps |
| OAuth client secrets | 365 days | Dual-client | Register new client, migrate, decommission old client |
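As a rough illustration of how the intervals in the table could feed a scheduler, the sketch below computes when each credential type next falls due. The `INTERVAL_DAYS` mapping and the `next_rotation_due` and `is_overdue` helpers are hypothetical names introduced here; only the day counts come from the table.

```python
from datetime import datetime, timedelta

# Day counts copied from the table above. The mapping and helper names
# are illustrative and not part of the modules shown earlier.
INTERVAL_DAYS = {
    "model_provider_api_key": 90,
    "vector_db_token": 90,
    "fine_tuning_service_account": 180,
    "mcp_server_token": 30,
    "monitoring_api_key": 180,
    "oauth_client_secret": 365,
}


def next_rotation_due(credential_type: str, last_rotated: datetime) -> datetime:
    """Return the time by which the credential should next be rotated."""
    return last_rotated + timedelta(days=INTERVAL_DAYS[credential_type])


def is_overdue(credential_type: str, last_rotated: datetime, now: datetime) -> bool:
    """True once the recommended interval has fully elapsed."""
    return now > next_rotation_due(credential_type, last_rotated)
```

Cloud AI service keys are left out of the mapping on purpose, since the table recommends role-based temporary credentials for them rather than a fixed interval.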
AI-Specific Rotation Challenges
Credential Caching in Model Serving Containers
Cloud AI inference containers — whether ECS tasks, Kubernetes pods, or serverless functions — often cache credentials at startup for performance reasons. When a credential is rotated in the secrets manager, the running container continues using the old credential until it is restarted or explicitly refreshed. This creates a window where the old credential must remain valid.
To handle this correctly, implement a credential refresh mechanism in your inference layer that periodically checks the secrets manager for updated values. Set the refresh interval shorter than the propagation wait in your rotation procedure. For example, if your rotation procedure waits 5 minutes for propagation, configure containers to refresh credentials every 3 minutes. This ensures all containers pick up the new credential before the old one is revoked.
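A minimal sketch of such a refresh mechanism follows. `RefreshingSecret` is a hypothetical helper, not a library class: it assumes the caller supplies a `fetch` function (for example, a thin wrapper around a secrets manager read) and uses a 180-second TTL to match the 3-minute figure above.

```python
import time
from typing import Callable, Optional


class RefreshingSecret:
    """Caches a secret in process and re-fetches it once the TTL has elapsed."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 180.0,  # shorter than the 300 s propagation wait
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached value, refreshing it first if it is stale."""
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

Request handlers would call `get()` on every use instead of holding the key in a module-level variable, so a rotated value is picked up within one TTL without restarting the container.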
For serverless AI inference (AWS Lambda, Azure Functions), a cold start re-runs initialization code, so a function that reads its secrets at startup picks up the new credential. Warm instances, however, may keep cached credentials for the lifetime of the execution environment, which can persist for hours. Force a refresh by putting an in-process TTL on cached credentials.
Multi-Region Deployments
Organizations that deploy AI services across multiple cloud regions face additional complexity. A credential rotated in the primary region must propagate to all secondary regions before the old credential is revoked. Where the secrets manager supports cross-region replication (for example, AWS Secrets Manager multi-Region secrets or Google Cloud Secret Manager replication policies), it handles the storage propagation, but the application-level refresh window must still account for cross-region replication latency.
For critical AI credentials, implement a multi-region rotation workflow:
- Create the new credential in the primary region
- Wait for cross-region replication to complete (verify with a read from each secondary region)
- Wait for application containers in all regions to refresh
- Revoke the old credential
The total propagation window for multi-region deployments is typically 10-15 minutes, compared to 5 minutes for single-region deployments.
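The replication check in the second step of the workflow could look roughly like this. `wait_for_replication` is a hypothetical helper: it assumes each region is represented by a callable that returns the `key_id` currently stored there, and the timeout and poll interval are placeholder values.

```python
import time
from typing import Callable, Mapping


def wait_for_replication(
    region_readers: Mapping[str, Callable[[], str]],
    expected_key_id: str,
    timeout_seconds: float = 600.0,
    poll_seconds: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Poll every region until it serves expected_key_id.

    Returns the regions still lagging when the timeout expires;
    an empty list means replication completed everywhere.
    """
    deadline = clock() + timeout_seconds
    pending = set(region_readers)
    while pending:
        # Keep only the regions that still return a different key id
        pending = {r for r in pending if region_readers[r]() != expected_key_id}
        if not pending or clock() >= deadline:
            break
        sleep(poll_seconds)
    return sorted(pending)
```

A non-empty result should block revocation of the old credential, because at least one region would still be handing out the old key.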
Agent and MCP Server Credentials
AI agent architectures that use the Model Context Protocol (MCP) or similar tool-use frameworks introduce credential chains. The agent itself has a credential to access the LLM, each MCP server has its own authentication token, and each MCP server may hold credentials for downstream services (databases, APIs, file systems). Rotating the agent's LLM credential is straightforward, but rotating an MCP server token may require restarting the agent's connection to that server, which interrupts any in-progress tool calls.
For MCP server credentials, implement a graceful rotation pattern:
- Create a new MCP server instance with the new credentials
- Configure the agent to connect to both old and new instances
- Drain in-progress requests from the old instance
- Remove the old instance from the agent's server configuration
- Revoke the old credentials and decommission the old instance
This blue-green pattern for MCP servers avoids interrupting agent workflows during rotation.
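The drain-then-retire logic can be sketched with a small in-memory model. `McpInstance` and `AgentServerConfig` are hypothetical types invented for this illustration and do not correspond to any MCP SDK API; a real agent would track in-flight tool calls through its own client layer.

```python
from dataclasses import dataclass, field


@dataclass
class McpInstance:
    name: str
    in_flight: int = 0      # tool calls currently running on this instance
    draining: bool = False  # True once no new calls should be routed here


@dataclass
class AgentServerConfig:
    instances: list[McpInstance] = field(default_factory=list)

    def pick(self) -> McpInstance:
        """Route a new tool call to the first instance that is not draining."""
        for inst in self.instances:
            if not inst.draining:
                inst.in_flight += 1
                return inst
        raise RuntimeError("no MCP instance is accepting requests")

    def finish(self, inst: McpInstance) -> None:
        """Record that a tool call on inst has completed."""
        inst.in_flight -= 1

    def retire_if_drained(self, name: str) -> bool:
        """Remove a draining instance once its in-flight calls reach zero."""
        for inst in self.instances:
            if inst.name == name and inst.draining and inst.in_flight == 0:
                self.instances.remove(inst)
                return True
        return False
```

Only after `retire_if_drained` returns `True` for the old instance would the old credentials be revoked.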
Emergency Rotation Procedures
When a credential is suspected of being compromised, the standard rotation procedure is too slow. Emergency rotation must prioritize revocation speed over zero-downtime guarantees. The emergency rotation procedure:
- Immediately create a new credential
- Immediately revoke the compromised credential (accept downtime)
- Update the secrets manager with the new credential
- Force-restart all dependent services to pick up the new credential
- Monitor for failures and address them reactively
This procedure prioritizes security over availability. The acceptable downtime window for emergency rotation should be documented in advance and agreed upon with application teams. For most AI applications, a 2-5 minute disruption during an emergency credential rotation is preferable to continued exposure of a compromised credential.
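The ordering of the emergency steps is the important part, and a short sketch makes it explicit. `emergency_rotate` and its callable parameters are hypothetical; in practice they would wrap the provider, secrets manager, and orchestrator calls used elsewhere in the deployment.

```python
from typing import Callable


def emergency_rotate(
    compromised_key_id: str,
    dependent_services: list[str],
    create_key: Callable[[], dict],
    revoke_key: Callable[[str], bool],
    store_secret: Callable[[dict], None],
    restart_service: Callable[[str], None],
) -> dict:
    """Revoke-first rotation: accepts downtime in exchange for fast revocation."""
    new_key = create_key()                    # 1. create the replacement
    revoked = revoke_key(compromised_key_id)  # 2. revoke the compromised key
    store_secret(new_key)                     # 3. publish the new credential
    failed_restarts = []
    for service in dependent_services:        # 4. force-restart consumers
        try:
            restart_service(service)
        except Exception:
            failed_restarts.append(service)   # 5. surface failures for follow-up
    return {
        "new_key_id": new_key["key_id"],
        "old_key_revoked": revoked,
        "failed_restarts": failed_restarts,
    }
```

Unlike the dual-key flow, there is no propagation wait here: consumers fail until they are restarted, which is the accepted cost of the procedure.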
Testing Your Rotation Procedures
Rotation procedures that are not regularly tested will fail when they are needed most. Implement the following testing cadence:
- Monthly dry runs: Execute the rotation procedure with dry_run=True to verify that all automation steps complete without errors.
- Quarterly live rotations: Perform actual credential rotations in a staging environment that mirrors production. Verify zero-downtime behavior and measure propagation timing.
- Annual emergency drill: Simulate a credential compromise scenario and execute the emergency rotation procedure end-to-end. Measure time-to-revocation and time-to-recovery.
- Chaos testing: Inject failures into the rotation process (simulated secrets manager outage, network partition during propagation) to verify that rollback procedures work correctly.
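For the chaos-testing item, one lightweight option is a test double that stands in for the secrets client and fails on demand. `FailingSecretsClient` is a hypothetical class written for this sketch; it mimics only the two methods the rotator above calls, `get_secret_value` and `put_secret_value`, with the same keyword arguments.

```python
class FailingSecretsClient:
    """In-memory stand-in for a secrets client that simulates write outages."""

    def __init__(self, fail_puts: int = 1):
        self.fail_puts = fail_puts  # how many writes should fail before recovering
        self.stored: dict[str, str] = {}

    def get_secret_value(self, SecretId: str) -> dict:
        return {"SecretString": self.stored[SecretId]}

    def put_secret_value(self, SecretId: str, SecretString: str) -> None:
        if self.fail_puts > 0:
            self.fail_puts -= 1
            raise ConnectionError("simulated secrets manager outage")
        self.stored[SecretId] = SecretString
```

Passing an instance like this as `secrets_client` lets a test assert that a failed write leaves the stored secret unchanged and that the rotation result reports the failure.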
Common Failure Modes
Understanding how rotation procedures fail helps design more resilient automation:
Partial rotation: The new key is created and stored but the old key is never revoked, leaving two active keys. This is the most benign failure mode but creates credential sprawl over time. Implement a reconciliation job that identifies credentials with multiple active keys and alerts the security team.
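A reconciliation check of this kind can be quite small. The sketch below assumes the job has already collected active keys as dicts with `credential_id` and `key_id` fields; that record shape and the `find_credential_sprawl` name are assumptions made for illustration.

```python
from collections import defaultdict


def find_credential_sprawl(
    active_keys: list[dict], max_active: int = 1
) -> dict[str, list[str]]:
    """Return credentials that have more than max_active active keys."""
    by_credential: dict[str, list[str]] = defaultdict(list)
    for key in active_keys:
        by_credential[key["credential_id"]].append(key["key_id"])
    return {
        cred: key_ids
        for cred, key_ids in by_credential.items()
        if len(key_ids) > max_active
    }
```

Running it outside rotation windows avoids flagging the short period in which two keys are expected to coexist.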
Premature revocation: The old key is revoked before all consumers pick up the new key, causing service outages. This typically happens when the propagation wait is too short or when a container has a longer credential cache TTL than expected. The fix is to increase the propagation wait and implement pre-revocation validation that checks all known consumers are using the new key.
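One way to express the pre-revocation check, assuming each consumer reports the `key_id` it is currently using (for example through a health endpoint or a metrics label, neither of which is specified here), is sketched below. The function names are hypothetical.

```python
def consumers_still_on_old_key(
    expected_consumers: list[str],
    reported_key_ids: dict[str, str],
    new_key_id: str,
) -> list[str]:
    """Return consumers that have not confirmed the new key.

    A consumer that has not reported at all is treated as still on the old key.
    """
    return [
        name for name in expected_consumers
        if reported_key_ids.get(name) != new_key_id
    ]


def safe_to_revoke(
    expected_consumers: list[str],
    reported_key_ids: dict[str, str],
    new_key_id: str,
) -> bool:
    """True only when every known consumer reports the new key."""
    return not consumers_still_on_old_key(
        expected_consumers, reported_key_ids, new_key_id
    )
```

Treating silent consumers as unconfirmed is deliberate: a missing report is exactly the case where a long cache TTL could otherwise go unnoticed.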
Rotation loop: Automated rotation triggers an alert, which triggers a response, which triggers another rotation. This can happen when the monitoring system flags the expected post-rotation usage pattern change as anomalous. Configure alert suppression windows around scheduled rotations to prevent this cycle.
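A suppression window can be as simple as a time comparison before an alert is raised. In the sketch below, `is_suppressed` and the one-hour default window are assumptions for illustration; the right window length depends on how long usage patterns take to settle after a rotation.

```python
from datetime import datetime, timedelta


def is_suppressed(
    alert_time: datetime,
    rotation_times: list[datetime],
    window: timedelta = timedelta(hours=1),
) -> bool:
    """True if the alert falls within `window` after any scheduled rotation."""
    return any(
        rotated_at <= alert_time <= rotated_at + window
        for rotated_at in rotation_times
    )
```

Suppressed alerts are better logged than discarded, so that a genuine anomaly coinciding with a rotation can still be investigated afterwards.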
Secrets manager outage during rotation: If the secrets manager becomes unavailable between creating the new key and storing it, the new key exists at the provider but is not recorded anywhere. The automation should write the new key's details to a fallback location (encrypted local file, separate secrets store) as soon as the provider returns them and before attempting the primary write, so that the key can be recovered or revoked from this state.