Zero Trust Architecture for AI Infrastructure
Implementing and attacking zero trust principles across ML training pipelines, inference endpoints, and model registries
Overview
Zero trust architecture (ZTA) operates on the principle that no network location, user, or system should be inherently trusted. Every access request must be authenticated, authorized, and continuously validated regardless of where it originates. While zero trust has been widely adopted for traditional enterprise infrastructure, its application to AI systems introduces unique challenges that create gaps attackers can exploit.
AI infrastructure has characteristics that strain zero trust implementations. Training clusters require high-bandwidth, low-latency GPU-to-GPU communication (often via RDMA/InfiniBand) that is difficult to intercept and inspect without introducing unacceptable performance overhead. Model artifacts are large (hundreds of gigabytes for modern LLMs) and must be transferred between registries, training systems, and serving infrastructure — creating pressure to bypass security controls for performance. Feature stores, experiment trackers, and data pipelines often use service-to-service authentication with long-lived credentials because the overhead of token rotation is seen as impractical for long-running training jobs.
Inference endpoints must respond in milliseconds, making per-request authorization checks a performance concern.
These tensions between security and performance create predictable gaps in zero trust implementations that red teamers can identify and exploit. This article examines how to apply zero trust principles to AI infrastructure, where implementations typically fall short, and how attackers target those gaps. The content aligns with NIST SP 800-207 (Zero Trust Architecture) and NIST AI RMF for AI-specific risk considerations.
Zero Trust Principles Applied to AI Infrastructure
Identity: Every Component Gets a Verifiable Identity
In a zero trust AI infrastructure, every component — from training jobs to inference endpoints to data pipelines — must have a cryptographically verifiable identity. This goes beyond user authentication to include workload identity for automated processes.
SPIFFE (Secure Production Identity Framework For Everyone) provides a standard for workload identity that is well-suited to AI infrastructure. Each workload receives a SPIFFE Verifiable Identity Document (SVID), typically an X.509 certificate, that encodes its identity as a URI (e.g., spiffe://ai-platform/training/job-12345).
"""
SPIFFE-based workload identity verification for AI pipeline components.
Demonstrates how to verify that a training job, model registry, or
inference endpoint has a valid identity before allowing access.
"""
import ssl
import socket
import json
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse
from cryptography import x509
from cryptography.x509.oid import ExtensionOID, NameOID
from cryptography.hazmat.primitives import hashes
@dataclass
class WorkloadIdentity:
"""Parsed SPIFFE identity from an X.509 SVID."""
spiffe_id: str
trust_domain: str
workload_path: str
certificate_hash: str
not_valid_after: str
@property
def component_type(self) -> str:
"""Extract AI component type from SPIFFE path."""
        parts = self.workload_path.strip("/").split("/")
        if parts and parts[0]:
            return parts[0]  # e.g., "training", "inference", "registry"
        return "unknown"
def extract_spiffe_id(cert: x509.Certificate) -> Optional[WorkloadIdentity]:
"""
Extract SPIFFE ID from X.509 certificate SAN extension.
SPIFFE IDs are encoded as URI SANs in the format:
spiffe://<trust-domain>/<workload-path>
"""
try:
san_ext = cert.extensions.get_extension_for_oid(
ExtensionOID.SUBJECT_ALTERNATIVE_NAME
)
san = san_ext.value
for uri in san.get_values_for_type(x509.UniformResourceIdentifier):
if uri.startswith("spiffe://"):
parsed = urlparse(uri)
cert_hash = cert.fingerprint(hashes.SHA256()).hex()
return WorkloadIdentity(
spiffe_id=uri,
trust_domain=parsed.hostname or "",
workload_path=parsed.path,
certificate_hash=cert_hash,
not_valid_after=str(cert.not_valid_after_utc),
)
except x509.ExtensionNotFound:
return None
return None
class AIZeroTrustVerifier:
"""
Verify workload identity and enforce access policies
for AI infrastructure components.
"""
# Access control matrix: which components can access which
ACCESS_POLICIES = {
"training": {
"allowed_targets": [
"data-store",
"registry",
"experiment-tracker",
"feature-store",
],
"denied_targets": ["inference", "monitoring-admin"],
},
"inference": {
"allowed_targets": [
"registry", # Read-only for model loading
"feature-store", # For feature retrieval
],
"denied_targets": [
"training",
"data-store", # Inference should not access raw training data
],
},
"registry": {
"allowed_targets": ["data-store"],
"denied_targets": ["training", "inference"],
},
"pipeline": {
"allowed_targets": [
"training",
"registry",
"data-store",
"feature-store",
],
"denied_targets": ["inference"],
},
}
def __init__(self, trust_domain: str):
self.trust_domain = trust_domain
def verify_access(
self,
source: WorkloadIdentity,
target_component: str,
) -> tuple[bool, str]:
"""
Verify if a source workload is allowed to access a target component.
Returns (allowed, reason).
"""
# Verify trust domain
if source.trust_domain != self.trust_domain:
return False, (
f"Trust domain mismatch: {source.trust_domain} "
f"!= {self.trust_domain}"
)
# Look up policy for source component type
policy = self.ACCESS_POLICIES.get(source.component_type)
if policy is None:
return False, (
f"No policy defined for component type: "
f"{source.component_type}"
)
if target_component in policy.get("denied_targets", []):
return False, (
f"{source.component_type} is explicitly denied "
f"access to {target_component}"
)
if target_component in policy.get("allowed_targets", []):
return True, "Access permitted by policy"
# Default deny
return False, (
f"No explicit allow for {source.component_type} -> "
f"{target_component}"
)
def audit_connection(
self,
peer_cert_pem: bytes,
target_component: str,
) -> dict:
"""
Full audit of an incoming connection:
1. Parse certificate
2. Extract SPIFFE identity
3. Check access policy
"""
cert = x509.load_pem_x509_certificate(peer_cert_pem)
identity = extract_spiffe_id(cert)
if identity is None:
return {
"allowed": False,
"reason": "No SPIFFE ID in certificate",
"identity": None,
}
allowed, reason = self.verify_access(identity, target_component)
return {
"allowed": allowed,
"reason": reason,
"identity": {
"spiffe_id": identity.spiffe_id,
"component_type": identity.component_type,
"cert_hash": identity.certificate_hash,
},
        }

Microsegmentation for AI Networks
AI training clusters typically use high-speed interconnects (InfiniBand, RoCE) for GPU-to-GPU communication during distributed training. These networks are often treated as trusted because of the performance sensitivity of collective operations (AllReduce, AllGather). This creates a significant blind spot in zero trust implementations.
The InfiniBand trust gap: InfiniBand networks used in GPU clusters do not support the same network policy enforcement available in Ethernet-based Kubernetes networks. Tools like Calico and Cilium can enforce microsegmentation for pod-to-pod Ethernet traffic, but InfiniBand traffic bypasses the kernel networking stack entirely through RDMA, making it invisible to eBPF-based network policies.
"""
Audit script for identifying zero trust gaps in AI infrastructure
network segmentation, with focus on GPU cluster interconnects.
"""
import subprocess
import json
import re
from typing import Optional
def audit_kubernetes_network_policies(namespace: str = "ml-platform") -> list[dict]:
"""
Audit Kubernetes network policies for AI workload namespaces.
Identifies missing policies that would allow unrestricted
communication between components.
"""
findings = []
# Get all pods in the namespace
result = subprocess.run(
["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
capture_output=True, text=True, timeout=30,
)
pods = json.loads(result.stdout)
# Get network policies
result = subprocess.run(
["kubectl", "get", "networkpolicies", "-n", namespace, "-o", "json"],
capture_output=True, text=True, timeout=30,
)
policies = json.loads(result.stdout)
# Check if default-deny exists
has_default_deny = any(
policy["metadata"]["name"].startswith("default-deny")
for policy in policies.get("items", [])
)
if not has_default_deny:
findings.append({
"severity": "HIGH",
"title": f"No default-deny policy in namespace {namespace}",
"detail": (
"Without a default-deny ingress/egress policy, all pods "
"can communicate freely. AI components (training, inference, "
"registry) should be isolated by default."
),
})
# Check for pods with host networking
for pod in pods.get("items", []):
pod_name = pod["metadata"]["name"]
spec = pod.get("spec", {})
if spec.get("hostNetwork", False):
findings.append({
"severity": "HIGH",
"title": f"Pod {pod_name} uses host networking",
"detail": (
"Host networking bypasses all Kubernetes network "
"policies. This pod has unrestricted network access "
"to the node and potentially the InfiniBand fabric."
),
})
# Check for privileged containers (common for GPU workloads)
for container in spec.get("containers", []):
sec_ctx = container.get("securityContext", {})
if sec_ctx.get("privileged", False):
findings.append({
"severity": "HIGH",
"title": (
f"Privileged container: {pod_name}/"
f"{container['name']}"
),
"detail": (
"Privileged containers can access all host "
"devices including InfiniBand HCAs, bypass "
"network namespaces, and escape container "
"isolation."
),
})
# Check for RDMA/InfiniBand device access
for pod in pods.get("items", []):
pod_name = pod["metadata"]["name"]
for container in pod.get("spec", {}).get("containers", []):
resources = container.get("resources", {})
limits = resources.get("limits", {})
requests = resources.get("requests", {})
all_resources = {**limits, **requests}
for resource_name in all_resources:
if "rdma" in resource_name or "infiniband" in resource_name:
findings.append({
"severity": "MEDIUM",
"title": (
f"RDMA device access: {pod_name}/"
f"{container['name']}"
),
"detail": (
f"Container requests {resource_name}. RDMA "
f"traffic bypasses kernel networking and is "
f"not subject to NetworkPolicy enforcement."
),
})
return findings
def check_service_mesh_coverage(namespace: str = "ml-platform") -> list[dict]:
"""
Verify that a service mesh (Istio/Linkerd) covers AI workloads
and that mTLS is enforced.
"""
findings = []
# Check for Istio sidecar injection
result = subprocess.run(
[
"kubectl", "get", "pods", "-n", namespace,
"-o", "jsonpath={range .items[*]}{.metadata.name}{"
"\\t}{.spec.containers[*].name}{\\n}{end}",
],
capture_output=True, text=True, timeout=30,
)
for line in result.stdout.strip().split("\n"):
if not line.strip():
continue
parts = line.split("\t")
if len(parts) < 2:
continue
pod_name = parts[0]
containers = parts[1].split()
has_sidecar = any(
c in containers
for c in ["istio-proxy", "linkerd-proxy", "envoy-sidecar"]
)
if not has_sidecar:
findings.append({
"severity": "MEDIUM",
"title": f"No service mesh sidecar: {pod_name}",
"detail": (
"This pod communicates without mTLS enforcement. "
"Traffic can be intercepted or spoofed by adjacent "
"workloads."
),
})
# Check Istio PeerAuthentication policy
result = subprocess.run(
[
"kubectl", "get", "peerauthentication", "-n", namespace,
"-o", "json",
],
capture_output=True, text=True, timeout=30,
)
if result.returncode == 0:
pa_policies = json.loads(result.stdout)
strict_mtls = any(
policy.get("spec", {}).get("mtls", {}).get("mode") == "STRICT"
for policy in pa_policies.get("items", [])
)
if not strict_mtls:
findings.append({
"severity": "HIGH",
"title": "mTLS not set to STRICT mode",
"detail": (
"PERMISSIVE mTLS allows plaintext connections. "
"An attacker in the mesh can intercept inference "
"requests, model weights, and training data."
),
})
    return findings

Attacking Zero Trust Gaps in AI Pipelines
Exploiting Implicit Trust Between Pipeline Stages
ML pipelines (built with tools like Kubeflow, Airflow, or custom systems) often establish trust between stages implicitly. A training stage produces a model artifact that the evaluation stage consumes, and the evaluation stage's approval triggers deployment. If the pipeline trusts artifacts from previous stages without verification, an attacker who compromises any single stage can propagate through the entire pipeline.
"""
Demonstrate trust boundary violations in ML pipelines.
This script identifies pipeline stages that accept artifacts
from upstream stages without integrity verification.
"""
import hashlib
import json
import os
from pathlib import Path
from dataclasses import dataclass
from typing import Optional
@dataclass
class PipelineArtifact:
"""Represents an artifact passed between pipeline stages."""
stage_name: str
artifact_path: str
expected_hash: Optional[str]
actual_hash: Optional[str]
is_signed: bool
signature_valid: Optional[bool]
def audit_pipeline_artifacts(
pipeline_run_dir: str,
) -> list[dict]:
"""
Audit artifacts in a pipeline run directory for
integrity verification gaps.
"""
findings = []
run_path = Path(pipeline_run_dir)
if not run_path.exists():
return [{"severity": "ERROR", "title": "Pipeline run directory not found",
"detail": f"{pipeline_run_dir} does not exist"}]
# Look for common pipeline metadata files
metadata_files = list(run_path.rglob("**/metadata.json")) + \
list(run_path.rglob("**/artifact_info.json"))
for meta_file in metadata_files:
try:
with open(meta_file) as f:
metadata = json.load(f)
except (json.JSONDecodeError, IOError):
continue
stage_name = metadata.get("stage", meta_file.parent.name)
artifacts = metadata.get("output_artifacts", [])
for artifact in artifacts:
art_path = artifact.get("path", "")
has_hash = "sha256" in artifact or "hash" in artifact
has_signature = "signature" in artifact
if not has_hash:
findings.append({
"severity": "HIGH",
"title": f"No integrity hash: {stage_name}/{art_path}",
"detail": (
f"Artifact from stage '{stage_name}' has no hash. "
f"A compromised upstream stage could substitute "
f"a malicious artifact (e.g., poisoned model weights)."
),
})
if not has_signature:
findings.append({
"severity": "MEDIUM",
"title": f"No signature: {stage_name}/{art_path}",
"detail": (
f"Artifact is not cryptographically signed. "
f"Even with a hash, the hash itself could be "
f"modified by a compromised pipeline controller."
),
})
# Verify hash if present
if has_hash:
expected = artifact.get("sha256") or artifact.get("hash")
full_path = run_path / art_path
if full_path.exists():
actual = hashlib.sha256(
full_path.read_bytes()
).hexdigest()
if actual != expected:
findings.append({
"severity": "CRITICAL",
"title": (
f"Hash mismatch: {stage_name}/{art_path}"
),
"detail": (
f"Expected {expected}, got {actual}. "
f"Artifact may have been tampered with."
),
})
# Check for credential passing between stages
env_files = list(run_path.rglob("**/.env")) + \
list(run_path.rglob("**/secrets.*"))
for env_file in env_files:
findings.append({
"severity": "HIGH",
"title": f"Credentials in pipeline artifacts: {env_file}",
"detail": (
"Secrets stored in pipeline artifacts can be accessed "
"by downstream stages and persisted in artifact storage."
),
})
    return findings

Token and Credential Attacks
Long-running training jobs often use service account tokens or API keys with extended validity. In zero trust architectures, these should be short-lived and continuously validated. Common gaps include:
- Static service account tokens in Kubernetes that do not expire (pre-v1.24 default behavior)
- Cloud IAM roles with overly broad permissions attached to training node pools
- Model registry credentials embedded in pipeline configurations
- Experiment tracking API keys shared across all team members
An attacker who obtains a training job's credentials gains access to everything that training job can access: training data, the model registry, experiment tracking, and potentially other cloud resources through role chaining or federation.
"""
Credential exposure analysis for AI workloads in Kubernetes.
Identifies overly broad credentials, long-lived tokens, and
credential sharing patterns that violate zero trust principles.
"""
import subprocess
import json
import base64
from typing import Any
def audit_ai_credentials(namespace: str = "ml-platform") -> list[dict]:
"""
Audit credentials available to AI workloads for zero trust
violations: excessive scope, long validity, and sharing.
"""
findings = []
# Get all service accounts in the namespace
result = subprocess.run(
["kubectl", "get", "serviceaccounts", "-n", namespace, "-o", "json"],
capture_output=True, text=True, timeout=30,
)
if result.returncode != 0:
return findings
service_accounts = json.loads(result.stdout)
for sa in service_accounts.get("items", []):
sa_name = sa["metadata"]["name"]
# Check for mounted secrets
secrets = sa.get("secrets", [])
if len(secrets) > 0:
findings.append({
"severity": "MEDIUM",
"title": f"Service account has bound secrets: {sa_name}",
"detail": (
f"SA {sa_name} has {len(secrets)} bound secrets. "
f"In zero trust, prefer projected service account tokens "
f"with expiration over static secrets."
),
})
# Check annotations for cloud IAM bindings
annotations = sa.get("metadata", {}).get("annotations", {})
# GKE Workload Identity
gke_sa = annotations.get(
"iam.gke.io/gcp-service-account", ""
)
if gke_sa:
findings.append({
"severity": "INFO",
"title": f"GKE Workload Identity binding: {sa_name}",
"detail": f"Bound to GCP SA: {gke_sa}. Verify scope is minimal.",
})
# EKS IRSA
eks_role = annotations.get(
"eks.amazonaws.com/role-arn", ""
)
if eks_role:
findings.append({
"severity": "INFO",
"title": f"EKS IRSA binding: {sa_name}",
"detail": f"Bound to IAM role: {eks_role}. Verify role policy scope.",
})
# Check for pods with environment variable credentials
pods_result = subprocess.run(
["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
capture_output=True, text=True, timeout=30,
)
if pods_result.returncode == 0:
pods = json.loads(pods_result.stdout)
sensitive_env_patterns = [
"KEY", "SECRET", "PASSWORD", "TOKEN", "CREDENTIAL",
"API_KEY", "ACCESS_KEY", "PRIVATE_KEY",
]
for pod in pods.get("items", []):
pod_name = pod["metadata"]["name"]
for container in pod.get("spec", {}).get("containers", []):
for env in container.get("env", []):
env_name = env.get("name", "").upper()
if any(p in env_name for p in sensitive_env_patterns):
# Check if it's from a secret reference (better)
# or a plaintext value (worse)
if "value" in env and env["value"]:
findings.append({
"severity": "CRITICAL",
"title": (
f"Hardcoded credential: {pod_name} "
f"env {env['name']}"
),
"detail": (
"Credential is hardcoded in pod spec "
"as a plaintext value. Use Kubernetes "
"secrets with projected volumes or "
"external secret managers."
),
})
elif "valueFrom" in env:
source = env["valueFrom"]
if "secretKeyRef" in source:
findings.append({
"severity": "LOW",
"title": (
f"Secret-backed credential: "
f"{pod_name} env {env['name']}"
),
"detail": (
f"From secret: "
f"{source['secretKeyRef'].get('name')}. "
f"Verify rotation policy."
),
})
    return findings

Continuous Verification and Device Posture
Zero trust architectures require continuous verification — not just authenticating once at connection time. For AI workloads, this means:
- Runtime integrity checking: Verify that the training script, model handler, or serving binary has not been modified since deployment. Container image digests should be verified at pod admission and periodically at runtime.
- Node attestation: GPU nodes should attest their integrity before being trusted with sensitive model weights or training data. Hardware-based attestation (TPM, TEE attestation) provides stronger guarantees than software-only checks.
- Behavioral monitoring: Continuously monitor AI workload behavior for anomalies. A training job that suddenly begins making outbound network connections it has never made before, or an inference endpoint whose response latency distribution changes dramatically, may be compromised.
- Token refresh under policy re-evaluation: When credentials are refreshed, the authorization decision should be re-evaluated against current policy. This ensures that policy changes (such as revoking a team's access to a model) take effect within the token lifetime.
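The first of these checks can be sketched as a periodic digest comparison: record a SHA-256 digest of each deployed artifact at admission time, then re-hash on a schedule and flag drift. A minimal sketch, where the baseline mapping stands in for a real admission-time record and the paths are illustrative:

```python
"""Periodic runtime integrity check: compare current file digests
against values recorded at deploy time. Illustrative sketch only --
a production system would source the baseline from admission control."""
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream the file through SHA-256 to avoid loading it whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_integrity(baseline: dict[str, str], root: Path) -> list[dict]:
    """Return findings for any artifact whose digest drifted or vanished."""
    findings = []
    for rel_path, expected in baseline.items():
        target = root / rel_path
        if not target.exists():
            findings.append({"severity": "HIGH", "path": rel_path,
                             "detail": "artifact missing at runtime"})
            continue
        actual = sha256_file(target)
        if actual != expected:
            findings.append({
                "severity": "CRITICAL", "path": rel_path,
                "detail": f"digest drift: {expected[:12]} -> {actual[:12]}",
            })
    return findings
```

Run on a timer (or from a DaemonSet), this catches in-place modification of training scripts and serving binaries that a one-time admission check would miss.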
Practical Examples
Zero Trust Compliance Checker for AI Platforms
#!/usr/bin/env bash
# Zero trust compliance audit for AI infrastructure on Kubernetes
# Checks for common violations of zero trust principles
set -euo pipefail
NAMESPACE="${1:-ml-platform}"
echo "=== Zero Trust Audit: Namespace $NAMESPACE ==="
echo ""
echo "--- 1. Default Deny Network Policies ---"
DENY_POLICIES=$(kubectl get networkpolicies -n "$NAMESPACE" \
-o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null \
| grep -c "default-deny" || true)
if [ "$DENY_POLICIES" -eq 0 ]; then
echo "[FAIL] No default-deny network policy found"
else
echo "[PASS] Default-deny policy exists"
fi
echo ""
echo "--- 2. Service Mesh mTLS ---"
STRICT_MTLS=$(kubectl get peerauthentication -n "$NAMESPACE" \
-o jsonpath='{range .items[*]}{.spec.mtls.mode}{"\n"}{end}' 2>/dev/null \
| grep -c "STRICT" || true)
if [ "$STRICT_MTLS" -eq 0 ]; then
echo "[FAIL] No STRICT mTLS PeerAuthentication policy"
else
echo "[PASS] STRICT mTLS enforced"
fi
echo ""
echo "--- 3. Service Account Token Projection ---"
# Check for pods using legacy non-expiring tokens
LEGACY_TOKENS=$(kubectl get pods -n "$NAMESPACE" -o json 2>/dev/null \
| python3 -c "
import json, sys
data = json.load(sys.stdin)
count = 0
for pod in data.get('items', []):
for vol in pod.get('spec', {}).get('volumes', []):
if 'secret' in vol and 'token' in vol.get('secret', {}).get('secretName', '').lower():
count += 1
print(f' Legacy token: {pod[\"metadata\"][\"name\"]}')
print(f'Total: {count}')
" 2>/dev/null)
echo "$LEGACY_TOKENS"
echo ""
echo "--- 4. Privileged Containers (Zero Trust Violation) ---"
kubectl get pods -n "$NAMESPACE" -o json 2>/dev/null \
| python3 -c "
import json, sys
data = json.load(sys.stdin)
for pod in data.get('items', []):
for c in pod.get('spec', {}).get('containers', []):
sc = c.get('securityContext', {})
if sc.get('privileged'):
print(f' [FAIL] {pod[\"metadata\"][\"name\"]}/{c[\"name\"]} is privileged')
if sc.get('runAsUser') == 0:
print(f' [WARN] {pod[\"metadata\"][\"name\"]}/{c[\"name\"]} runs as root')
" 2>/dev/null
echo ""
echo "--- 5. External Access Points ---"
echo "Services with LoadBalancer or NodePort (exposed externally):"
kubectl get svc -n "$NAMESPACE" -o json 2>/dev/null \
| python3 -c "
import json, sys
data = json.load(sys.stdin)
for svc in data.get('items', []):
svc_type = svc.get('spec', {}).get('type', 'ClusterIP')
if svc_type in ('LoadBalancer', 'NodePort'):
name = svc['metadata']['name']
ports = svc['spec'].get('ports', [])
port_str = ', '.join(str(p.get('port', '?')) for p in ports)
print(f' [WARN] {name} ({svc_type}): ports {port_str}')
" 2>/dev/null
echo ""
echo "=== Audit Complete ==="Defense and Mitigation
Implement workload identity everywhere: Use SPIFFE/SPIRE or cloud-native workload identity (GKE Workload Identity, EKS IRSA) for all AI components. Eliminate static credentials and service account keys. Training jobs should use short-lived tokens that are rotated automatically.
Enforce mTLS for all service-to-service communication: Deploy a service mesh (Istio, Linkerd) in STRICT mTLS mode. For high-performance training networks using RDMA/InfiniBand, implement application-layer encryption where kernel-bypass networking prevents mesh-level enforcement.
Apply default-deny network policies: Every AI namespace should have a default-deny ingress and egress policy. Explicitly allow only required communication paths: training to data store, inference to model registry, pipeline controller to individual stages.
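As a starting point, the default-deny baseline can be generated programmatically and applied per namespace. This sketch emits the manifest as JSON (the namespace name is an assumption; the per-path allow policies mentioned above would be layered on as separate NetworkPolicy objects):

```python
"""Emit a default-deny-all NetworkPolicy manifest for an AI namespace.
Sketch only -- explicit allow rules (training -> data store,
inference -> registry) are added as separate policies on top."""
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and allows nothing."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Declaring both policyTypes with no ingress/egress rules
            # denies all traffic in both directions.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # Pipe to `kubectl apply -f -` to install the baseline.
    print(json.dumps(default_deny_policy("ml-platform"), indent=2))
```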
Verify artifacts at every boundary: Every pipeline stage should verify the integrity of incoming artifacts using cryptographic signatures, not just hashes. Use tools like Sigstore/cosign for model artifact signing and verification.
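Sigstore/cosign is the recommended tooling for this; as an illustration of the underlying sign-then-verify flow it automates, the sketch below signs a model artifact's digest with Ed25519 via the `cryptography` library (key distribution and transparency-log features are deliberately omitted):

```python
"""Sign and verify a model artifact digest with Ed25519.
Sketch of the flow that Sigstore/cosign automates -- key management
is deliberately simplified for illustration."""
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def sign_artifact(key: Ed25519PrivateKey, artifact: bytes) -> bytes:
    """Sign the SHA-256 digest of the artifact rather than the raw bytes."""
    return key.sign(hashlib.sha256(artifact).digest())


def verify_artifact(
    public_key: Ed25519PublicKey, artifact: bytes, signature: bytes
) -> bool:
    """Return True only if the signature matches the artifact digest."""
    try:
        public_key.verify(signature, hashlib.sha256(artifact).digest())
        return True
    except InvalidSignature:
        return False
```

The training stage signs the weights it produces; every downstream stage (evaluation, registry push, serving) calls `verify_artifact` before loading them, so a substituted artifact fails at the first boundary it crosses.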
Short-lived credentials with continuous validation: Training jobs should receive credentials that expire before the job completes, requiring renewal through a token exchange that re-evaluates authorization. This limits the blast radius of credential theft.
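The refresh-time re-evaluation can be sketched as follows; the policy store and token format here are illustrative, standing in for whatever IAM or token-exchange service the platform uses:

```python
"""Token issuance that re-evaluates policy on every refresh.
Sketch -- the in-memory policy dict stands in for a real policy store."""
import secrets
import time


class RefreshingTokenIssuer:
    def __init__(self, policy: dict[str, set[str]], ttl_seconds: int = 900):
        # policy maps a workload identity to the scopes it may hold *now*
        self.policy = policy
        self.ttl = ttl_seconds

    def issue(self, workload_id: str, requested_scopes: set[str]) -> dict:
        """Re-check policy at every issuance, including refreshes.

        A revoked scope disappears from the next token even though the
        previous token carried it, bounding revocation lag by the TTL."""
        granted = requested_scopes & self.policy.get(workload_id, set())
        if not granted:
            raise PermissionError(f"{workload_id}: no scopes permitted")
        return {
            "token": secrets.token_urlsafe(32),
            "scopes": sorted(granted),
            "expires_at": time.time() + self.ttl,
        }
```

When the platform revokes a team's `registry:write` scope, the next refresh (at most one TTL later) issues a token without it, rather than waiting for a long-lived credential to expire.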
Monitor and alert on policy violations: Implement continuous compliance monitoring that detects network policy changes, new privileged workloads, service mesh bypass, and credential anomalies. Integrate with SIEM for correlation with other security events.
Implement data-level zero trust: Beyond network and identity, apply zero trust principles to data itself. Training data should carry metadata about its provenance and integrity. Model artifacts should be signed and verified at every loading point. Inference inputs and outputs should be validated against expected schemas and distributions. This data-level zero trust approach catches attacks that bypass network controls, such as data poisoning through legitimate pipeline components.
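The input-validation piece can be sketched as a schema-and-range gate in front of the model; the feature names and bounds below are illustrative assumptions, where a real deployment would derive them from the model signature and training data statistics:

```python
"""Schema and range validation for inference inputs.
Sketch -- EXPECTED_SCHEMA is illustrative; production systems would
generate it from the model signature and observed training statistics."""

EXPECTED_SCHEMA = {
    # feature name -> (type, (min, max) observed in training data)
    "age": (float, (0.0, 120.0)),
    "income": (float, (0.0, 1e7)),
}


def validate_inference_input(payload: dict) -> list[str]:
    """Return a list of violations; an empty list means the input passes."""
    violations = []
    for name, (ftype, (lo, hi)) in EXPECTED_SCHEMA.items():
        if name not in payload:
            violations.append(f"missing feature: {name}")
            continue
        value = payload[name]
        if not isinstance(value, ftype):
            violations.append(f"{name}: expected {ftype.__name__}")
            continue
        if not (lo <= value <= hi):
            violations.append(f"{name}: {value} outside [{lo}, {hi}]")
    extra = set(payload) - set(EXPECTED_SCHEMA)
    if extra:
        violations.append(f"unexpected features: {sorted(extra)}")
    return violations
```

Rejecting out-of-distribution or malformed inputs at this boundary catches probing and evasion attempts that arrive through otherwise legitimate, authenticated channels.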
Segment by sensitivity level: Not all AI workloads require the same security posture. A model that classifies product images has different risk than one that processes medical records. Implement tiered zero trust zones where the strictest controls (hardware attestation, encrypted inference, fully isolated networks) are reserved for the most sensitive AI workloads, while less sensitive workloads use lighter controls. This prevents the performance overhead of maximum security from becoming a barrier to adoption.
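A tiered model can be encoded as a simple controls matrix and checked at deploy time; the tier names and control sets below are illustrative assumptions, not a prescribed taxonomy:

```python
"""Map workload sensitivity tiers to required zero trust controls.
Sketch -- tier names and control sets are illustrative."""

TIER_CONTROLS = {
    "public": {"mtls", "default-deny"},
    "internal": {"mtls", "default-deny", "short-lived-tokens"},
    "restricted": {"mtls", "default-deny", "short-lived-tokens",
                   "node-attestation", "encrypted-inference",
                   "isolated-network"},
}


def missing_controls(tier: str, deployed: set[str]) -> set[str]:
    """Return the controls the tier requires that are not yet deployed."""
    required = TIER_CONTROLS.get(tier)
    if required is None:
        raise ValueError(f"unknown tier: {tier}")
    return required - deployed
```

Running this check in the deployment pipeline blocks a medical-records model from shipping to a zone that only meets the "public" bar, while leaving low-risk workloads unburdened.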
Regularly test zero trust controls: Zero trust architectures degrade over time as exceptions accumulate, configurations drift, and new components are added without proper integration. Schedule regular penetration testing specifically targeting zero trust boundaries — attempt lateral movement from training to inference, from one tenant to another, from a compromised container to the host, and from an internal position to external data exfiltration. Each test validates that the controls actually work, not just that they are configured.
References
- Rose, S., Borchert, O., Mitchell, S., & Connelly, S. (2020). "Zero Trust Architecture." NIST Special Publication 800-207. https://doi.org/10.6028/NIST.SP.800-207
- SPIFFE. (2024). "Secure Production Identity Framework for Everyone." https://spiffe.io/docs/latest/spiffe-about/overview/
- NIST. (2023). "Artificial Intelligence Risk Management Framework (AI RMF 1.0)." NIST AI 100-1. https://www.nist.gov/itl/ai-risk-management-framework
- MITRE ATLAS. "Techniques: ML Supply Chain Compromise." https://atlas.mitre.org/techniques/AML.T0010
- Ward, R., & Beyer, B. (2014). "BeyondCorp: A New Approach to Enterprise Security." ;login:, 39(6). https://cloud.google.com/beyondcorp
- Kubernetes. (2024). "Network Policies." https://kubernetes.io/docs/concepts/services-networking/network-policies/