AI Deployment Patterns and Security Implications
How API-based, self-hosted, edge, and hybrid deployment patterns each create distinct security considerations and attack surfaces for AI systems.
Deployment Determines Attack Surface
The same AI model deployed in different ways presents fundamentally different security profiles. A model accessible only through a rate-limited API with server-side guardrails is a very different target than the same model running locally on a user's laptop with no external controls. Understanding deployment patterns is essential for scoping red team engagements and prioritizing attacks.
Pattern 1: API-Based Deployment
The model runs on the provider's infrastructure and is accessed through a REST API. This is the most common pattern for commercial AI services.
Architecture
┌──────────┐ HTTPS ┌──────────────────────────┐
│ Client │ ──────────────→ │ API Gateway │
│ App │ │ ├─ Authentication │
│ │ ←────────────── │ ├─ Rate Limiting │
└──────────┘ │ ├─ Input Guardrails │
│ ├─ Model Inference │
│ ├─ Output Guardrails │
│ └─ Logging/Monitoring │
└──────────────────────────┘
Security Characteristics
| Aspect | Assessment |
|---|---|
| Authentication | API keys, OAuth tokens — provider controls access |
| Rate limiting | Server-side enforcement — provider can throttle abuse |
| Guardrails | Server-side — consistent enforcement across all clients |
| Monitoring | Full visibility into all requests and responses |
| Model access | Weights are not exposed — extraction requires many queries |
| Update control | Provider can patch vulnerabilities centrally |
Attack Surface
- API key theft or leakage: Keys embedded in client-side code, version control, or logs
- Rate limit bypass: Distributed requests, key rotation, endpoint multiplexing
- Authentication flaws: Insufficient key scoping, missing key rotation, overly permissive CORS
- Input/output interception: Man-in-the-middle attacks if TLS is not properly implemented
- Prompt injection: Remains possible through the API regardless of server-side controls
- Side-channel information: Response timing, token counts, and error messages can reveal information about guardrails and model behavior
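The first bullet above — keys embedded in client-side code or version control — is cheap to hunt for mechanically. A minimal sketch of a leak scanner follows; the regex patterns are illustrative assumptions (real providers use several key formats beyond these two), not an exhaustive ruleset.

```python
import re

# Illustrative credential patterns -- an assumption, not a complete list.
KEY_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),  # OpenAI-style secret keys
    re.compile(r"AKIA[0-9A-Z]{16}"),     # AWS access key IDs
]

def find_leaked_keys(text: str) -> list[str]:
    """Return substrings that look like embedded API credentials."""
    hits: list[str] = []
    for pattern in KEY_PATTERNS:
        hits.extend(pattern.findall(text))
    return hits

source = 'const client = new Client({ apiKey: "sk-abc123def456ghi789jkl012" });'
print(find_leaked_keys(source))  # the embedded key is flagged
```

In practice this same scan runs against git history, build artifacts, and log aggregators, since keys leak through all three.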
Pattern 2: Self-Hosted Deployment
The organization runs the model on its own infrastructure (on-premises servers, private cloud, or dedicated cloud instances).
Architecture
┌──────────┐ ┌──────────────────────────┐
│ Client │ ──────────────→ │ Organization's Infra │
│ App │ │ ├─ Load Balancer │
│ │ ←────────────── │ ├─ API Layer │
└──────────┘ │ ├─ Custom Guardrails │
│ ├─ Inference Engine │
│ │ (vLLM, TGI, etc.) │
│ ├─ Model Weights │
│ └─ Custom Monitoring │
└──────────────────────────┘
Security Characteristics
| Aspect | Assessment |
|---|---|
| Authentication | Organization-defined — quality varies widely |
| Rate limiting | Must be implemented by the organization |
| Guardrails | Must be built or integrated by the organization |
| Monitoring | Must be configured by the organization |
| Model access | Weights are on the organization's infrastructure — insider threat risk |
| Update control | Organization controls update pace — may lag behind patches |
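The table notes that rate limiting "must be implemented by the organization" — and bespoke implementations are exactly where flaws creep in. As a point of reference, here is a minimal token-bucket sketch of the kind of server-side limiter a self-hosted deployment has to build (or integrate) itself; it is a single-process illustration, not a production design (real deployments need per-key buckets and shared state across replicas).

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter -- a sketch of the server-side
    throttling that a self-hosted deployment must provide on its own."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate                 # tokens refilled per second
        self.capacity = capacity         # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; refuse the request otherwise."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=3)
print([bucket.allow() for _ in range(5)])  # initial burst allowed, then throttled
```

Note the attack surface this creates: if the bucket lives in one process while inference is load-balanced across many, distributed requests bypass it — the "rate limit bypass" item from the API-based pattern, recreated by a custom implementation.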
Attack Surface
Self-hosted deployments inherit the full attack surface of API-based deployments, plus:
- Infrastructure vulnerabilities: Unpatched servers, misconfigured networking, exposed management interfaces
- Model weight theft: Physical or network access to model files enables complete model extraction
- Custom code vulnerabilities: Bespoke guardrails and API layers may have security flaws that mature provider implementations do not
- Inference engine vulnerabilities: vLLM, text-generation-inference, and other serving frameworks may have their own security issues
- Dependency chain: Custom deployments depend on many libraries (transformers, torch, CUDA), each a potential attack vector
- Configuration drift: Without centralized management, security configurations may degrade over time
Advantages for Defenders
Self-hosted deployments also offer security advantages:
- Full control over data — no data leaves the organization's infrastructure
- Ability to implement domain-specific guardrails that providers do not offer
- No dependency on a third-party provider's security posture
- Ability to run models in air-gapped environments for highly sensitive workloads
Pattern 3: Edge Deployment
The model runs on end-user devices — smartphones, laptops, IoT devices, or embedded systems. This pattern is growing rapidly with the release of efficient small models.
Architecture
┌─────────────────────────────────┐
│ End-User Device │
│ ├─ Application │
│ ├─ Local Inference Runtime │
│ │ (ONNX, Core ML, llama.cpp)│
│ ├─ Model Weights (quantized) │
│ └─ (Optional) Client-side │
│ guardrails │
└─────────────────────────────────┘
│ (optional)
▼
┌──────────────────────┐
│ Backend Services │
│ (analytics, updates)│
└──────────────────────┘
Security Characteristics
| Aspect | Assessment |
|---|---|
| Authentication | N/A — the user controls the device |
| Rate limiting | Not applicable — local execution has no external rate limit |
| Guardrails | Client-side only — can be bypassed by the device owner |
| Monitoring | Limited to what the app reports back — easily circumvented |
| Model access | Weights are on the device — can be extracted with device access |
| Update control | Dependent on user accepting updates |
Attack Surface
Edge deployment fundamentally changes the security model because the user is a potential attacker who controls the execution environment:
- Model weight extraction: Weights are stored on the device and can be extracted, reverse-engineered, or modified
- Guardrail removal: Any client-side safety measures can be disabled by modifying the application
- Model modification: Weights can be altered to remove safety fine-tuning or inject backdoors
- Unrestricted inference: No rate limits, no monitoring, no usage restrictions once the model is on the device
- Derivative model creation: Extracted weights can be used to create fine-tuned variants without safety constraints
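Against the "model modification" item, one partial defense is pinning a hash of the shipped weights at build time and verifying it before loading. The sketch below simulates this with a stand-in file; on a real edge device the attacker controls the checking code too, so this only raises the bar (it catches casual tampering, not a determined device owner who patches the check out).

```python
import hashlib
import tempfile
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Stream a (potentially large) weight file through SHA-256."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# Simulate a shipped weight file and an on-device tampered copy.
weights = Path(tempfile.mkdtemp()) / "model.bin"
weights.write_bytes(b"\x00" * 1024)              # stand-in for real weights
expected = file_sha256(weights)                  # hash pinned at build time

weights.write_bytes(b"\x00" * 1023 + b"\x01")    # attacker flips one byte
print(file_sha256(weights) == expected)          # False: tampering detected
```

During an engagement, the inverse view applies: if the target app performs a check like this, locating and disabling it is usually the first step toward guardrail removal and model modification.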
Pattern 4: Hybrid Deployment
Hybrid architectures combine multiple patterns, typically using edge deployment for simple tasks and cloud deployment for complex ones.
Architecture
┌─────────────────────┐ ┌─────────────────────┐
│ End-User Device │ │ Cloud Backend │
│ ├─ Small Local │ ─────→ │ ├─ Full Model │
│ │ Model │ Complex │ ├─ Guardrails │
│ ├─ Simple Tasks │ queries │ ├─ Monitoring │
│ └─ Routing Logic │ ←───── │ └─ Analytics │
│ │ Results │ │
└─────────────────────┘ └─────────────────────┘
Security Characteristics
Hybrid deployments combine the attack surfaces of both edge and cloud patterns and introduce additional risks:
- Routing manipulation: Tricking the routing logic into sending sensitive queries to the local (unguarded) model instead of the cloud model
- Trust boundary confusion: The system must decide which model to trust when they disagree
- Data leakage through routing: The routing decision itself may reveal information about query sensitivity
- Inconsistent behavior: Users may discover that the local and cloud models respond differently to the same input, revealing the existence of server-side guardrails
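To make the routing-manipulation risk concrete, here is a deliberately naive router of the kind a hybrid system might ship: a hypothetical keyword check (the keyword list and routing rule are illustrative assumptions, not any vendor's actual logic). Because the routing decision runs on a string match, trivial obfuscation steers a sensitive query to the unguarded local model.

```python
# Hypothetical denylist-based router -- illustrative only.
SENSITIVE_KEYWORDS = {"weapon", "exploit", "malware"}

def route(query: str) -> str:
    """Naive routing logic: send 'sensitive' queries to the guarded
    cloud model, everything else to the unguarded local model."""
    words = set(query.lower().split())
    return "cloud" if words & SENSITIVE_KEYWORDS else "local"

print(route("how do I write malware"))    # cloud -- the guarded path
print(route("how do I write mal ware"))   # local -- bypassed by splitting one word
```

The same failure mode applies to more sophisticated routers (classifier-based routing can be evaded with paraphrase), which is why routing decisions should be treated as an attack surface in their own right during testing.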
Deployment Pattern Comparison
| Factor | API-Based | Self-Hosted | Edge | Hybrid |
|---|---|---|---|---|
| Guardrail enforcement | Strong | Variable | Weak | Mixed |
| Monitoring capability | Full | Full | Limited | Partial |
| Model weight protection | Strong | Moderate | None | Mixed |
| Data privacy | Low (data sent to provider) | High | High | Variable |
| Attack sophistication required | Medium | Medium-High | Low | Medium |
| Update speed | Fast (provider-controlled) | Slow (org-controlled) | Slowest (user-controlled) | Variable |
| Regulatory compliance | Provider-dependent | Organization-controlled | Challenging | Complex |
Identifying the Deployment Pattern
During the reconnaissance phase of an engagement, several indicators reveal the deployment pattern:
| Indicator | Suggests |
|---|---|
| Requests go to api.openai.com, api.anthropic.com, etc. | API-based (direct provider) |
| Requests go to the organization's domain with AI-related endpoints | Self-hosted or proxied API |
| Model responses have consistent latency regardless of model size | API-based (provider handles scaling) |
| Responses work offline | Edge deployment |
| Response quality varies based on connectivity | Hybrid deployment |
| API errors reference vLLM, TGI, or Triton | Self-hosted |
| Response headers include provider-specific identifiers | API-based |
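Several of these indicators can be folded into a small recon helper. The sketch below classifies a captured response from its headers and error body; the header names and error strings are illustrative assumptions about what providers and serving frameworks expose, so treat them as a starting point to extend per target, not a reliable fingerprint database.

```python
def classify_deployment(headers: dict[str, str], body: str = "") -> str:
    """Heuristic deployment fingerprint from a captured HTTP response.
    Header names and error substrings here are illustrative assumptions."""
    h = {k.lower(): v for k, v in headers.items()}
    # Provider-specific response headers suggest a direct API-based deployment.
    if "openai-organization" in h or "anthropic-version" in h:
        return "api-based (direct provider)"
    # Serving-framework names leaking into error bodies suggest self-hosting.
    if any(s in body.lower() for s in ("vllm", "text-generation-inference", "triton")):
        return "self-hosted"
    return "unknown"

print(classify_deployment({"anthropic-version": "2023-06-01"}))
print(classify_deployment({}, body="Internal error: vLLM engine crashed"))
```

Behavioral indicators from the table (offline operation, connectivity-dependent quality) still require active probing — no passive header check distinguishes edge from hybrid.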
Related Topics
- The AI Landscape — the broader ecosystem context for deployment decisions
- The AI API Ecosystem — deep dive into API-based deployment patterns
- Open vs Closed Models — how model availability affects deployment options
- AI System Architecture — system-level view of deployment architectures
References
- "MLOps: Continuous Delivery for Machine Learning" - Google (2024) - Best practices for deploying and operating ML systems in production environments
- "On-Device AI: Challenges and Opportunities" - Apple ML Research (2024) - Technical overview of deploying AI models on edge devices with resource constraints
- "vLLM: Efficient Memory Management for Large Language Model Serving" - Kwon et al. (2023) - The inference engine underlying many self-hosted LLM deployments
- "Securing AI Model Deployment: A Practitioner's Guide" - OWASP (2025) - Security considerations for each deployment pattern in AI applications
Review Question
Why is edge deployment considered the weakest deployment pattern from a security perspective?