AI API Reverse Engineering
Techniques for reverse engineering AI APIs including mapping undocumented endpoints, parameter discovery, rate limit profiling, and extracting implementation details from API behavior.
AI API reverse engineering is the process of systematically discovering the full capabilities, constraints, and implementation details of an AI service by analyzing its external behavior. Compared with traditional APIs, AI APIs present unique challenges and opportunities: model behavior leaks implementation details, error messages may reveal architecture, and rate-limiting patterns expose infrastructure decisions.
Endpoint Discovery
Systematic Endpoint Mapping
AI applications typically expose multiple endpoints beyond the primary chat or completion API. Discovering these reveals the full attack surface:
| Endpoint Category | Common Patterns | What They Reveal |
|---|---|---|
| Inference endpoints | /v1/chat/completions, /v1/completions, /predict, /generate | Primary AI functionality, model versions |
| Embedding endpoints | /v1/embeddings, /embed, /encode | Embedding model details, dimension sizes |
| Fine-tuning endpoints | /v1/fine-tuning/jobs, /train, /finetune | Fine-tuning capabilities, training data handling |
| Model management | /v1/models, /models/list, /deployments | Available models, versioning, deployment info |
| File/data management | /v1/files, /upload, /datasets | Data handling capabilities, storage details |
| Admin/internal | /admin, /internal, /debug, /health, /metrics | Infrastructure details, debugging information |
| Batch processing | /v1/batch, /bulk, /async | Batch capabilities, queue architecture |
Discovery Techniques
Documentation analysis
Start with public documentation, OpenAPI/Swagger specs, SDK source code, and changelog history. Even well-documented APIs often have undocumented features.
Sources to examine:
- Official API documentation and changelogs
- SDK source code (Python, JavaScript, etc.) -- often exposes endpoints not in public docs
- OpenAPI/Swagger specification files
- Developer forum posts and community discussions
- Archived versions of documentation (Wayback Machine)
Path enumeration
Systematically probe for common API path patterns using wordlists tailored to AI services.
AI-specific path patterns to test:
- /v1/, /v2/, /api/, /ml/, /ai/
- /completions, /chat, /generate, /predict, /classify
- /embeddings, /encode, /vectors
- /models, /deployments, /engines
- /fine-tune, /train, /finetune, /jobs
- /files, /upload, /documents, /datasets
- /admin, /internal, /debug, /metrics, /health
- /moderate, /filter, /safety, /content-filter
- /tokenize, /count-tokens, /usage
Version enumeration
Test for multiple API versions that may expose different capabilities or have different security controls.
Version patterns:
- /v1/..., /v2/..., /v3/...
- /api/v1/..., /api/v2/...
- /2023-01-01/..., /2024-01-01/... (date-versioned)
- /?api-version=2024-01-01 (query-parameter versioned)
Method enumeration
For discovered endpoints, test all HTTP methods (GET, POST, PUT, PATCH, DELETE, OPTIONS) to find additional functionality.
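The path, version, and method enumeration steps above can be sketched as a small probe script. The base URL is a placeholder and the wordlists are illustrative starting points drawn from the patterns listed earlier; a distinct status code for a path (401 or 405 rather than a uniform 404) suggests the endpoint exists:

```python
import itertools
import urllib.request
import urllib.error

# Wordlists drawn from the AI-specific patterns above (extend as needed).
VERSIONS = ["", "v1", "v2", "api/v1"]
RESOURCES = ["chat/completions", "completions", "embeddings", "models",
             "fine-tuning/jobs", "files", "admin", "debug", "health", "metrics"]
METHODS = ["GET", "POST", "OPTIONS"]

def candidate_paths(base_url):
    """Yield candidate endpoint URLs built from version and resource wordlists."""
    for version, resource in itertools.product(VERSIONS, RESOURCES):
        prefix = f"/{version}" if version else ""
        yield f"{base_url.rstrip('/')}{prefix}/{resource}"

def probe(url, method, timeout=5):
    """Return the HTTP status for a single probe, or None on connection failure."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 401/403/405 on an undocumented path often means it exists.
        return error.code
    except urllib.error.URLError:
        return None
```

Running `probe(path, method)` for every path/method pair and grouping results by status code quickly separates present-but-protected endpoints from true 404s.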
Parameter Discovery
Hidden Parameter Detection
AI APIs often accept parameters not listed in public documentation. Discovering these can reveal hidden capabilities, debugging features, or security-relevant configuration options.
| Discovery Method | Technique | What It Finds |
|---|---|---|
| Error message analysis | Send malformed requests and analyze error messages | Parameter names, valid values, type constraints |
| SDK source analysis | Read client library source code | All parameters the SDK supports, including undocumented ones |
| Header analysis | Examine response headers for configuration hints | Model version, server identity, feature flags |
| Fuzzing | Send requests with common parameter names | Hidden parameters that modify behavior |
| Differential analysis | Compare behavior with/without specific parameters | Parameters that change output but are not documented |
Common Hidden Parameters in AI APIs
| Parameter | Purpose | Security Relevance |
|---|---|---|
| raw_response / include_raw | Return unfiltered model output | May bypass content filtering |
| debug / verbose | Return debugging information | May expose internal architecture |
| model_version / engine | Select specific model version | May access older, less-secured versions |
| system_prompt / instructions | Override system-level instructions | Direct prompt injection vector |
| max_retries / retry_strategy | Control retry behavior | Resource exhaustion potential |
| cache / use_cache | Control response caching | May return other users' cached responses |
| internal / admin | Enable administrative features | Privilege escalation |
| filter_level / safety_level | Control content filtering strictness | May reduce safety controls |
| logprobs / log_probabilities | Return token-level probabilities | Information leakage about model internals |
| echo / echo_prompt | Echo the prompt in the response | May reveal processed/modified prompts |
Parameter Fuzzing Methodology
For each discovered endpoint:
1. Send a baseline request with documented parameters
2. Add one candidate parameter at a time
3. Compare response to baseline:
- Different status code → parameter recognized
- Different response body → parameter affects behavior
- Different response time → parameter triggers processing
- Specific error message → parameter exists but value is wrong
4. For recognized parameters, enumerate valid values:
- Boolean: true/false
- Numeric: boundary values (0, 1, -1, MAX_INT)
- String: common values, empty string, special characters
- Enum: infer from error messages
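The comparison signals in step 3 and the value lists in step 4 can be captured in a small classifier. This is a sketch: the response dicts are a simplified stand-in for real HTTP response objects, and the timing threshold is an assumed tuning knob:

```python
# Candidate values for step 4, by inferred parameter type (illustrative).
CANDIDATE_VALUES = {
    "boolean": [True, False],
    "numeric": [0, 1, -1, 2**31 - 1],
    "string": ["", "default", "none"],
}

def classify_parameter(baseline: dict, candidate: dict,
                       time_threshold: float = 0.5) -> str:
    """Classify a fuzzed parameter by comparing its response to the baseline.

    Each response is a dict with 'status', 'body', and 'elapsed' (seconds),
    checked in the order given by the methodology above.
    """
    if candidate["status"] != baseline["status"]:
        return "recognized (status changed)"
    if candidate["body"] != baseline["body"]:
        return "affects behavior (body changed)"
    if abs(candidate["elapsed"] - baseline["elapsed"]) > time_threshold:
        return "triggers processing (timing changed)"
    return "not recognized"
```

A fuzzing loop would send the baseline once, then one candidate parameter per request, and record any classification other than "not recognized" for follow-up value enumeration.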
Rate Limit Profiling
Why Rate Limits Matter for Red Teaming
Rate limits reveal infrastructure architecture and constrain attack feasibility:
| Rate Limit Information | What It Reveals | Red Team Implication |
|---|---|---|
| Requests per minute | API capacity planning, tier configuration | Constrains brute-force and enumeration speed |
| Tokens per minute | Model serving capacity | Constrains output extraction volume |
| Concurrent request limit | Infrastructure parallelism | Limits multi-threaded attacks |
| Per-key vs per-IP limiting | Authentication architecture | Rotating keys may bypass limits |
| Reset mechanism | Time-based vs sliding window | Timing-based evasion strategies |
| Burst allowance | Short-term capacity | Enables rapid initial enumeration |
Rate Limit Discovery Techniques
| Technique | How to Execute | What to Look For |
|---|---|---|
| Header analysis | Examine X-RateLimit-*, Retry-After headers | Limit values, remaining quota, reset timing |
| Incremental loading | Gradually increase request rate until limited | Exact threshold, throttling behavior |
| Burst testing | Send rapid concurrent requests | Burst allowance, queuing behavior |
| Multi-key testing | Use different API keys from same IP | Per-key vs per-account vs per-IP limiting |
| Multi-IP testing | Use different IPs with same API key | IP-based vs key-based enforcement |
| Endpoint comparison | Compare rate limits across different endpoints | Inconsistent enforcement, weaker endpoints |
| Time analysis | Track rate limit resets over time | Fixed window vs sliding window, reset timing |
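The header-analysis technique above can be sketched as a small parser. It assumes the common `X-RateLimit-*` and `Retry-After` naming convention; real providers vary, so the field mapping is a starting point to extend per target:

```python
def parse_rate_limit_headers(headers: dict) -> dict:
    """Extract rate-limit intelligence from response headers.

    Matches header names case-insensitively and returns only the
    fields that are present, parsed as integers.
    """
    lower = {name.lower(): value for name, value in headers.items()}
    fields = {
        "limit": "x-ratelimit-limit",
        "remaining": "x-ratelimit-remaining",
        "reset": "x-ratelimit-reset",
        "retry_after": "retry-after",
    }
    return {name: int(lower[header])
            for name, header in fields.items() if header in lower}
```

Logging this output across a sustained request series also supports the time-analysis technique: plotting `remaining` against wall-clock time distinguishes fixed-window resets from a sliding window.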
Rate Limit Bypass Techniques
| Technique | Method | Applicability |
|---|---|---|
| Key rotation | Use multiple API keys to distribute requests | When rate limiting is per-key |
| IP rotation | Distribute requests across multiple source IPs | When rate limiting includes IP component |
| Endpoint substitution | Use alternative endpoints for the same functionality | When different endpoints have different limits |
| Batch optimization | Maximize information per request to reduce request count | When rate limiting is request-based |
| Timing optimization | Time requests to maximize throughput within limits | When rate limiting uses fixed windows |
| Version targeting | Use older API versions with different rate limits | When rate limits differ by API version |
Implementation Detail Extraction
Error Message Analysis
AI API error messages frequently leak implementation details:
| Error Pattern | Information Leaked | Example |
|---|---|---|
| Model not found errors | Available model names and versions | "model 'gpt-4-turbo-preview' not found. Available: gpt-4o, gpt-4o-mini" |
| Validation errors | Parameter constraints and types | "temperature must be between 0 and 2" |
| Rate limit errors | Limit values and reset times | "Rate limit exceeded: 60 RPM. Reset in 42s" |
| Internal server errors | Backend technology stack | Stack traces revealing Python/Java versions, framework names |
| Token limit errors | Context window sizes | "Maximum context length is 128000 tokens" |
| Content filter errors | Filtering categories and thresholds | "Content filtered: category=violence, severity=high" |
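Error-message mining like the table above can be automated with a few regexes. The message formats here mirror the examples in the table and are illustrative; real providers phrase errors differently, so the patterns need per-target tuning:

```python
import re

# Patterns keyed to the leak categories in the table above (illustrative).
ERROR_PATTERNS = {
    "available_models": re.compile(r"Available:\s*([\w.,\- ]+)"),
    "rate_limit_rpm": re.compile(r"(\d+)\s*RPM"),
    "context_window": re.compile(r"context length is (\d+) tokens"),
    "filter_category": re.compile(r"category=(\w+)"),
}

def mine_error_message(message: str) -> dict:
    """Extract leaked implementation details from a single error message."""
    findings = {}
    for name, pattern in ERROR_PATTERNS.items():
        match = pattern.search(message)
        if match:
            findings[name] = match.group(1)
    return findings
```

Running every captured error body through this miner and aggregating the findings builds the error taxonomy section of the API map with no extra requests.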
Timing Analysis
Response timing reveals model and infrastructure characteristics:
| Timing Pattern | What It Indicates | Measurement Method |
|---|---|---|
| Consistent low latency | Cached responses or simple models | Repeat identical requests, measure variance |
| Latency proportional to output length | Autoregressive generation (token-by-token) | Vary max_tokens, correlate with response time |
| Bimodal latency distribution | Load balancing across different hardware | Large sample of requests, plot distribution |
| Sudden latency spikes | Queue saturation or cold starts | Sustained load testing |
| Consistent first-token latency | Prefill time reveals model size | Measure time-to-first-token with varying input lengths |
Response Header Intelligence
| Header | Information Revealed |
|---|---|
| Server | Web server technology |
| X-Request-ID | Request routing and infrastructure topology |
| X-Model-Version | Specific model version serving requests |
| X-Processing-Time | Backend processing time |
| X-Region | Geographic region of serving infrastructure |
| CF-Ray / X-Amz-* / X-Goog-* | Cloud provider identification |
| X-Content-Filter-* | Content filtering details |
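Provider identification from the header names above reduces to prefix matching. A minimal sketch; the fingerprint table covers only the prefixes mentioned in this section and is deliberately incomplete:

```python
# Header-name prefixes mapped to the infrastructure they indicate.
PROVIDER_FINGERPRINTS = {
    "cf-ray": "Cloudflare",
    "x-amz-": "AWS",
    "x-goog-": "Google Cloud",
}

def fingerprint_provider(headers: dict) -> list:
    """Identify likely infrastructure providers from response header names."""
    found = set()
    for name in headers:
        lower = name.lower()
        for prefix, provider in PROVIDER_FINGERPRINTS.items():
            if lower.startswith(prefix):
                found.add(provider)
    return sorted(found)
```

Seeing multiple providers at once is itself a finding: a CDN fingerprint plus a cloud fingerprint outlines the serving chain.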
Building API Maps
API Map Documentation
Compile all discovered information into a structured API map:
| Section | Content |
|---|---|
| Endpoint inventory | All discovered endpoints with methods, parameters, and authentication requirements |
| Authentication scheme | API key format, OAuth flows, session management, token types |
| Rate limit profile | Limits per endpoint, per key, per IP; reset mechanisms; burst allowances |
| Model inventory | Available models with versions, context windows, capabilities |
| Parameter catalog | All parameters (documented and undocumented) with types and valid values |
| Error taxonomy | Error codes, messages, and what they reveal |
| Infrastructure fingerprint | Cloud provider, CDN, load balancing, geographic distribution |
| Security controls | Content filtering, input validation, output sanitization |
| Timing profiles | Latency characteristics per endpoint and configuration |
From Reconnaissance to Exploitation
Each piece of API intelligence maps to potential exploitation paths:
| Discovery | Exploitation Opportunity |
|---|---|
| Undocumented admin endpoint | Privilege escalation, configuration modification |
| Hidden debug parameter | Information disclosure, filter bypass |
| Inconsistent rate limits | Resource exhaustion on weakly-limited endpoints |
| Older API version available | Attack deprecated security controls |
| Content filter categories in errors | Targeted filter evasion based on category boundaries |
| Model version in headers | Target known vulnerabilities in specific model versions |
| Cached response behavior | Cross-user data leakage through cache poisoning |
Defensive Awareness
Understanding these techniques also informs defensive recommendations:
| Reconnaissance Technique | Defensive Countermeasure |
|---|---|
| Path enumeration | Return consistent 404 for all non-existent paths (no behavioral difference) |
| Parameter fuzzing | Ignore unknown parameters silently (no error messages) |
| Error message analysis | Standardize error messages, strip implementation details |
| Timing analysis | Add random jitter to response times |
| Header analysis | Minimize information in response headers |
| Version enumeration | Deprecate and remove old API versions promptly |
| Rate limit profiling | Use consistent rate limiting across all endpoints |
Red teamers should include both the offensive findings and these defensive recommendations in their reports, helping organizations harden their API surfaces against future reconnaissance.