AI API Reverse Engineering
Techniques for reverse engineering AI APIs including mapping undocumented endpoints, parameter discovery, rate limit profiling, and extracting implementation details from API behavior.
AI API reverse engineering is the process of systematically discovering the full capabilities, constraints, and implementation details of an AI service by analyzing its external behavior. Compared with traditional APIs, AI APIs present unique challenges and opportunities: model behavior leaks implementation details, error messages may reveal architecture, and rate-limiting patterns expose infrastructure decisions.
Endpoint Discovery
Systematic Endpoint Mapping
AI applications typically expose multiple endpoints beyond the primary chat or completion API. Discovering these reveals the full attack surface:
| Endpoint Category | Common Patterns | What They Reveal |
|---|---|---|
| Inference endpoints | /v1/chat/completions, /v1/completions, /predict, /generate | Primary AI functionality, model versions |
| Embedding endpoints | /v1/embeddings, /embed, /encode | Embedding model details, dimension sizes |
| Fine-tuning endpoints | /v1/fine-tuning/jobs, /train, /finetune | Fine-tuning capabilities, training data handling |
| Model management | /v1/models, /models/list, /deployments | Available models, versioning, deployment info |
| File/data management | /v1/files, /upload, /datasets | Data handling capabilities, storage details |
| Admin/internal | /admin, /internal, /debug, /health, /metrics | Infrastructure details, debugging information |
| Batch processing | /v1/batch, /bulk, /async | Batch capabilities, queue architecture |
Discovery Techniques
Documentation analysis
Start with public documentation, OpenAPI/Swagger specs, SDK source code, and changelog history. Even well-documented APIs often have undocumented features.
Sources to examine:
- Official API documentation and changelogs
- SDK source code (Python, JavaScript, etc.) -- often exposes endpoints not in public docs
- OpenAPI/Swagger specification files
- Developer forum posts and community discussions
- Archived versions of documentation (Wayback Machine)
Path enumeration
Systematically probe for common API path patterns using wordlists tailored to AI services.
AI-specific path patterns to test:
- /v1/, /v2/, /api/, /ml/, /ai/
- /completions, /chat, /generate, /predict, /classify
- /embeddings, /encode, /vectors
- /models, /deployments, /engines
- /fine-tune, /train, /finetune, /jobs
- /files, /upload, /documents, /datasets
- /admin, /internal, /debug, /metrics, /health
- /moderate, /filter, /safety, /content-filter
- /tokenize, /count-tokens, /usage
Version enumeration
Test for multiple API versions that may expose different capabilities or have different security controls.
Version patterns:
- /v1/..., /v2/..., /v3/...
- /api/v1/..., /api/v2/...
- /2023-01-01/..., /2024-01-01/... (date-versioned)
- /?api-version=2024-01-01 (query-parameter versioned)
Method enumeration
For discovered endpoints, test all HTTP methods (GET, POST, PUT, PATCH, DELETE, OPTIONS) to find additional functionality.
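The path, version, and method enumeration steps above can be sketched as a small probe script. The base URL is a placeholder and the wordlists are illustrative starting points drawn from the patterns listed earlier; a distinct status code for a path (401 or 405 rather than a uniform 404) suggests the endpoint exists:

```python
import itertools
import urllib.request
import urllib.error

# Wordlists drawn from the AI-specific patterns above (extend as needed).
VERSIONS = ["", "v1", "v2", "api/v1"]
RESOURCES = ["chat/completions", "completions", "embeddings", "models",
             "fine-tuning/jobs", "files", "admin", "debug", "health", "metrics"]
METHODS = ["GET", "POST", "OPTIONS"]

def candidate_paths(base_url):
    """Yield candidate endpoint URLs built from version and resource wordlists."""
    for version, resource in itertools.product(VERSIONS, RESOURCES):
        prefix = f"/{version}" if version else ""
        yield f"{base_url.rstrip('/')}{prefix}/{resource}"

def probe(url, method, timeout=5):
    """Return the HTTP status for a single probe, or None on connection failure."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 401/403/405 on an undocumented path often means it exists.
        return error.code
    except urllib.error.URLError:
        return None
```

Running `probe(path, method)` for every path/method pair and grouping results by status code quickly separates present-but-protected endpoints from true 404s.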
Parameter Discovery
Hidden Parameter Detection
AI APIs often accept parameters not listed in public documentation. Discovering these can reveal hidden capabilities, debugging features, or security-relevant configuration options.
| Discovery Method | Technique | What It Finds |
|---|---|---|
| Error message analysis | Send malformed requests and analyze error messages | Parameter names, valid values, type constraints |
| SDK source analysis | Read client library source code | All parameters the SDK supports, including undocumented ones |
| Header analysis | Examine response headers for configuration hints | Model version, server identity, feature flags |
| Fuzzing | Send requests with common parameter names | Hidden parameters that modify behavior |
| Differential analysis | Compare behavior with/without specific parameters | Parameters that change output but are not documented |
Common Hidden Parameters in AI APIs
| Parameter | Purpose | Security Relevance |
|---|---|---|
| raw_response / include_raw | Return unfiltered model output | May bypass content filtering |
| debug / verbose | Return debugging information | May expose internal architecture |
| model_version / engine | Select specific model version | May access older, less-secured versions |
| system_prompt / instructions | Override system-level instructions | Direct prompt injection vector |
| max_retries / retry_strategy | Control retry behavior | Resource exhaustion potential |
| cache / use_cache | Control response caching | May return other users' cached responses |
| internal / admin | Enable administrative features | Privilege escalation |
| filter_level / safety_level | Control content filtering strictness | May reduce safety controls |
| logprobs / log_probabilities | Return token-level probabilities | Information leakage about model internals |
| echo / echo_prompt | Echo the prompt in the response | May reveal processed/modified prompts |
Parameter Fuzzing Methodology
For each discovered endpoint:
1. Send a baseline request with documented parameters
2. Add one candidate parameter at a time
3. Compare response to baseline:
- Different status code → parameter recognized
- Different response body → parameter affects behavior
- Different response time → parameter triggers processing
- Specific error message → parameter exists but value is wrong
4. For recognized parameters, enumerate valid values:
- Boolean: true/false
- Numeric: boundary values (0, 1, -1, MAX_INT)
- String: common values, empty string, special characters
- Enum: infer from error messages
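The comparison signals in step 3 and the value lists in step 4 can be captured in a small classifier. This is a sketch: the response dicts are a simplified stand-in for real HTTP response objects, and the timing threshold is an assumed tuning knob:

```python
# Candidate values for step 4, by inferred parameter type (illustrative).
CANDIDATE_VALUES = {
    "boolean": [True, False],
    "numeric": [0, 1, -1, 2**31 - 1],
    "string": ["", "default", "none"],
}

def classify_parameter(baseline: dict, candidate: dict,
                       time_threshold: float = 0.5) -> str:
    """Classify a fuzzed parameter by comparing its response to the baseline.

    Each response is a dict with 'status', 'body', and 'elapsed' (seconds),
    checked in the order given by the methodology above.
    """
    if candidate["status"] != baseline["status"]:
        return "recognized (status changed)"
    if candidate["body"] != baseline["body"]:
        return "affects behavior (body changed)"
    if abs(candidate["elapsed"] - baseline["elapsed"]) > time_threshold:
        return "triggers processing (timing changed)"
    return "not recognized"
```

A fuzzing loop would send the baseline once, then one candidate parameter per request, and record any classification other than "not recognized" for follow-up value enumeration.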
Rate Limit Profiling
Why Rate Limits Matter for Red Teaming
Rate limits reveal infrastructure architecture and constrain attack feasibility:
| Rate Limit Information | What It Reveals | Red Team Implication |
|---|---|---|
| Requests per minute | API capacity planning, tier configuration | Constrains brute-force and enumeration speed |
| Tokens per minute | Model serving capacity | Constrains output extraction volume |
| Concurrent request limit | Infrastructure parallelism | Limits multi-threaded attacks |
| Per-key vs per-IP limiting | Authentication architecture | Rotating keys may bypass limits |
| Reset mechanism | Time-based vs sliding window | Timing-based evasion strategies |
| Burst allowance | Short-term capacity | Enables rapid initial enumeration |
Rate Limit Discovery Techniques
| Technique | How to Execute | What to Look For |
|---|---|---|
| Header analysis | Examine X-RateLimit-*, Retry-After headers | Limit values, remaining quota, reset timing |
| Incremental loading | Gradually increase request rate until limited | Exact threshold, throttling behavior |
| Burst testing | Send rapid concurrent requests | Burst allowance, queuing behavior |
| Multi-key testing | Use different API keys from same IP | Per-key vs per-account vs per-IP limiting |
| Multi-IP testing | Use different IPs with same API key | IP-based vs key-based enforcement |
| Endpoint comparison | Compare rate limits across different endpoints | Inconsistent enforcement, weaker endpoints |
| Time analysis | Track rate limit resets over time | Fixed window vs sliding window, reset timing |
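The header-analysis technique above can be sketched as a small parser. It assumes the common `X-RateLimit-*` and `Retry-After` naming convention; real providers vary, so the field mapping is a starting point to extend per target:

```python
def parse_rate_limit_headers(headers: dict) -> dict:
    """Extract rate-limit intelligence from response headers.

    Matches header names case-insensitively and returns only the
    fields that are present, parsed as integers.
    """
    lower = {name.lower(): value for name, value in headers.items()}
    fields = {
        "limit": "x-ratelimit-limit",
        "remaining": "x-ratelimit-remaining",
        "reset": "x-ratelimit-reset",
        "retry_after": "retry-after",
    }
    return {name: int(lower[header])
            for name, header in fields.items() if header in lower}
```

Logging this output across a sustained request series also supports the time-analysis technique: plotting `remaining` against wall-clock time distinguishes fixed-window resets from a sliding window.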
Rate Limit Bypass Techniques
| Technique | Method | Applicability |
|---|---|---|
| Key rotation | Use multiple API keys to distribute requests | When rate limiting is per-key |
| IP rotation | Distribute requests across multiple source IPs | When rate limiting includes IP component |
| Endpoint substitution | Use alternative endpoints for the same functionality | When different endpoints have different limits |
| Batch optimization | Maximize information per request to reduce request count | When rate limiting is request-based |
| Timing optimization | Time requests to maximize throughput within limits | When rate limiting uses fixed windows |
| Version targeting | Use older API versions with different rate limits | When rate limits differ by API version |
Implementation Detail Extraction
Error Message Analysis
AI API error messages frequently leak implementation details:
| Error Pattern | Information Leaked | Example |
|---|---|---|
| Model not found errors | Available model names and versions | "model 'gpt-4-turbo-preview' not found. Available: gpt-4o, gpt-4o-mini" |
| Validation errors | Parameter constraints and types | "temperature must be between 0 and 2" |
| Rate limit errors | Limit values and reset times | "Rate limit exceeded: 60 RPM. Reset in 42s" |
| Internal server errors | Backend technology stack | Stack traces revealing Python/Java versions, framework names |
| Token limit errors | Context window sizes | "Maximum context length is 128000 tokens" |
| Content filter errors | Filtering categories and thresholds | "Content filtered: category=violence, severity=high" |
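Error-message mining like the table above can be automated with a few regexes. The message formats here mirror the examples in the table and are illustrative; real providers phrase errors differently, so the patterns need per-target tuning:

```python
import re

# Patterns keyed to the leak categories in the table above (illustrative).
ERROR_PATTERNS = {
    "available_models": re.compile(r"Available:\s*([\w.,\- ]+)"),
    "rate_limit_rpm": re.compile(r"(\d+)\s*RPM"),
    "context_window": re.compile(r"context length is (\d+) tokens"),
    "filter_category": re.compile(r"category=(\w+)"),
}

def mine_error_message(message: str) -> dict:
    """Extract leaked implementation details from a single error message."""
    findings = {}
    for name, pattern in ERROR_PATTERNS.items():
        match = pattern.search(message)
        if match:
            findings[name] = match.group(1)
    return findings
```

Running every captured error body through this miner and aggregating the findings builds the error taxonomy section of the API map with no extra requests.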
Timing Analysis
Response timing reveals model and infrastructure characteristics:
| Timing Pattern | What It Indicates | Measurement Method |
|---|---|---|
| Consistent low latency | Cached responses or simple models | Repeat identical requests, measure variance |
| Latency proportional to output length | Autoregressive generation (token-by-token) | Vary max_tokens, correlate with response time |
| Bimodal latency distribution | Load balancing across different hardware | Large sample of requests, plot distribution |
| Sudden latency spikes | Queue saturation or cold starts | Sustained load testing |
| Consistent first-token latency | Prefill time reveals model size | Measure time-to-first-token with varying input lengths |
Response Header Intelligence
| Header | Information Revealed |
|---|---|
| Server | Web server technology |
| X-Request-ID | Request routing and infrastructure topology |
| X-Model-Version | Specific model version serving requests |
| X-Processing-Time | Backend processing time |
| X-Region | Geographic region of serving infrastructure |
| CF-Ray / X-Amz-* / X-Goog-* | Cloud provider identification |
| X-Content-Filter-* | Content filtering details |
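Provider identification from the header names above reduces to prefix matching. A minimal sketch; the fingerprint table covers only the prefixes mentioned in this section and is deliberately incomplete:

```python
# Header-name prefixes mapped to the infrastructure they indicate.
PROVIDER_FINGERPRINTS = {
    "cf-ray": "Cloudflare",
    "x-amz-": "AWS",
    "x-goog-": "Google Cloud",
}

def fingerprint_provider(headers: dict) -> list:
    """Identify likely infrastructure providers from response header names."""
    found = set()
    for name in headers:
        lower = name.lower()
        for prefix, provider in PROVIDER_FINGERPRINTS.items():
            if lower.startswith(prefix):
                found.add(provider)
    return sorted(found)
```

Seeing multiple providers at once is itself a finding: a CDN fingerprint plus a cloud fingerprint outlines the serving chain.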
Building API Maps
API Map Documentation
Compile all discovered information into a structured API map:
| Section | Content |
|---|---|
| Endpoint inventory | All discovered endpoints with methods, parameters, and authentication requirements |
| Authentication scheme | API key format, OAuth flows, session management, token types |
| Rate limit profile | Limits per endpoint, per key, per IP; reset mechanisms; burst allowances |
| Model inventory | Available models with versions, context windows, capabilities |
| Parameter catalog | All parameters (documented and undocumented) with types and valid values |
| Error taxonomy | Error codes, messages, and what they reveal |
| Infrastructure fingerprint | Cloud provider, CDN, load balancing, geographic distribution |
| Security controls | Content filtering, input validation, output sanitization |
| Timing profiles | Latency characteristics per endpoint and configuration |
From Reconnaissance to Exploitation
Each piece of API intelligence maps to potential exploitation paths:
| Discovery | Exploitation Opportunity |
|---|---|
| Undocumented admin endpoint | Privilege escalation, configuration modification |
| Hidden debug parameter | Information disclosure, filter bypass |
| Inconsistent rate limits | Resource exhaustion on weakly-limited endpoints |
| Older API version available | Attack deprecated security controls |
| Content filter categories in errors | Targeted filter evasion based on category boundaries |
| Model version in headers | Target known vulnerabilities in specific model versions |
| Cached response behavior | Cross-user data leakage through cache poisoning |
Defensive Awareness
Understanding these techniques also informs defensive recommendations:
| Reconnaissance Technique | Defensive Countermeasure |
|---|---|
| Path enumeration | Return consistent 404 for all non-existent paths (no behavioral difference) |
| Parameter fuzzing | Ignore unknown parameters silently (no error messages) |
| Error message analysis | Standardize error messages, strip implementation details |
| Timing analysis | Add random jitter to response times |
| Header analysis | Minimize information in response headers |
| Version enumeration | Deprecate and remove old API versions promptly |
| Rate limit profiling | Use consistent rate limiting across all endpoints |
Red teamers should include both the offensive findings and these defensive recommendations in their reports, helping organizations harden their API surfaces against future reconnaissance.