Feature Store Access Control
Access control strategies for feature stores: feature-level permissions, cross-team data leakage prevention, PII protection in features, service account management, and implementing least-privilege access for ML feature infrastructure.
Feature stores are designed for sharing. Their primary value proposition is that features computed by one team can be consumed by another, eliminating redundant data engineering. This sharing model creates a fundamental tension with access control: the more features are shared, the greater the risk of unauthorized data access, PII exposure, and cross-team data leakage. Getting the access control model right determines whether a feature store is a productivity tool or a compliance liability.
Access Control Dimensions
Feature stores require access control across multiple dimensions, each with different granularity needs:
Who Can Read Features
| Consumer | Legitimate Access | Risk If Overly Broad |
|---|---|---|
| Training pipelines | Read offline store for training data | Unauthorized access to sensitive feature history |
| Inference services | Read online store for predictions | Real-time access to user-level data |
| Data scientists | Explore features for model development | PII exposure in development environments |
| Analytics teams | Aggregate feature statistics | Individual-level data accessed through aggregation |
| External partners | Shared features for joint models | Cross-organization data leakage |
Who Can Write Features
| Producer | Legitimate Access | Risk If Overly Broad |
|---|---|---|
| Feature pipelines | Write computed features to stores | Unauthorized data injection or poisoning |
| Materialization jobs | Sync offline to online store | Inconsistency attacks if compromised |
| Backfill jobs | Populate historical features | Historical data modification |
| Admin operations | Schema changes, corrections | Broad data modification capability |
Who Can Define Features
| Actor | Legitimate Access | Risk If Overly Broad |
|---|---|---|
| Feature engineers | Create and modify feature definitions | Unauthorized feature creation exposing sensitive data |
| Platform admins | Manage feature store infrastructure | Full data access through admin privileges |
| ML engineers | Request new features | Indirect access to data through feature requests |
Feature-Level Permissions
The Granularity Problem
Most feature stores implement access control at the project or namespace level, not at the individual feature level. This means:
- A user with access to a project can read ALL features in that project
- Sensitive and non-sensitive features in the same project share access controls
- Moving sensitive features to a separate project fragments the feature store's value
Implementing Feature-Level Access
Classify features by sensitivity
Assign sensitivity levels to each feature based on its data source and content:
| Sensitivity | Examples | Access Policy |
|---|---|---|
| Public | Product category, day of week | Any authenticated user |
| Internal | Aggregate user counts, model scores | Team members only |
| Confidential | User demographics, transaction amounts | Specific role holders |
| Restricted | SSN-derived features, health indicators | Approved use cases only |
Map features to access groups
Create access groups that correspond to legitimate use cases rather than organizational hierarchy. A "fraud detection" access group needs transaction features and behavioral features but not demographic features.
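This use-case-scoped model can be sketched as a simple allowlist lookup. The group and feature names below are illustrative, not from any real deployment; a production policy engine would load these mappings from configuration rather than hardcode them:

```python
# Illustrative use-case access groups; names are hypothetical.
ACCESS_GROUPS = {
    "fraud-detection": {
        "txn_amount_7d_avg",
        "txn_count_24h",
        "login_velocity",
    },
    "recommendations": {
        "item_view_count_30d",
        "category_affinity",
    },
}


def can_read(group: str, feature: str) -> bool:
    """A requester may read a feature only if their use-case
    group explicitly lists it. Unknown groups get nothing."""
    return feature in ACCESS_GROUPS.get(group, set())
```

Note that the fraud-detection group gets transaction and behavioral features but no demographic features, mirroring the principle above: groups follow use cases, not org charts.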
Implement proxy-based access control
Because most feature stores lack native feature-level permissions, implement access control at the API layer:
```python
from typing import Any, Dict, List


class AccessDenied(Exception):
    """Raised when a requester lacks access to one or more features."""


class FeatureAccessProxy:
    """Proxy that enforces feature-level access control in front of
    a feature store that lacks native support."""

    def __init__(self, feature_store, policy_engine):
        self.store = feature_store
        self.policy = policy_engine

    def get_features(
        self,
        entity_id: str,
        feature_names: List[str],
        requester: str,
    ) -> Dict[str, Any]:
        # Check access for each requested feature
        allowed_features = []
        denied_features = []
        for feature in feature_names:
            if self.policy.check_access(requester, feature):
                allowed_features.append(feature)
            else:
                denied_features.append(feature)

        if denied_features:
            self.policy.log_access_denial(requester, denied_features)
            raise AccessDenied(
                f"Access denied to features: {denied_features}"
            )

        return self.store.get_features(entity_id, allowed_features)
```
Audit feature access patterns
Log every feature access with the requester identity, features accessed, entity IDs queried, and timestamp. Review these logs for anomalous patterns.
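One minimal way to capture these fields is a structured, append-only log. The sketch below assumes a JSON-lines format; the field names mirror the list above (requester, features, entity IDs, timestamp) and the audit sink itself is left abstract:

```python
import json
import time
from typing import List


def log_feature_access(
    requester: str,
    features: List[str],
    entity_ids: List[str],
    decision: str,
) -> str:
    """Build one JSON-lines audit record per access attempt, carrying
    the requester identity, features, entities, and timestamp."""
    record = {
        "ts": time.time(),
        "requester": requester,
        "features": features,
        "entity_ids": entity_ids,
        "decision": decision,  # "allow" or "deny"
    }
    # In practice this line would be appended to a tamper-evident
    # audit sink, not just returned.
    return json.dumps(record, sort_keys=True)
```

Structured records make the anomaly review mentioned above tractable: a query over the log can surface, say, a service account that suddenly reads features outside its historical set.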
Cross-Team Data Leakage
How Leakage Happens
Feature stores are designed to break down data silos, but this creates leakage vectors:
Direct feature access. Team A creates features from sensitive data. Team B discovers and uses these features for their models. Team B now has access to data they were not authorized to see, albeit in transformed form.
Feature composition. Team B creates a derived feature that combines Team A's sensitive feature with other data. The derived feature inherits the sensitivity of its inputs, but the feature store does not track this propagation.
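The inheritance rule described here can be enforced mechanically with a max-over-inputs propagation. This sketch assumes the four sensitivity levels from the classification table earlier and that the lineage system can supply each derived feature's input levels:

```python
# Ordered least to most sensitive, mirroring the classification table.
LEVELS = ["public", "internal", "confidential", "restricted"]


def derived_sensitivity(input_levels: list) -> str:
    """A derived feature is at least as sensitive as its most
    sensitive input (max-over-inputs propagation)."""
    if not input_levels:
        raise ValueError("derived feature must declare its inputs")
    return max(input_levels, key=LEVELS.index)
```

Running this at feature-registration time closes the gap the text identifies: the store, not the feature author, decides the derived feature's classification.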
Feature discovery. The feature registry allows teams to browse available features. Even feature names and descriptions can reveal sensitive information: a feature called customer_churn_risk_score reveals that the organization is tracking churn risk.
Training data reconstruction. A model trained on features from the feature store may memorize and expose feature values through prediction API probing. The feature store's access controls do not extend to the model's predictions.
Prevention Strategies
| Strategy | What It Prevents | Limitation |
|---|---|---|
| Feature namespacing | Direct cross-team access | Does not prevent authorized sharing that leaks data |
| Sensitivity tagging | Inadvertent use of sensitive features | Requires accurate classification |
| Approval workflows | Unauthorized feature consumption | Can become a bottleneck |
| Feature masking | PII exposure in non-production | Adds complexity to development workflow |
| Lineage tracking | Unknown sensitivity propagation | Requires comprehensive lineage infrastructure |
PII in Features
Where PII Appears
PII enters the feature store through multiple paths:
| Path | Examples | Risk |
|---|---|---|
| Direct features | Name, email, SSN, date of birth | Obvious PII; should be caught by classification |
| Derived features | Age calculated from DOB, zip code from address | PII-derived; sensitivity inherited from source |
| Behavioral features | Browsing history, purchase patterns, location traces | Behavioral data that identifies individuals |
| Embeddings | Text embeddings of user messages, profile embeddings | PII encoded in vector representations; extractable |
| Aggregate features | Average spend in zip code with < 5 residents | Small-group aggregates that can identify individuals |
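The small-group aggregate risk in the last row is commonly mitigated with a minimum-group-size threshold (a k-anonymity-style rule). The sketch below suppresses any aggregate over fewer than k members; k = 5 matches the example in the table but is an illustrative choice:

```python
from typing import List, Optional


def safe_aggregate(values: List[float], k: int = 5) -> Optional[float]:
    """Release a group average only when the group has at least k
    members; otherwise suppress it to avoid identifying individuals."""
    if len(values) < k:
        return None  # suppressed: group too small
    return sum(values) / len(values)
```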
Embedding PII Risk
Embeddings do not anonymize their inputs. Inversion and membership-inference attacks can recover attributes of the original text or profile from the vector, so embedding features should carry the sensitivity classification of their source data, not a lower one.
PII Protection Strategies
Feature masking. Replace PII feature values with masked versions in non-production environments. Production models that need PII features access them through a separate, audited path.
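A minimal sketch of environment-conditional masking, assuming the environment name is available to the serving layer. The mask preserves value length so downstream schemas and pipelines still run, while the content is removed outside production:

```python
def mask_value(value: str, environment: str) -> str:
    """Return the real value only in production; elsewhere return a
    fixed-character mask of the same length (minimum 4 characters)."""
    if environment == "production":
        return value
    return "*" * max(len(value), 4)
```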
Differential privacy. Add calibrated noise to features during computation. The noise preserves statistical properties for model training while preventing identification of individuals.
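For numeric features, the standard mechanism is Laplace noise with scale sensitivity/epsilon. This sketch samples the noise as the difference of two exponentials (which is Laplace-distributed); choosing the sensitivity bound and the privacy budget epsilon for a real feature is the hard part and is out of scope here:

```python
import random


def dp_noisy_value(true_value: float, sensitivity: float,
                   epsilon: float) -> float:
    """Add Laplace(0, sensitivity/epsilon) noise, the standard
    mechanism for epsilon-differential privacy on numeric values."""
    scale = sensitivity / epsilon
    # The difference of two iid Exp(1/scale) draws is Laplace(0, scale).
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_value + noise
```

Averaged over many entities, the noise cancels out, which is what preserves the statistical properties the text mentions while protecting any single individual.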
Tokenization. Replace PII values with tokens (pseudonymization). The token-to-PII mapping is stored in a separate, access-controlled system.
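A common implementation is keyed hashing: deterministic, so the same PII value always maps to the same token (joins on the token still work), but unrecoverable without the key held in the separate system. The `tok_` prefix and 16-hex-digit truncation below are illustrative choices:

```python
import hashlib
import hmac


def tokenize(pii_value: str, secret_key: bytes) -> str:
    """Deterministic pseudonymization via HMAC-SHA256. The key (and any
    token-to-PII mapping) lives in a separate, access-controlled system."""
    digest = hmac.new(secret_key, pii_value.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"
```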
Feature-level encryption. Encrypt sensitive feature values at rest and decrypt only in the inference path. Development and analytics access see encrypted values.
Service Account Management
The Service Account Problem
Feature stores interact with many components through service accounts:
| Service Account | Used By | Access Needed | Common Over-Permissioning |
|---|---|---|---|
| Materialization SA | Sync pipeline | Read offline, write online | Full read/write to both stores |
| Training SA | Training pipeline | Read offline store | Read access to all features including unneeded ones |
| Inference SA | Serving infrastructure | Read online store | Access to all entities, not just those being served |
| Backfill SA | Data engineering | Write offline store | Write access to all features and time ranges |
| Admin SA | Operations | Manage schemas and access | Full admin access to everything |
Least-Privilege for Service Accounts
Enumerate all service accounts
Identify every service account that interacts with the feature store. Include CI/CD pipelines, scheduled jobs, and interactive access.
Map required permissions
For each service account, determine the minimum set of permissions required for its function. A training pipeline for fraud detection does not need access to recommendation features.
Implement scoped credentials
Create separate credentials for each use case. Use short-lived tokens (OIDC) where possible instead of long-lived API keys.
Monitor for permission drift
Regularly audit service account permissions against their documented requirements. Permissions tend to accumulate over time as new use cases are added without removing old access.
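The drift audit reduces to a set difference between documented and live permissions per account. The account and permission names below are hypothetical; the point is that any grant absent from the documented baseline is flagged for review:

```python
from typing import Dict, Set


def permission_drift(
    documented: Dict[str, Set[str]],
    actual: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per service account, permissions granted in the live
    system but absent from the documented requirements."""
    drift = {}
    for account, granted in actual.items():
        extra = granted - documented.get(account, set())
        if extra:
            drift[account] = extra
    return drift
```

Running this on a schedule turns "permissions tend to accumulate" from an observation into an alert.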
Audit and Compliance
What to Log
| Event | Details to Capture | Retention |
|---|---|---|
| Feature read | Requester, features, entity IDs, timestamp | 90 days minimum |
| Feature write | Writer, features, values, timestamp | 1 year minimum |
| Schema change | Actor, change details, before/after | Indefinite |
| Access grant/revoke | Admin, target, permissions, timestamp | Indefinite |
| Access denial | Requester, denied features, reason | 90 days minimum |
Compliance Mapping
| Regulation | Feature Store Requirement |
|---|---|
| GDPR | Right to deletion includes features; data minimization; purpose limitation |
| CCPA | Feature data inventory; access request fulfillment; opt-out support |
| HIPAA | PHI features require BAA coverage; minimum necessary access; audit trails |
| SOC 2 | Access controls documented and tested; monitoring and alerting |
References
- Feast Access Control -- Feast permission model documentation
- GDPR and ML Systems -- UK ICO guidance on data protection in AI
- NIST Privacy Framework -- Privacy risk management
Discussion Question
Team A creates a feature called user_spending_embedding that encodes user purchase history as a 768-dimensional vector. Team B discovers this feature in the feature store registry and uses it in their recommendation model. What security and compliance concerns does this raise?