Counterfit Walkthrough
Complete walkthrough of Microsoft's Counterfit adversarial ML testing framework: installation, target configuration, running attacks against ML models, interpreting results, and automating adversarial robustness assessments.
Counterfit is Microsoft's open-source tool for assessing the security of machine learning models. Originally developed by Microsoft's AI Red Team, it provides a command-line interface for launching adversarial attacks against ML models, similar to how Metasploit provides a framework for network penetration testing. Counterfit abstracts away the complexity of adversarial ML research papers into runnable attacks that security professionals can execute without deep ML expertise.
Counterfit wraps attack algorithms from the Adversarial Robustness Toolbox (ART), TextAttack, and other libraries into a unified interface with consistent target definition, attack configuration, and result analysis.
Step 1: Installation
# Clone the Counterfit repository
git clone https://github.com/Azure/counterfit.git
cd counterfit
# Create a virtual environment
python3 -m venv counterfit-env
source counterfit-env/bin/activate
# Install Counterfit and dependencies
pip install -e .
# Launch the Counterfit CLI
counterfit
After installation, you should see the Counterfit interactive shell:
__ _____ __
_________ __ ______ / /____ _____/ __(_) /_
/ ___/ __ \/ / / / __ \/ __/ _ \/ ___/ /_/ / __/
/ /__/ /_/ / /_/ / / / / /_/ __/ / / __/ / /_
\___/\____/\__,_/_/ /_/\__/\___/_/ /_/ /_/\__/
counterfit>
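If the shell fails to start, a common cause is a missing or partially installed attack framework. A quick, dependency-free check of whether the libraries Counterfit wraps are importable (the helper is ours, not part of Counterfit):

```python
import importlib.util

def framework_status(names=("art", "textattack")):
    """Return {module_name: importable?} for the attack libraries
    Counterfit wraps. Illustrative helper, not part of Counterfit."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

print(framework_status())
```

Any `False` entry points at a framework whose attacks will be unavailable until its dependencies are reinstalled.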
Step 2: Understanding Counterfit Architecture
Counterfit organizes its functionality around three concepts:
Targets — The ML models you want to test. Each target wraps a model endpoint (local or remote) with metadata about input format, output format, and classification labels.
Attacks — Adversarial algorithms that modify inputs to cause model misclassification or other undesired behavior. Attacks come from ART, TextAttack, and custom implementations.
Frameworks — The underlying attack libraries. Counterfit manages framework compatibility and translates between its unified interface and framework-specific APIs.
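To make the relationship between these concepts concrete, here is a deliberately tiny sketch (not Counterfit's real API, and no real attack algorithm): a target wraps a model's predict function, and an attack perturbs an input until the target's label flips.

```python
# Toy illustration of Counterfit's target/attack split -- NOT the real API.
class ToyTarget:
    """Wraps a model: knows only how to score an input."""
    def predict(self, x: float) -> str:
        return "positive" if x > 0.5 else "negative"

class ToyAttack:
    """Perturbs an input until the target's prediction flips."""
    def __init__(self, step: float = 0.1, max_iters: int = 20):
        self.step, self.max_iters = step, max_iters

    def run(self, target: ToyTarget, x: float):
        original = target.predict(x)
        adv = x
        for _ in range(self.max_iters):
            adv -= self.step  # nudge the input toward the decision boundary
            if target.predict(adv) != original:
                return adv, True  # adversarial example found
        return adv, False

adv, flipped = ToyAttack().run(ToyTarget(), x=0.7)
# flipped is True: the prediction changed from "positive" to "negative"
```

In real Counterfit, these roles are filled by target classes, attack wrappers, and the ART/TextAttack backends that supply the actual perturbation logic.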
counterfit> list targets
┌──────────────────────┬────────────┬──────────────┐
│ Target Name │ Data Type │ Endpoint │
├──────────────────────┼────────────┼──────────────┤
│ creditfraud │ tabular │ local │
│ digits_keras │ image │ local │
│ movie_reviews │ text │ local │
│ satellite │ image │ local │
└──────────────────────┴────────────┴──────────────┘
Step 3: Defining a Custom Target
To test your own model, create a target definition. Counterfit targets are Python classes that inherit from the appropriate base class.
# targets/my_classifier/my_classifier.py
"""
Custom Counterfit target for a sentiment classifier API.
"""
import numpy as np
import requests
from counterfit.core.targets import CFTarget
class MyClassifier(CFTarget):
    target_name = "my_classifier"
    target_data_type = "text"
    target_endpoint = "https://api.example.com/classify"
    target_input_shape = (1,)
    target_output_classes = ["negative", "neutral", "positive"]
    target_classifier = "closed-box"
    X = [
        "This product is excellent, highly recommended.",
        "The service was terrible, never going back.",
        "It was an okay experience, nothing special.",
    ]

    def load(self):
        """Initialize any required state."""
        self.api_key = self.target_config.get("api_key", "")

    def predict(self, x):
        """
        Send input to the model and return class probabilities.
        Counterfit expects a numpy array of shape
        (n_samples, n_classes).
        """
        results = []
        for text in x:
            response = requests.post(
                self.target_endpoint,
                json={"text": str(text)},
                headers={"Authorization": f"Bearer {self.api_key}"},
                timeout=30,  # avoid hanging the shell on a stalled endpoint
            )
            probs = response.json()["probabilities"]
            results.append(probs)
        return np.array(results)

With the target file in place, relaunch Counterfit so it picks up the new target, then select it:
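Before moving into the shell, it is worth sanity-checking that predict returns what Counterfit expects: an (n_samples, n_classes) matrix of probabilities. A dependency-free checker (the helper name is ours, not Counterfit's):

```python
def check_prediction_output(probs, n_samples, n_classes, tol=1e-3):
    """Validate an (n_samples, n_classes) probability matrix,
    e.g. the rows a target's predict() method returns."""
    if len(probs) != n_samples:
        return False
    for row in probs:
        if len(row) != n_classes:
            return False
        if any(p < 0 or p > 1 for p in row):
            return False
        if abs(sum(row) - 1.0) > tol:  # each row should sum to ~1
            return False
    return True
```

Malformed shapes or unnormalized scores here are the most common reason attacks fail with cryptic framework errors later.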
counterfit> interact my_classifier
my_classifier>
Step 4: Running Attacks
Listing Available Attacks
my_classifier> list attacks
┌─────────────────────────┬───────────┬────────────────┐
│ Attack Name │ Data Type │ Framework │
├─────────────────────────┼───────────┼────────────────┤
│ hop_skip_jump │ image │ art │
│ deepfool │ image │ art │
│ textfooler │ text │ textattack │
│ bae │ text │ textattack │
│ pwws │ text │ textattack │
│ textbugger │ text │ textattack │
│ input_filter │ text │ textattack │
│ a2t │ text │ textattack │
│ ... (many more) │ │ │
└─────────────────────────┴───────────┴────────────────┘
Running a Text Attack
# Select a text attack
my_classifier> use textfooler
# Configure the attack
my_classifier>textfooler> set --sample_index 0
my_classifier>textfooler> show options
# Run the attack
my_classifier>textfooler> run
# View results
my_classifier>textfooler> show results
Interpreting Results
Attack Results:
┌──────────────────────────────────────┬────────────┐
│ Metric                               │ Value      │
├──────────────────────────────────────┼────────────┤
│ Original prediction                  │ positive   │
│ Adversarial prediction               │ negative   │
│ Original confidence                  │ 0.94       │
│ Adversarial confidence               │ 0.78       │
│ Word substitutions                   │ 2          │
│ Semantic similarity                  │ 0.89       │
│ Attack success                       │ True       │
└──────────────────────────────────────┴────────────┘
Original: "This product is excellent, highly recommended."
Adversarial: "This product is superb, deeply recommended."
The attack succeeded with only two word substitutions while maintaining 89% semantic similarity: a human would read both sentences as having the same meaning, but the model classifies them differently.
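The substitution count can be reproduced with a simple token-by-token comparison. The similarity function below is only a character-level stand-in via difflib; TextFooler's reported semantic similarity comes from sentence embeddings, so do not expect this proxy to reproduce the 0.89 figure.

```python
import difflib

def word_substitutions(original: str, adversarial: str) -> int:
    """Count position-wise token differences between two sentences,
    which is how word-swap attacks like TextFooler perturb text."""
    orig_tokens = original.lower().split()
    adv_tokens = adversarial.lower().split()
    return sum(a != b for a, b in zip(orig_tokens, adv_tokens))

def surface_similarity(original: str, adversarial: str) -> float:
    """Character-level similarity in [0, 1]; a rough proxy only."""
    return difflib.SequenceMatcher(None, original, adversarial).ratio()

orig = "This product is excellent, highly recommended."
adv = "This product is superb, deeply recommended."
```

Checks like these are useful when triaging results in bulk: a "successful" attack with many substitutions or low similarity is often not a meaningful finding.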
Step 5: Systematic Robustness Assessment
Run multiple attacks across your entire test set to get a comprehensive robustness assessment.
# scripts/robustness_assessment.py
"""
Automated robustness assessment using Counterfit.

NOTE: programmatic API names (Counterfit(), load_target, load_attack)
vary across Counterfit versions; adapt these calls to your install.
"""
from counterfit.core import Counterfit


def run_assessment(target_name: str, attacks: list,
                   sample_indices: list) -> dict:
    """
    Run a systematic robustness assessment.
    """
    cf = Counterfit()
    target = cf.load_target(target_name)
    results = {
        "target": target_name,
        "total_samples": len(sample_indices),
        "attacks": {},
    }
    for attack_name in attacks:
        attack_results = {
            "success_count": 0,
            "total_attempts": 0,
            "samples": [],
        }
        for idx in sample_indices:
            try:
                attack = cf.load_attack(attack_name)
                attack.set_params(sample_index=idx)
                result = attack.run()
                attack_results["total_attempts"] += 1
                if result.success:
                    attack_results["success_count"] += 1
                    attack_results["samples"].append({
                        "index": idx,
                        "original": result.original_input,
                        "adversarial": result.adversarial_input,
                        "original_pred": result.original_prediction,
                        "adversarial_pred": result.adversarial_prediction,
                    })
            except Exception as e:
                print(f"Attack {attack_name} failed on sample "
                      f"{idx}: {e}")
        if attack_results["total_attempts"] > 0:
            attack_results["success_rate"] = (
                attack_results["success_count"] /
                attack_results["total_attempts"]
            )
        results["attacks"][attack_name] = attack_results
    return results


def generate_report(results: dict) -> str:
    """Generate a human-readable robustness report."""
    lines = [
        f"# Robustness Assessment: {results['target']}",
        f"Total samples tested: {results['total_samples']}",
        "",
        "## Attack Results",
        "",
    ]
    for attack, data in results["attacks"].items():
        success_rate = data.get("success_rate", 0)
        risk_level = ("CRITICAL" if success_rate > 0.5
                      else "HIGH" if success_rate > 0.2
                      else "MEDIUM" if success_rate > 0.05
                      else "LOW")
        lines.append(f"### {attack}")
        lines.append(f"- Success rate: {success_rate:.1%}")
        lines.append(f"- Risk level: {risk_level}")
        lines.append(f"- Successful attacks: "
                     f"{data['success_count']}/{data['total_attempts']}")
        lines.append("")
    return "\n".join(lines)

Step 6: CI/CD Integration
Integrate adversarial robustness testing into your deployment pipeline.
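The workflow below invokes robustness_assessment.py with command-line flags, but the script as listed only defines functions. A hedged sketch of the missing entry point (flag names mirror the workflow step; wire the parsed values into run_assessment and json.dump in your own __main__ block):

```python
import argparse

def parse_args(argv=None):
    """Parse the flags the CI workflow passes to the assessment script.
    Illustrative sketch: flag names are assumptions matching the
    workflow invocation, not an official Counterfit CLI."""
    parser = argparse.ArgumentParser(
        description="Adversarial robustness assessment")
    parser.add_argument("--target", required=True)
    parser.add_argument("--attacks", required=True,
                        help="Comma-separated attack names, e.g. textfooler,bae")
    parser.add_argument("--samples", type=int, default=50,
                        help="Number of sample indices to attack")
    parser.add_argument("--output", default="results.json")
    args = parser.parse_args(argv)
    args.attacks = args.attacks.split(",")  # "a,b,c" -> ["a", "b", "c"]
    return args
```

Accepting argv as a parameter keeps the parser testable without touching sys.argv.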
# .github/workflows/adversarial-testing.yml
name: Adversarial Robustness Test

on:
  pull_request:
    paths:
      - 'models/**'
      - 'training/**'

jobs:
  robustness-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Counterfit
        run: |
          pip install counterfit
          pip install -r requirements-test.txt
      - name: Run robustness assessment
        run: |
          python scripts/robustness_assessment.py \
            --target my_classifier \
            --attacks textfooler,bae,pwws \
            --samples 50 \
            --output results.json
      - name: Check robustness thresholds
        run: |
          python scripts/check_thresholds.py \
            --results results.json \
            --max-success-rate 0.1

# scripts/check_thresholds.py
"""
Check adversarial robustness results against thresholds.
Exit with non-zero status if thresholds are exceeded.
"""
import json
import sys
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--results", required=True)
parser.add_argument("--max-success-rate", type=float, default=0.1)
args = parser.parse_args()
with open(args.results) as f:
results = json.load(f)
failures = []
for attack, data in results["attacks"].items():
rate = data.get("success_rate", 0)
if rate > args.max_success_rate:
failures.append(
f"{attack}: {rate:.1%} success rate "
f"(threshold: {args.max_success_rate:.1%})"
)
if failures:
print("ROBUSTNESS CHECK FAILED:")
for f in failures:
print(f" - {f}")
sys.exit(1)
else:
print("All robustness checks passed.")
sys.exit(0)
if __name__ == "__main__":
main()Common Pitfalls
- Testing only with default parameters. Counterfit attack algorithms have many configurable parameters. Default parameters may not find vulnerabilities that a tuned attack would discover. Experiment with perturbation budgets, iteration counts, and confidence thresholds.
- Ignoring semantic validity. An adversarial example that is semantically meaningless (gibberish text, unrecognizable images) is not a useful finding. Focus on attacks that produce inputs a human would interpret the same way as the original.
- Confusing robustness with security. Adversarial robustness (the model classifies correctly despite perturbations) is one component of security. A model that is adversarially robust can still be vulnerable to prompt injection, data extraction, and other attack categories.
- Not testing the full pipeline. Counterfit tests the model directly. In production, inputs pass through preprocessing, guardrails, and other components that may block adversarial examples. Test both the isolated model and the full pipeline.
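One way to address the last pitfall is to define a second target whose predict routes inputs through the same preprocessing and guardrails the production service applies. A minimal sketch, with all names (preprocess, guardrail) as hypothetical placeholders for your real pipeline components:

```python
class PipelineTarget:
    """Sketch of a target that exercises the full input pipeline,
    not just the bare model. Illustrative, not Counterfit's API."""
    def __init__(self, model_predict, preprocess, guardrail):
        self.model_predict = model_predict  # bare model scoring function
        self.preprocess = preprocess        # e.g. normalization / cleanup
        self.guardrail = guardrail          # input filter; False = reject

    def predict(self, texts):
        results = []
        for text in texts:
            cleaned = self.preprocess(text)
            if not self.guardrail(cleaned):
                # Surface blocked inputs explicitly so attacks observe
                # the pipeline's real behavior, not just the model's.
                results.append(None)
            else:
                results.append(self.model_predict(cleaned))
        return results
```

Comparing attack success rates between the bare-model target and a pipeline target shows how much protection the surrounding components actually provide.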
Further Reading
- Tool Walkthroughs Overview — Where Counterfit fits among AI security tools
- Garak Walkthrough — Complementary LLM-focused testing tool
- PyRIT Walkthrough — Microsoft's LLM red team automation
- Adversarial Embeddings — Adversarial attacks on embedding models