Counterfit Walkthrough
Complete walkthrough of Microsoft's Counterfit adversarial ML testing framework: installation, target configuration, running attacks against ML models, interpreting results, and automating adversarial robustness assessments.
Counterfit is Microsoft's open-source tool for assessing the security of machine learning models. Originally developed by Microsoft's AI Red Team, it provides a command-line interface for launching adversarial attacks against ML models, similar to how Metasploit provides a framework for network penetration testing. Counterfit abstracts away the complexity of adversarial ML research papers into runnable attacks that security professionals can execute without deep ML expertise.
Counterfit wraps attack algorithms from the Adversarial Robustness Toolbox (ART), TextAttack, and other libraries into a unified interface with consistent target definition, attack configuration, and result analysis.
Step 1: Installation
# Clone the Counterfit repository
git clone https://github.com/Azure/counterfit.git
cd counterfit
# Create a virtual environment
python3 -m venv counterfit-env
source counterfit-env/bin/activate
# Install Counterfit and dependencies
pip install -e .
# Launch the Counterfit CLI
counterfit
After installation, you should see the Counterfit interactive shell:
__ _____ __
_________ __ ______ / /____ _____/ __(_) /_
/ ___/ __ \/ / / / __ \/ __/ _ \/ ___/ /_/ / __/
/ /__/ /_/ / /_/ / / / / /_/ __/ / / __/ / /_
\___/\____/\__,_/_/ /_/\__/\___/_/ /_/ /_/\__/
counterfit>
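If the shell fails to start, a common cause is a missing or partially installed attack framework. A quick, dependency-free check of whether the libraries Counterfit wraps are importable (the helper is ours, not part of Counterfit):

```python
import importlib.util

def framework_status(names=("art", "textattack")):
    """Return {module_name: importable?} for the attack libraries
    Counterfit wraps. Illustrative helper, not part of Counterfit."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

print(framework_status())
```

Any `False` entry points at a framework whose attacks will be unavailable until its dependencies are reinstalled.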
Step 2: Understanding Counterfit Architecture
Counterfit organizes its functionality around three concepts:
Targets — The ML models you want to test. Each target wraps a model endpoint (local or remote) with metadata about input format, output format, and classification labels.
Attacks — Adversarial algorithms that modify inputs to cause model misclassification or other undesired behavior. Attacks come from ART, TextAttack, and custom implementations.
Frameworks — The underlying attack libraries. Counterfit manages framework compatibility and translates between its unified interface and framework-specific APIs.
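To make the relationship between these concepts concrete, here is a deliberately tiny sketch (not Counterfit's real API, and no real attack algorithm): a target wraps a model's predict function, and an attack perturbs an input until the target's label flips.

```python
# Toy illustration of Counterfit's target/attack split -- NOT the real API.
class ToyTarget:
    """Wraps a model: knows only how to score an input."""
    def predict(self, x: float) -> str:
        return "positive" if x > 0.5 else "negative"

class ToyAttack:
    """Perturbs an input until the target's prediction flips."""
    def __init__(self, step: float = 0.1, max_iters: int = 20):
        self.step, self.max_iters = step, max_iters

    def run(self, target: ToyTarget, x: float):
        original = target.predict(x)
        adv = x
        for _ in range(self.max_iters):
            adv -= self.step  # nudge the input toward the decision boundary
            if target.predict(adv) != original:
                return adv, True  # adversarial example found
        return adv, False

adv, flipped = ToyAttack().run(ToyTarget(), x=0.7)
# flipped is True: the prediction changed from "positive" to "negative"
```

In real Counterfit, these roles are filled by target classes, attack wrappers, and the ART/TextAttack backends that supply the actual perturbation logic.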
counterfit> list targets
┌──────────────────────┬────────────┬──────────────┐
│ Target Name │ Data Type │ Endpoint │
├──────────────────────┼────────────┼──────────────┤
│ creditfraud │ tabular │ local │
│ digits_keras │ image │ local │
│ movie_reviews │ text │ local │
│ satellite │ image │ local │
└──────────────────────┴────────────┴──────────────┘
Step 3: Defining a Custom Target
To test your own model, create a target definition. Counterfit targets are Python classes that inherit from the appropriate base class.
# targets/my_classifier/my_classifier.py
"""
Custom Counterfit target for a sentiment classifier API.
"""
import numpy as np
import requests
from counterfit.core.targets import CFTarget
class MyClassifier(CFTarget):
    target_name = "my_classifier"
    target_data_type = "text"
    target_endpoint = "https://api.example.com/classify"
    target_input_shape = (1,)
    target_output_classes = ["negative", "neutral", "positive"]
    target_classifier = "closed-box"
    X = [
        "This product is excellent, highly recommended.",
        "The service was terrible, never going back.",
        "It was an okay experience, nothing special.",
    ]

    def load(self):
        """Initialize any required state."""
        self.api_key = self.target_config.get("api_key", "")

    def predict(self, x):
        """
        Send input to the model and return class probabilities.
        Counterfit expects a numpy array of shape
        (n_samples, n_classes).
        """
        results = []
        for text in x:
            response = requests.post(
                self.target_endpoint,
                json={"text": str(text)},
                headers={"Authorization": f"Bearer {self.api_key}"},
                timeout=30,  # avoid hanging the shell on a stalled endpoint
            )
            probs = response.json()["probabilities"]
            results.append(probs)
        return np.array(results)

With the target file in place, relaunch Counterfit so it picks up the new target, then select it:
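Before moving into the shell, it is worth sanity-checking that predict returns what Counterfit expects: an (n_samples, n_classes) matrix of probabilities. A dependency-free checker (the helper name is ours, not Counterfit's):

```python
def check_prediction_output(probs, n_samples, n_classes, tol=1e-3):
    """Validate an (n_samples, n_classes) probability matrix,
    e.g. the rows a target's predict() method returns."""
    if len(probs) != n_samples:
        return False
    for row in probs:
        if len(row) != n_classes:
            return False
        if any(p < 0 or p > 1 for p in row):
            return False
        if abs(sum(row) - 1.0) > tol:  # each row should sum to ~1
            return False
    return True
```

Malformed shapes or unnormalized scores here are the most common reason attacks fail with cryptic framework errors later.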
counterfit> interact my_classifier
my_classifier>
Step 4: Running Attacks
Listing Available Attacks
my_classifier> list attacks
┌─────────────────────────┬───────────┬────────────────┐
│ Attack Name │ Data Type │ Framework │
├─────────────────────────┼───────────┼────────────────┤
│ hop_skip_jump │ image │ art │
│ deepfool │ image │ art │
│ textfooler │ text │ textattack │
│ bae │ text │ textattack │
│ pwws │ text │ textattack │
│ textbugger │ text │ textattack │
│ input_filter │ text │ textattack │
│ a2t │ text │ textattack │
│ ... (many more) │ │ │
└─────────────────────────┴───────────┴────────────────┘
Running a Text Attack
# Select a text attack
my_classifier> use textfooler
# Configure the attack
my_classifier>textfooler> set --sample_index 0
my_classifier>textfooler> show options
# Run the attack
my_classifier>textfooler> run
# View results
my_classifier>textfooler> show results
Interpreting Results
Attack Results:
┌──────────────────────────────────────┬────────────┐
│ Metric                               │ Value      │
├──────────────────────────────────────┼────────────┤
│ Original prediction                  │ positive   │
│ Adversarial prediction               │ negative   │
│ Original confidence                  │ 0.94       │
│ Adversarial confidence               │ 0.78       │
│ Word substitutions                   │ 2          │
│ Semantic similarity                  │ 0.89       │
│ Attack success                       │ True       │
└──────────────────────────────────────┴────────────┘
Original: "This product is excellent, highly recommended."
Adversarial: "This product is superb, deeply recommended."
The attack succeeded with only two word substitutions while maintaining 89% semantic similarity: a human would read both sentences as having the same meaning, but the model classifies them differently.
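The substitution count can be reproduced with a simple token-by-token comparison. The similarity function below is only a character-level stand-in via difflib; TextFooler's reported semantic similarity comes from sentence embeddings, so do not expect this proxy to reproduce the 0.89 figure.

```python
import difflib

def word_substitutions(original: str, adversarial: str) -> int:
    """Count position-wise token differences between two sentences,
    which is how word-swap attacks like TextFooler perturb text."""
    orig_tokens = original.lower().split()
    adv_tokens = adversarial.lower().split()
    return sum(a != b for a, b in zip(orig_tokens, adv_tokens))

def surface_similarity(original: str, adversarial: str) -> float:
    """Character-level similarity in [0, 1]; a rough proxy only."""
    return difflib.SequenceMatcher(None, original, adversarial).ratio()

orig = "This product is excellent, highly recommended."
adv = "This product is superb, deeply recommended."
```

Checks like these are useful when triaging results in bulk: a "successful" attack with many substitutions or low similarity is often not a meaningful finding.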
Step 5: Systematic Robustness Assessment
Run multiple attacks across your entire test set to get a comprehensive robustness assessment.
# scripts/robustness_assessment.py
"""
Automated robustness assessment using Counterfit.

NOTE: programmatic API names (Counterfit(), load_target, load_attack)
vary across Counterfit versions; adapt these calls to your install.
"""
from counterfit.core import Counterfit


def run_assessment(target_name: str, attacks: list,
                   sample_indices: list) -> dict:
    """
    Run a systematic robustness assessment.
    """
    cf = Counterfit()
    target = cf.load_target(target_name)
    results = {
        "target": target_name,
        "total_samples": len(sample_indices),
        "attacks": {},
    }
    for attack_name in attacks:
        attack_results = {
            "success_count": 0,
            "total_attempts": 0,
            "samples": [],
        }
        for idx in sample_indices:
            try:
                attack = cf.load_attack(attack_name)
                attack.set_params(sample_index=idx)
                result = attack.run()
                attack_results["total_attempts"] += 1
                if result.success:
                    attack_results["success_count"] += 1
                    attack_results["samples"].append({
                        "index": idx,
                        "original": result.original_input,
                        "adversarial": result.adversarial_input,
                        "original_pred": result.original_prediction,
                        "adversarial_pred": result.adversarial_prediction,
                    })
            except Exception as e:
                print(f"Attack {attack_name} failed on sample "
                      f"{idx}: {e}")
        if attack_results["total_attempts"] > 0:
            attack_results["success_rate"] = (
                attack_results["success_count"] /
                attack_results["total_attempts"]
            )
        results["attacks"][attack_name] = attack_results
    return results


def generate_report(results: dict) -> str:
    """Generate a human-readable robustness report."""
    lines = [
        f"# Robustness Assessment: {results['target']}",
        f"Total samples tested: {results['total_samples']}",
        "",
        "## Attack Results",
        "",
    ]
    for attack, data in results["attacks"].items():
        success_rate = data.get("success_rate", 0)
        risk_level = ("CRITICAL" if success_rate > 0.5
                      else "HIGH" if success_rate > 0.2
                      else "MEDIUM" if success_rate > 0.05
                      else "LOW")
        lines.append(f"### {attack}")
        lines.append(f"- Success rate: {success_rate:.1%}")
        lines.append(f"- Risk level: {risk_level}")
        lines.append(f"- Successful attacks: "
                     f"{data['success_count']}/{data['total_attempts']}")
        lines.append("")
    return "\n".join(lines)

Step 6: CI/CD Integration
Integrate adversarial robustness testing into your deployment pipeline.
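The workflow below invokes robustness_assessment.py with command-line flags, but the script as listed only defines functions. A hedged sketch of the missing entry point (flag names mirror the workflow step; wire the parsed values into run_assessment and json.dump in your own __main__ block):

```python
import argparse

def parse_args(argv=None):
    """Parse the flags the CI workflow passes to the assessment script.
    Illustrative sketch: flag names are assumptions matching the
    workflow invocation, not an official Counterfit CLI."""
    parser = argparse.ArgumentParser(
        description="Adversarial robustness assessment")
    parser.add_argument("--target", required=True)
    parser.add_argument("--attacks", required=True,
                        help="Comma-separated attack names, e.g. textfooler,bae")
    parser.add_argument("--samples", type=int, default=50,
                        help="Number of sample indices to attack")
    parser.add_argument("--output", default="results.json")
    args = parser.parse_args(argv)
    args.attacks = args.attacks.split(",")  # "a,b,c" -> ["a", "b", "c"]
    return args
```

Accepting argv as a parameter keeps the parser testable without touching sys.argv.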
# .github/workflows/adversarial-testing.yml
name: Adversarial Robustness Test

on:
  pull_request:
    paths:
      - 'models/**'
      - 'training/**'

jobs:
  robustness-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Counterfit
        run: |
          pip install counterfit
          pip install -r requirements-test.txt
      - name: Run robustness assessment
        run: |
          python scripts/robustness_assessment.py \
            --target my_classifier \
            --attacks textfooler,bae,pwws \
            --samples 50 \
            --output results.json
      - name: Check robustness thresholds
        run: |
          python scripts/check_thresholds.py \
            --results results.json \
            --max-success-rate 0.1

# scripts/check_thresholds.py
"""
Check adversarial robustness results against thresholds.
Exit with non-zero status if thresholds are exceeded.
"""
import json
import sys
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--results", required=True)
parser.add_argument("--max-success-rate", type=float, default=0.1)
args = parser.parse_args()
with open(args.results) as f:
results = json.load(f)
failures = []
for attack, data in results["attacks"].items():
rate = data.get("success_rate", 0)
if rate > args.max_success_rate:
failures.append(
f"{attack}: {rate:.1%} success rate "
f"(threshold: {args.max_success_rate:.1%})"
)
if failures:
print("ROBUSTNESS CHECK FAILED:")
for f in failures:
print(f" - {f}")
sys.exit(1)
else:
print("All robustness checks passed.")
sys.exit(0)
if __name__ == "__main__":
main()Common Pitfalls
- Testing only with default parameters. Counterfit attack algorithms have many configurable parameters. Default parameters may not find vulnerabilities that a tuned attack would discover. Experiment with perturbation budgets, iteration counts, and confidence thresholds.
- Ignoring semantic validity. An adversarial example that is semantically meaningless (gibberish text, unrecognizable images) is not a useful finding. Focus on attacks that produce inputs a human would interpret the same way as the original.
- Confusing robustness with security. Adversarial robustness (the model classifies correctly despite perturbations) is one component of security. A model that is adversarially robust can still be vulnerable to prompt injection, data extraction, and other attack categories.
- Not testing the full pipeline. Counterfit tests the model directly. In production, inputs pass through preprocessing, guardrails, and other components that may block adversarial examples. Test both the isolated model and the full pipeline.
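One way to address the last pitfall is to define a second target whose predict routes inputs through the same preprocessing and guardrails the production service applies. A minimal sketch, with all names (preprocess, guardrail) as hypothetical placeholders for your real pipeline components:

```python
class PipelineTarget:
    """Sketch of a target that exercises the full input pipeline,
    not just the bare model. Illustrative, not Counterfit's API."""
    def __init__(self, model_predict, preprocess, guardrail):
        self.model_predict = model_predict  # bare model scoring function
        self.preprocess = preprocess        # e.g. normalization / cleanup
        self.guardrail = guardrail          # input filter; False = reject

    def predict(self, texts):
        results = []
        for text in texts:
            cleaned = self.preprocess(text)
            if not self.guardrail(cleaned):
                # Surface blocked inputs explicitly so attacks observe
                # the pipeline's real behavior, not just the model's.
                results.append(None)
            else:
                results.append(self.model_predict(cleaned))
        return results
```

Comparing attack success rates between the bare-model target and a pipeline target shows how much protection the surrounding components actually provide.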
Further Reading
- Tool Walkthroughs Overview — Where Counterfit fits among AI security tools
- Garak Walkthrough — Complementary LLM-focused testing tool
- PyRIT Walkthrough — Microsoft's LLM red team automation
- Adversarial Embeddings — Adversarial attacks on embedding models