Counterfit Walkthrough
Complete walkthrough of Microsoft's Counterfit adversarial ML testing framework: installation, target configuration, running attacks against ML models, interpreting results, and automating adversarial robustness assessments.
Counterfit is Microsoft's open-source tool for assessing the security of machine learning models. Originally developed by Microsoft's AI red team, it provides a command-line interface for launching adversarial attacks against ML models, much as Metasploit provides a framework for network penetration testing. Counterfit abstracts the complexity of adversarial ML research papers into runnable attacks that security professionals can execute without deep ML expertise.
Counterfit wraps attack algorithms from the Adversarial Robustness Toolbox (ART), TextAttack, and other libraries into a unified interface with consistent target definition, attack configuration, and result analysis.
Step 1: Installation
# Clone the Counterfit repository
git clone https://github.com/Azure/counterfit.git
cd counterfit
# Create a virtual environment
python3 -m venv counterfit-env
source counterfit-env/bin/activate
# Install Counterfit and dependencies
pip install -e .
# Launch the Counterfit CLI
counterfit
After installation, you should see the Counterfit interactive shell:
__ _____ __
_________ __ ______ / /____ _____/ __(_) /_
/ ___/ __ \/ / / / __ \/ __/ _ \/ ___/ /_/ / __/
/ /__/ /_/ / /_/ / / / / /_/ __/ / / __/ / /_
\___/\____/\__,_/_/ /_/\__/\___/_/ /_/ /_/\__/
counterfit>
Step 2: Understanding Counterfit Architecture
Counterfit organizes its functionality around three concepts:
Targets — The ML models you want to test. Each target wraps a model endpoint (local or remote) with metadata about input format, output format, and classification labels.
Attacks — Adversarial algorithms that modify inputs to cause model misclassification or other undesired behavior. Attacks come from ART, TextAttack, and custom implementations.
Frameworks — The underlying attack libraries. Counterfit manages framework compatibility and translates between its unified interface and framework-specific APIs.
counterfit> list targets
┌──────────────────────┬────────────┬──────────────┐
│ Target Name          │ Data Type  │ Endpoint     │
├──────────────────────┼────────────┼──────────────┤
│ creditfraud          │ tabular    │ local        │
│ digits_keras         │ image      │ local        │
│ movie_reviews        │ text       │ local        │
│ satellite            │ image      │ local        │
└──────────────────────┴────────────┴──────────────┘
Step 3: Defining a Custom Target
To test your own model, create a target definition. Counterfit targets are Python classes that inherit from the appropriate base class.
# targets/my_classifier/my_classifier.py
"""
Custom Counterfit target for a sentiment classifier API.
"""
import numpy as np
import requests
from counterfit.core.targets import CFTarget
class MyClassifier(CFTarget):
    target_name = "my_classifier"
    target_data_type = "text"
    target_endpoint = "https://api.example.com/classify"
    target_input_shape = (1,)
    target_output_classes = ["negative", "neutral", "positive"]
    target_classifier = "closed-box"
    X = [
        "This product is excellent, highly recommended.",
        "The service was terrible, never going back.",
        "It was an okay experience, nothing special.",
    ]

    def load(self):
        """Initialize any required state."""
        self.api_key = self.target_config.get("api_key", "")

    def predict(self, x):
        """
        Send input to the model and return class probabilities.
        Counterfit expects a numpy array of shape
        (n_samples, n_classes).
        """
        results = []
        for text in x:
            response = requests.post(
                self.target_endpoint,
                json={"text": str(text)},
                headers={"Authorization": f"Bearer {self.api_key}"},
            )
            probs = response.json()["probabilities"]
            results.append(probs)
        return np.array(results)
Register the target:
counterfit> interact my_classifier
my_classifier>
Step 4: Running Attacks
Listing Available Attacks
my_classifier> list attacks
┌─────────────────────────┬───────────┬────────────────┐
│ Attack Name             │ Data Type │ Framework      │
├─────────────────────────┼───────────┼────────────────┤
│ hop_skip_jump           │ image     │ art            │
│ deepfool                │ image     │ art            │
│ textfooler              │ text      │ textattack     │
│ bae                     │ text      │ textattack     │
│ pwws                    │ text      │ textattack     │
│ textbugger              │ text      │ textattack     │
│ input_filter            │ text      │ textattack     │
│ a2t                     │ text      │ textattack     │
│ ... (many more)         │           │                │
└─────────────────────────┴───────────┴────────────────┘
Running a Text Attack
# Select a text attack
my_classifier> use textfooler
# Configure the attack
my_classifier>textfooler> set --sample_index 0
my_classifier>textfooler> show options
# Run the attack
my_classifier>textfooler> run
# View results
my_classifier>textfooler> show results
Interpreting Results
Attack Results:
┌──────────────────────────────────────┬────────────┐
│ Metric                               │ Value      │
├──────────────────────────────────────┼────────────┤
│ Original prediction                  │ positive   │
│ Adversarial prediction               │ negative   │
│ Original confidence                  │ 0.94       │
│ Adversarial confidence               │ 0.78       │
│ Word substitutions                   │ 3          │
│ Semantic similarity                  │ 0.89       │
│ Attack success                       │ True       │
└──────────────────────────────────────┴────────────┘
Original: "This product is excellent, highly recommended."
Adversarial: "This product is superb, deeply recommended."
The attack succeeded with only three word substitutions while maintaining 89% semantic similarity: a human would read both sentences as having the same meaning, but the model classifies them differently.
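The substitution and similarity numbers above come from TextAttack's internal metrics. As a rough sanity check, you can approximate them with the standard library alone; this is a sketch using `difflib`, not Counterfit's actual semantic-similarity computation (which uses sentence embeddings):

```python
import difflib

def word_substitutions(original: str, adversarial: str) -> int:
    """Count word-level replacements between two texts (plus any
    length difference, counted as insertions/deletions)."""
    a, b = original.split(), adversarial.split()
    return sum(1 for wa, wb in zip(a, b) if wa != wb) + abs(len(a) - len(b))

def char_similarity(original: str, adversarial: str) -> float:
    """Crude character-level similarity in [0, 1]; a stand-in for the
    embedding-based semantic similarity TextAttack reports."""
    return difflib.SequenceMatcher(None, original, adversarial).ratio()

orig = "The food was great and the staff were friendly."
adv = "The food was fine and the staff were friendly."
print(word_substitutions(orig, adv))   # 1
print(char_similarity(orig, adv))      # high, close to 1.0
```

A character-level ratio will diverge from embedding similarity on paraphrases, so treat it only as a quick plausibility check on reported results.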
Step 5: Systematic Robustness Assessment
Run multiple attacks across your entire test set to get a comprehensive robustness assessment.
# scripts/robustness_assessment.py
"""
Automated robustness assessment using Counterfit.
"""
import json
from counterfit.core import Counterfit

def run_assessment(target_name: str, attacks: list,
                   sample_indices: list) -> dict:
    """
    Run a systematic robustness assessment.
    """
    cf = Counterfit()
    target = cf.load_target(target_name)
    results = {
        "target": target_name,
        "total_samples": len(sample_indices),
        "attacks": {},
    }
    for attack_name in attacks:
        attack_results = {
            "success_count": 0,
            "total_attempts": 0,
            "samples": [],
        }
        for idx in sample_indices:
            try:
                attack = cf.load_attack(attack_name)
                attack.set_params(sample_index=idx)
                result = attack.run()
                attack_results["total_attempts"] += 1
                if result.success:
                    attack_results["success_count"] += 1
                    attack_results["samples"].append({
                        "index": idx,
                        "original": result.original_input,
                        "adversarial": result.adversarial_input,
                        "original_pred": result.original_prediction,
                        "adversarial_pred": result.adversarial_prediction,
                    })
            except Exception as e:
                print(f"Attack {attack_name} failed on sample "
                      f"{idx}: {e}")
        if attack_results["total_attempts"] > 0:
            attack_results["success_rate"] = (
                attack_results["success_count"] /
                attack_results["total_attempts"]
            )
        results["attacks"][attack_name] = attack_results
    return results

def generate_report(results: dict) -> str:
    """Generate a human-readable robustness report."""
    lines = [
        f"# Robustness Assessment: {results['target']}",
        f"Total samples tested: {results['total_samples']}",
        "",
        "## Attack Results",
        "",
    ]
    for attack, data in results["attacks"].items():
        success_rate = data.get("success_rate", 0)
        risk_level = ("CRITICAL" if success_rate > 0.5
                      else "HIGH" if success_rate > 0.2
                      else "MEDIUM" if success_rate > 0.05
                      else "LOW")
        lines.append(f"### {attack}")
        lines.append(f"- Success rate: {success_rate:.1%}")
        lines.append(f"- Risk level: {risk_level}")
        lines.append(f"- Successful attacks: "
                     f"{data['success_count']}/{data['total_attempts']}")
        lines.append("")
    return "\n".join(lines)
Step 6: CI/CD Integration
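For CI, the dict produced by the Step 5 assessment needs to land on disk as `results.json` so the threshold checker below can read it. A minimal sketch of that handoff; the field names follow the assessment script, but the numbers here are made up for illustration:

```python
import json

# Mocked output in the shape produced by run_assessment in Step 5
# (values are illustrative, not real attack results)
results = {
    "target": "my_classifier",
    "total_samples": 50,
    "attacks": {
        "textfooler": {
            "success_count": 7,
            "total_attempts": 50,
            "success_rate": 0.14,
            "samples": [],
        },
    },
}

# Write the file that check_thresholds.py consumes in CI
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```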
Integrate adversarial robustness testing into your deployment pipeline.
# .github/workflows/adversarial-testing.yml
name: Adversarial Robustness Testing
on:
  pull_request:
    paths:
      - 'models/**'
      - 'training/**'
jobs:
  robustness-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Counterfit
        run: |
          pip install counterfit
          pip install -r requirements-test.txt
      - name: Run robustness assessment
        run: |
          python scripts/robustness_assessment.py \
            --target my_classifier \
            --attacks textfooler,bae,pwws \
            --samples 50 \
            --output results.json
      - name: Check robustness thresholds
        run: |
          python scripts/check_thresholds.py \
            --results results.json \
            --max-success-rate 0.1
# scripts/check_thresholds.py
"""
Check 對抗性 robustness results against thresholds.
Exit with non-zero status if thresholds are exceeded.
"""
import json
import sys
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--results", required=True)
parser.add_argument("--max-success-rate", type=float, default=0.1)
args = parser.parse_args()
with open(args.results) as f:
results = json.load(f)
failures = []
for attack, data in results["attacks"].items():
rate = data.get("success_rate", 0)
if rate > args.max_success_rate:
failures.append(
f"{attack}: {rate:.1%} success rate "
f"(threshold: {args.max_success_rate:.1%})"
)
if failures:
print("ROBUSTNESS CHECK FAILED:")
for f in failures:
print(f" - {f}")
sys.exit(1)
else:
print("All robustness checks passed.")
sys.exit(0)
if __name__ == "__main__":
main()Common Pitfalls
- Testing only with default parameters. Counterfit attack algorithms have many configurable parameters. Defaults may not find vulnerabilities that a tuned attack would discover. Experiment with perturbation budgets, iteration counts, and confidence thresholds.
- Ignoring semantic validity. An adversarial example that is semantically meaningless (gibberish text, unrecognizable images) is not a useful finding. Focus on attacks that produce inputs a human would interpret the same way as the original.
- Confusing robustness with security. Adversarial robustness (the model classifies correctly despite perturbations) is one component of security. A model that is adversarially robust can still be vulnerable to prompt injection, data extraction, and other attack categories.
- Not testing the full pipeline. Counterfit tests the model directly. In production, inputs pass through preprocessing, guardrails, and other components that may block adversarial examples. Test both the isolated model and the full pipeline.
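The last pitfall is easy to demonstrate with a mock. Wrapping the model in the same preprocessing and guardrail chain the production service uses can change attack outcomes in both directions. The classifier and guardrail below are toy stand-ins, not Counterfit APIs:

```python
def toy_model(text: str) -> str:
    """Stand-in classifier: predicts 'negative' only when it sees
    the literal word 'terrible'."""
    return "negative" if "terrible" in text else "positive"

def guardrail(text: str) -> bool:
    """Stand-in input filter: rejects inputs containing characters
    outside a small allowlist."""
    return all(ch.isalnum() or ch in " .,'" for ch in text)

def full_pipeline(text: str) -> str:
    """What production actually runs: guardrail first, then the model
    on normalized input."""
    if not guardrail(text):
        return "rejected"
    return toy_model(text.lower().strip())

# Character-level obfuscation fools the raw model...
adversarial = "This was t3rr!ble service."
print(toy_model(adversarial))      # "positive" -- raw model is evaded
print(full_pipeline(adversarial))  # "rejected" -- guardrail blocks it
```

An adversarial example that succeeds against the isolated model but is rejected by the pipeline (or vice versa) is exactly why both configurations need testing.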
Further Reading
- Tool Walkthroughs Overview — Where Counterfit fits among AI security tools
- Garak Walkthrough — Complementary LLM-focused testing tool
- PyRIT Walkthrough — Microsoft's LLM red-teaming automation
- Adversarial Embeddings — Adversarial attacks on embedding models