Counterfit Walkthrough
Complete walkthrough of Microsoft's Counterfit adversarial ML testing framework: installation, target configuration, running attacks against ML models, interpreting results, and automating adversarial robustness assessments.
Counterfit is Microsoft's open-source tool for assessing the security of machine learning models. Originally developed by Microsoft's AI red team, it provides a command-line interface for launching adversarial attacks against ML models, much as Metasploit provides a framework for network penetration testing. Counterfit abstracts the complexity of adversarial ML research papers into runnable attacks that security professionals can execute without deep ML expertise.
Counterfit wraps attack algorithms from the Adversarial Robustness Toolbox (ART), TextAttack, and other libraries into a unified interface with consistent target definition, attack configuration, and result analysis.
Step 1: Installation
# Clone the Counterfit repository
git clone https://github.com/Azure/counterfit.git
cd counterfit
# Create a virtual environment
python3 -m venv counterfit-env
source counterfit-env/bin/activate
# Install Counterfit and dependencies
pip install -e .
# Launch the Counterfit CLI
counterfit
After installation, you should see the Counterfit interactive shell:
__ _____ __
_________ __ ______ / /____ _____/ __(_) /_
/ ___/ __ \/ / / / __ \/ __/ _ \/ ___/ /_/ / __/
/ /__/ /_/ / /_/ / / / / /_/ __/ / / __/ / /_
\___/\____/\__,_/_/ /_/\__/\___/_/ /_/ /_/\__/
counterfit>
Step 2: Understanding Counterfit Architecture
Counterfit organizes its functionality around three concepts:
Targets — The ML models you want to test. Each target wraps a model endpoint (local or remote) with metadata about input format, output format, and classification labels.
Attacks — Adversarial algorithms that modify inputs to cause model misclassification or other undesired behavior. Attacks come from ART, TextAttack, and custom implementations.
Frameworks — The underlying attack libraries. Counterfit manages framework compatibility and translates between its unified interface and framework-specific APIs.
counterfit> list targets
┌──────────────────────┬────────────┬──────────────┐
│ Target Name          │ Data Type  │ Endpoint     │
├──────────────────────┼────────────┼──────────────┤
│ creditfraud          │ tabular    │ local        │
│ digits_keras         │ image      │ local        │
│ movie_reviews        │ text       │ local        │
│ satellite            │ image      │ local        │
└──────────────────────┴────────────┴──────────────┘
Step 3: Defining a Custom Target
To test your own model, create a target definition. Counterfit targets are Python classes that inherit from the appropriate base class.
# targets/my_classifier/my_classifier.py
"""
Custom Counterfit target for a sentiment classifier API.
"""
import numpy as np
import requests
from counterfit.core.targets import CFTarget
class MyClassifier(CFTarget):
    target_name = "my_classifier"
    target_data_type = "text"
    target_endpoint = "https://api.example.com/classify"
    target_input_shape = (1,)
    target_output_classes = ["negative", "neutral", "positive"]
    target_classifier = "closed-box"
    X = [
        "This product is excellent, highly recommended.",
        "The service was terrible, never going back.",
        "It was an okay experience, nothing special.",
    ]

    def load(self):
        """Initialize any required state."""
        self.api_key = self.target_config.get("api_key", "")

    def predict(self, x):
        """
        Send input to the model and return class probabilities.
        Counterfit expects a numpy array of shape
        (n_samples, n_classes).
        """
        results = []
        for text in x:
            response = requests.post(
                self.target_endpoint,
                json={"text": str(text)},
                headers={"Authorization": f"Bearer {self.api_key}"},
            )
            probs = response.json()["probabilities"]
            results.append(probs)
        return np.array(results)
Register the target:
counterfit> interact my_classifier
my_classifier>
Step 4: Running Attacks
Listing Available Attacks
my_classifier> list attacks
┌─────────────────────────┬───────────┬────────────────┐
│ Attack Name             │ Data Type │ Framework      │
├─────────────────────────┼───────────┼────────────────┤
│ hop_skip_jump           │ image     │ art            │
│ deepfool                │ image     │ art            │
│ textfooler              │ text      │ textattack     │
│ bae                     │ text      │ textattack     │
│ pwws                    │ text      │ textattack     │
│ textbugger              │ text      │ textattack     │
│ input_filter            │ text      │ textattack     │
│ a2t                     │ text      │ textattack     │
│ ... (many more)         │           │                │
└─────────────────────────┴───────────┴────────────────┘
Running a Text Attack
# Select a text attack
my_classifier> use textfooler
# Configure the attack
my_classifier>textfooler> set --sample_index 0
my_classifier>textfooler> show options
# Run the attack
my_classifier>textfooler> run
# View results
my_classifier>textfooler> show results
Interpreting Results
Attack Results:
┌──────────────────────────────────────┬────────────┐
│ Metric                               │ Value      │
├──────────────────────────────────────┼────────────┤
│ Original prediction                  │ positive   │
│ Adversarial prediction               │ negative   │
│ Original confidence                  │ 0.94       │
│ Adversarial confidence               │ 0.78       │
│ Word substitutions                   │ 3          │
│ Semantic similarity                  │ 0.89       │
│ Attack success                       │ True       │
└──────────────────────────────────────┴────────────┘
Original: "This product is excellent, highly recommended."
Adversarial: "This product is superb, deeply recommended."
The attack succeeded with only three word substitutions while maintaining 89% semantic similarity: a human would read both sentences as having the same meaning, but the model classifies them differently.
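The substitution and similarity numbers above come from TextAttack's internal metrics. As a rough sanity check, you can approximate them with the standard library alone; this is a sketch using `difflib`, not Counterfit's actual semantic-similarity computation (which uses sentence embeddings):

```python
import difflib

def word_substitutions(original: str, adversarial: str) -> int:
    """Count word-level replacements between two texts (plus any
    length difference, counted as insertions/deletions)."""
    a, b = original.split(), adversarial.split()
    return sum(1 for wa, wb in zip(a, b) if wa != wb) + abs(len(a) - len(b))

def char_similarity(original: str, adversarial: str) -> float:
    """Crude character-level similarity in [0, 1]; a stand-in for the
    embedding-based semantic similarity TextAttack reports."""
    return difflib.SequenceMatcher(None, original, adversarial).ratio()

orig = "The food was great and the staff were friendly."
adv = "The food was fine and the staff were friendly."
print(word_substitutions(orig, adv))   # 1
print(char_similarity(orig, adv))      # high, close to 1.0
```

A character-level ratio will diverge from embedding similarity on paraphrases, so treat it only as a quick plausibility check on reported results.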
Step 5: Systematic Robustness Assessment
Run multiple attacks across your entire test set to get a comprehensive robustness assessment.
# scripts/robustness_assessment.py
"""
Automated robustness assessment using Counterfit.
"""
import json
from counterfit.core import Counterfit

def run_assessment(target_name: str, attacks: list,
                   sample_indices: list) -> dict:
    """
    Run a systematic robustness assessment.
    """
    cf = Counterfit()
    target = cf.load_target(target_name)
    results = {
        "target": target_name,
        "total_samples": len(sample_indices),
        "attacks": {},
    }
    for attack_name in attacks:
        attack_results = {
            "success_count": 0,
            "total_attempts": 0,
            "samples": [],
        }
        for idx in sample_indices:
            try:
                attack = cf.load_attack(attack_name)
                attack.set_params(sample_index=idx)
                result = attack.run()
                attack_results["total_attempts"] += 1
                if result.success:
                    attack_results["success_count"] += 1
                    attack_results["samples"].append({
                        "index": idx,
                        "original": result.original_input,
                        "adversarial": result.adversarial_input,
                        "original_pred": result.original_prediction,
                        "adversarial_pred": result.adversarial_prediction,
                    })
            except Exception as e:
                print(f"Attack {attack_name} failed on sample "
                      f"{idx}: {e}")
        if attack_results["total_attempts"] > 0:
            attack_results["success_rate"] = (
                attack_results["success_count"] /
                attack_results["total_attempts"]
            )
        results["attacks"][attack_name] = attack_results
    return results

def generate_report(results: dict) -> str:
    """Generate a human-readable robustness report."""
    lines = [
        f"# Robustness Assessment: {results['target']}",
        f"Total samples tested: {results['total_samples']}",
        "",
        "## Attack Results",
        "",
    ]
    for attack, data in results["attacks"].items():
        success_rate = data.get("success_rate", 0)
        risk_level = ("CRITICAL" if success_rate > 0.5
                      else "HIGH" if success_rate > 0.2
                      else "MEDIUM" if success_rate > 0.05
                      else "LOW")
        lines.append(f"### {attack}")
        lines.append(f"- Success rate: {success_rate:.1%}")
        lines.append(f"- Risk level: {risk_level}")
        lines.append(f"- Successful attacks: "
                     f"{data['success_count']}/{data['total_attempts']}")
        lines.append("")
    return "\n".join(lines)
Step 6: CI/CD Integration
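For CI, the dict produced by the Step 5 assessment needs to land on disk as `results.json` so the threshold checker below can read it. A minimal sketch of that handoff; the field names follow the assessment script, but the numbers here are made up for illustration:

```python
import json

# Mocked output in the shape produced by run_assessment in Step 5
# (values are illustrative, not real attack results)
results = {
    "target": "my_classifier",
    "total_samples": 50,
    "attacks": {
        "textfooler": {
            "success_count": 7,
            "total_attempts": 50,
            "success_rate": 0.14,
            "samples": [],
        },
    },
}

# Write the file that check_thresholds.py consumes in CI
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```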
Integrate adversarial robustness testing into your deployment pipeline.
# .github/workflows/adversarial-testing.yml
name: Adversarial Robustness Testing
on:
  pull_request:
    paths:
      - 'models/**'
      - 'training/**'
jobs:
  robustness-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install Counterfit
        run: |
          pip install counterfit
          pip install -r requirements-test.txt
      - name: Run robustness assessment
        run: |
          python scripts/robustness_assessment.py \
            --target my_classifier \
            --attacks textfooler,bae,pwws \
            --samples 50 \
            --output results.json
      - name: Check robustness thresholds
        run: |
          python scripts/check_thresholds.py \
            --results results.json \
            --max-success-rate 0.1
# scripts/check_thresholds.py
"""
Check 對抗性 robustness results against thresholds.
Exit with non-zero status if thresholds are exceeded.
"""
import json
import sys
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--results", required=True)
parser.add_argument("--max-success-rate", type=float, default=0.1)
args = parser.parse_args()
with open(args.results) as f:
results = json.load(f)
failures = []
for attack, data in results["attacks"].items():
rate = data.get("success_rate", 0)
if rate > args.max_success_rate:
failures.append(
f"{attack}: {rate:.1%} success rate "
f"(threshold: {args.max_success_rate:.1%})"
)
if failures:
print("ROBUSTNESS CHECK FAILED:")
for f in failures:
print(f" - {f}")
sys.exit(1)
else:
print("All robustness checks passed.")
sys.exit(0)
if __name__ == "__main__":
main()Common Pitfalls
- Testing only with default parameters. Counterfit attack algorithms have many configurable parameters. Defaults may not find vulnerabilities that a tuned attack would discover. Experiment with perturbation budgets, iteration counts, and confidence thresholds.
- Ignoring semantic validity. An adversarial example that is semantically meaningless (gibberish text, unrecognizable images) is not a useful finding. Focus on attacks that produce inputs a human would interpret the same way as the original.
- Confusing robustness with security. Adversarial robustness (the model classifies correctly despite perturbations) is one component of security. A model that is adversarially robust can still be vulnerable to prompt injection, data extraction, and other attack categories.
- Not testing the full pipeline. Counterfit tests the model directly. In production, inputs pass through preprocessing, guardrails, and other components that may block adversarial examples. Test both the isolated model and the full pipeline.
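The last pitfall is easy to demonstrate with a mock. Wrapping the model in the same preprocessing and guardrail chain the production service uses can change attack outcomes in both directions. The classifier and guardrail below are toy stand-ins, not Counterfit APIs:

```python
def toy_model(text: str) -> str:
    """Stand-in classifier: predicts 'negative' only when it sees
    the literal word 'terrible'."""
    return "negative" if "terrible" in text else "positive"

def guardrail(text: str) -> bool:
    """Stand-in input filter: rejects inputs containing characters
    outside a small allowlist."""
    return all(ch.isalnum() or ch in " .,'" for ch in text)

def full_pipeline(text: str) -> str:
    """What production actually runs: guardrail first, then the model
    on normalized input."""
    if not guardrail(text):
        return "rejected"
    return toy_model(text.lower().strip())

# Character-level obfuscation fools the raw model...
adversarial = "This was t3rr!ble service."
print(toy_model(adversarial))      # "positive" -- raw model is evaded
print(full_pipeline(adversarial))  # "rejected" -- guardrail blocks it
```

An adversarial example that succeeds against the isolated model but is rejected by the pipeline (or vice versa) is exactly why both configurations need testing.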
Further Reading
- Tool Walkthroughs Overview — Where Counterfit fits among AI security tools
- Garak Walkthrough — Complementary LLM-focused testing tool
- PyRIT Walkthrough — Microsoft's LLM red-teaming automation
- Adversarial Embeddings — Adversarial attacks on embedding models