# evaluation

community, project, benchmark, evaluation

Community Project: Benchmark Suite

Community-developed benchmark suite for evaluating LLM security that covers injection, exfiltration, jailbreaking, and agent exploitation attack classes.

data-training, benchmarks, evaluation, gaming

Benchmark Gaming Attacks

Techniques for gaming evaluation benchmarks to make poisoned or compromised models appear safe and capable during standard safety evaluations.

defense, benchmarking, evaluation

Safety Layer Benchmarking Methodology

Standardized methodology for benchmarking the effectiveness of LLM safety layers against diverse attack categories.

defense, evaluation, methodology, testing

Defense Evaluation Methodology

Systematic methodology for evaluating the effectiveness of AI defenses against known attack categories.

evaluation, metrics, benchmarks, measurement, defense-testing

Evaluating Defense Effectiveness

Metrics, benchmarks, and methodology for measuring how well AI defenses work against real attacks, including evaluation pitfalls and best practices.

harmbench, benchmark, evaluation, standardized

HarmBench: Standardized Red Team Evaluation

Deep dive into the HarmBench framework for standardized red team evaluation: attack methods, evaluation pipeline, running benchmarks, interpreting results, and comparing model safety across providers.

inspect-ai, aisi, evaluation, framework

Inspect AI: UK AISI Evaluation Framework

Deep dive into the UK AI Safety Institute's Inspect framework: task design, solvers, scorers, building custom evaluations, and comparison to other AI evaluation frameworks.

promptfoo, testing, evaluation, tools

promptfoo for Red Teaming

Deep dive into promptfoo for AI red teaming: YAML configuration, assertion-based testing, red team plugins, custom evaluators, and regression testing workflows for LLM security.

exploit-dev, scoring, evaluation, metrics

Result Scoring Systems

Designing automated scoring systems for evaluating attack success, including semantic classifiers, rule-based detectors, and LLM-as-judge approaches.

fine-tuning, evaluation, evasion, safety-testing

Evaluation Evasion in Fine-Tuning

Crafting fine-tuned models that pass standard safety evaluations while containing hidden unsafe behaviors that activate under specific conditions.

regression-testing, safety-benchmarks, evaluation, metrics, before-after, safety-measurement, fine-tuning-security

Safety Regression Testing

Quantitative methods for measuring safety changes before and after fine-tuning -- benchmark selection, automated safety test suites, statistical methodology for safety regression, and building comprehensive before/after evaluation pipelines.

foundations, evaluation, benchmarking, metrics

Evaluation and Benchmarking Basics

Introduction to LLM security evaluation including key metrics, benchmark suites, and the challenges of measuring safety properties.

alignment-faking, detection, interpretability, behavioral-testing, ai-safety, evaluation

Alignment Faking Detection Methods

Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.

alignment-faking, training, rlhf, safety-training, evaluation, ai-safety

Training Implications of Alignment Faking

How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.

frontier-research, agent-safety, benchmarks, evaluation

LLM Agent Safety Benchmarks

Survey of agent safety benchmarks and evaluation frameworks for assessing autonomous AI system risks.

evaluation, harness, automation, infrastructure

Building Evaluation Harnesses

Design and implement evaluation harnesses for AI red teaming: architecture patterns, judge model selection, prompt dataset management, scoring pipelines, and reproducible evaluation infrastructure.

benchmarks, evaluation, safety

AI Safety Benchmarks & Evaluation

Overview of AI safety evaluation: benchmarking frameworks, safety metrics, evaluation methodologies, and the landscape of standardized assessment tools for AI red teaming.

metrics, methodology, evaluation, reporting

Red Team Metrics Beyond ASR

Comprehensive metrics methodology for AI red teaming beyond Attack Success Rate: severity-weighted scoring, defense depth metrics, coverage analysis, and stakeholder-appropriate reporting frameworks.

statistics, methodology, evaluation, rigor

Statistical Rigor in AI Red Teaming

Statistical methodology for AI red teaming: sample size determination, confidence intervals, hypothesis testing for safety claims, handling non-determinism, and avoiding common statistical pitfalls.

governance, compliance, legal, ethics, frameworks, evaluation, benchmarking

Governance & Compliance

AI governance frameworks, legal and ethical considerations, evaluation and benchmarking methodologies, and compliance tools for responsible AI red teaming and deployment.

injection-research, benchmarks, design, evaluation

Injection Benchmark Design

Designing robust benchmarks for evaluating injection attack and defense effectiveness.

research, benchmarking, methodology, evaluation

Injection Benchmarking Methodology

Standardized methodologies for benchmarking injection attacks and defenses to enable meaningful comparison across research papers and tools.

labs, evaluation, framework-gaming, advanced

Lab: Evaluation Framework Gaming

Demonstrate how to game safety evaluation frameworks to produce artificially high safety scores while retaining vulnerabilities.

labs, harmbench, custom-attacks, evaluation, advanced

HarmBench Custom Attack Submission

Develop and evaluate custom attack methods against the HarmBench standardized evaluation framework.

labs, promptfoo, evaluation, beginner

Setting Up Promptfoo for LLM Evaluation

Configure Promptfoo to create automated test suites for evaluating LLM safety and robustness.

lab, promptfoo, tool-setup, evaluation, beginner, hands-on

Lab: Promptfoo Setup and First Eval

Install and configure promptfoo for systematic LLM evaluation, then run your first red team evaluation to test model safety boundaries.

labs, tool-comparison, evaluation, beginner

Lab: Comparing Red Team Testing Tools

Compare Garak, PyRIT, and Promptfoo capabilities through hands-on exercises using each tool against the same target.

labs, harmbench, evaluation, beginner

Your First HarmBench Evaluation

Run a standardized safety evaluation using the HarmBench framework against a target model.

labs, inspect-ai, evaluation, beginner

Your First Inspect AI Evaluation

Set up and run a basic AI safety evaluation using the UK AISI Inspect framework.

lab, expert, guardrails, evaluation, defense-testing, hands-on

Lab: Build Guardrail Evaluator

Build an automated framework for evaluating AI guardrails and safety filters. Test input filters, output classifiers, content moderation systems, and defense-in-depth architectures for coverage gaps and bypass vulnerabilities.

lab, expert, benchmark, evaluation, safety, hands-on

Lab: Create a Safety Benchmark

Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.

lab, llm-judge, evaluation, automation

Lab: Building an LLM Judge Evaluator

Hands-on lab for building an LLM-based evaluator to score red team attack outputs, compare model vulnerability, and lay the foundation for automated attack campaigns.

comparison, cross-model, methodology, evaluation, red-teaming, benchmarking

Cross-Model Comparison

Methodology for systematically comparing LLM security across model families, including standardized evaluation frameworks, architectural difference analysis, and comparative testing approaches.

multimodal, benchmarking, safety, evaluation, vlm

Benchmarking Multimodal Model Safety

Designing and implementing safety benchmarks for multimodal AI models that process images, audio, and video alongside text, covering cross-modal attack evaluation, consistency testing, and safety score aggregation.

reference, benchmark, comparison, evaluation

Benchmark Suite Comparison

Comparison of AI safety benchmark suites including HarmBench, JailbreakBench, and custom evaluation frameworks with coverage analysis.

training-pipeline, benchmarks, gaming, evaluation

Evaluation Benchmark Gaming

Techniques for gaming evaluation benchmarks to mask vulnerabilities or inflate safety scores.

training, evaluation, contamination

Evaluation Set Contamination Attacks

Attacking evaluation benchmarks and test sets to create false impressions of model safety and capability.

llm-judge, output-validation, safety, evaluation, defense, walkthrough

LLM Judge Implementation

Step-by-step walkthrough for using an LLM to judge another LLM's outputs for safety and quality, covering judge prompt design, scoring rubrics, calibration, cost optimization, and deployment patterns.

walkthroughs, harmbench, evaluation, benchmark

HarmBench Evaluation Walkthrough

Run standardized attack evaluations using the HarmBench framework and interpret results.

harmbench, evaluation, benchmarks, safety, red-team-automation, walkthrough

HarmBench Evaluation Framework Walkthrough

Complete walkthrough of the HarmBench evaluation framework: installation, running standardized benchmarks against models, interpreting results, creating custom behavior evaluations, and comparing model safety across versions.

inspect-ai, aisi, evaluation, benchmarks, scorers, compliance, walkthrough

Inspect AI Walkthrough

Complete walkthrough of UK AISI's Inspect AI framework: installation, writing evaluations, running against models, custom scorers, benchmark suites, and producing compliance-ready reports.

promptfoo, evaluation, beginner, first-eval, walkthrough

Running Your First Promptfoo Evaluation

Beginner walkthrough for running your first promptfoo evaluation from scratch, covering installation, configuration, test case creation, assertion writing, and result interpretation.

promptfoo, automation, red-team, evaluation, ci-cd, walkthrough

Automating Red Team Evaluations with Promptfoo

Complete walkthrough for setting up automated red team evaluation pipelines using Promptfoo, covering configuration, custom evaluators, adversarial dataset generation, CI integration, and result analysis.

walkthroughs, promptfoo, red-team, evaluation

Promptfoo for Red Team Evaluation

Configure Promptfoo for comprehensive red team evaluation with custom assertions and graders.

promptfoo, evaluation, red-team, ci-cd, yaml, assertions, walkthrough

Promptfoo End-to-End Walkthrough

Complete walkthrough of promptfoo for AI red teaming: configuration files, provider setup, running evaluations, red team plugins, assertion-based scoring, reporting, and CI/CD integration.

pyrit, scorers, custom-scoring, evaluation, walkthrough

Creating Custom Scorers for PyRIT Attack Evaluation

Intermediate walkthrough on building custom PyRIT scorers for evaluating attack success, including pattern-based, LLM-based, and multi-criteria scoring approaches.

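assessment, agents, tool-use, mcp, evaluation

Chapter Assessment: Agent Exploitation

A 15-question calibrated assessment testing your understanding of agentic AI exploitation: tool abuse, MCP attacks, memory poisoning, and multi-agent security.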

assessment, agentic, evaluation

Chapter Assessment: Agentic

A 15-question calibrated assessment testing your understanding of agentic AI exploitation.

assessment, cloud-ai, evaluation

Chapter Assessment: Cloud AI

A 15-question calibrated assessment testing your understanding of cloud AI platform security.

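assessment, code-generation, poisoning, evaluation

Chapter Assessment: Code Generation Security

A 15-question calibrated assessment testing your understanding of AI code generation security: suggestion poisoning, training data extraction, and IDE risks.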

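assessment, defense, fundamentals, evaluation

Chapter Assessment: Defense Fundamentals

A 15-question calibrated assessment testing your understanding of the fundamentals of AI defense mechanisms.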

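assessment, defense, evaluation

Chapter Assessment: Defense

A 15-question calibrated assessment testing your understanding of AI defense and mitigation strategies.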

assessment, embeddings, vector-security, evaluation

Chapter Assessment: Embeddings

A 15-question calibrated assessment testing your understanding of embeddings and vector security.

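assessment, fine-tuning, alignment, backdoors, evaluation

Chapter Assessment: Fine-Tuning Security

A 15-question calibrated assessment testing your understanding of fine-tuning security: alignment erosion, backdoor implantation, and LoRA adapter risks.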

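assessment, forensics, incident-response, evaluation

Chapter Assessment: AI Forensics

A 15-question calibrated assessment testing your understanding of AI forensics and incident response: evidence collection, log analysis, and model behavior investigation.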

assessment, foundations, evaluation

Chapter Assessment: Foundations

A 15-question calibrated assessment testing your understanding of AI security fundamentals.

assessment, governance, evaluation

Chapter Assessment: Governance

A 15-question calibrated assessment testing your understanding of AI governance and compliance.

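assessment, guardrails, safety-layers, evaluation

Chapter Assessment: Guardrails

A 15-question calibrated assessment testing your understanding of guardrail architecture and safety layer implementation.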

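assessment, impact, risk-assessment, evaluation

Chapter Assessment: Impact

A 15-question calibrated assessment testing your understanding of the real-world impact categories of AI attacks.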

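assessment, infrastructure, supply-chain, cloud, evaluation

Chapter Assessment: Infrastructure

A 15-question calibrated assessment testing your understanding of AI infrastructure security: supply chain, API security, cloud deployment, and model serving.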

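assessment, jailbreaking, safety-bypass, evaluation

Chapter Assessment: Jailbreaking

A 15-question calibrated assessment testing your understanding of jailbreaking techniques: role-play, encoding, many-shot, gradual escalation, and gradient-based methods.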

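assessment, legal, ethics, compliance, evaluation

Chapter Assessment: Legal and Ethics

A 15-question calibrated assessment testing your understanding of legal and ethical considerations in AI red teaming: authorization, disclosure, international law, and compliance.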

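assessment, llmops, model-serving, inference, evaluation

Chapter Assessment: LLMOps Security

A 15-question calibrated assessment testing your understanding of LLMOps security: model serving, inference security, caching risks, and ML pipeline security.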

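assessment, mcp, protocol-security, tool-use, evaluation

Chapter Assessment: MCP Security

A 15-question calibrated assessment testing your understanding of Model Context Protocol security: tool shadowing, transport attacks, server vetting, and configuration security.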

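assessment, methodology, engagement, evaluation

Chapter Assessment: Methodology

A 15-question calibrated assessment testing your understanding of AI red team engagement methodology: scoping, reconnaissance, execution, and reporting.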

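assessment, model-extraction, ip-theft, evaluation

Chapter Assessment: Model Extraction

A 15-question calibrated assessment testing your understanding of model extraction and intellectual property theft techniques.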

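assessment, monitoring, observability, detection, evaluation

Chapter Assessment: Monitoring

A 15-question calibrated assessment testing your understanding of AI system monitoring and observability: anomaly detection, behavioral baselines, and security event correlation.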

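assessment, multimodal, vision, audio, evaluation

Chapter Assessment: Multimodal Security

A 15-question calibrated assessment testing your understanding of multimodal AI security: visual injection, audio attacks, and cross-modal exploitation.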

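assessment, privacy, pii, membership-inference, evaluation

Chapter Assessment: Privacy Attacks

A 15-question calibrated assessment testing your understanding of privacy attacks in AI systems: PII extraction, membership inference, and model inversion.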

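assessment, professional, evaluation

Chapter Assessment: Professional Practice

A 15-question calibrated assessment testing your understanding of professional red team operations and reporting.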

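assessment, prompt-injection, evaluation

Chapter Assessment: Prompt Injection

A 15-question calibrated assessment testing your understanding of prompt injection and jailbreaking techniques.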

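assessment, rag, data-attacks, evaluation

Chapter Assessment: RAG Attacks

A 15-question calibrated assessment testing your understanding of RAG pipeline attacks: knowledge base poisoning, retrieval manipulation, and data extraction.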

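assessment, recon, reconnaissance, evaluation

Chapter Assessment: Reconnaissance

A 15-question calibrated assessment testing your understanding of AI red team reconnaissance techniques: system prompt extraction, capability mapping, and target profiling.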

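assessment, tools, garak, pyrit, promptfoo, evaluation

Chapter Assessment: Tool Proficiency

A 15-question calibrated assessment testing your understanding of the AI red teaming tool ecosystem: Garak, PyRIT, Promptfoo, and custom automation.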

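assessment, training-pipeline, data-poisoning, rlhf, evaluation

Chapter Assessment: Training Pipeline

A 15-question calibrated assessment testing your understanding of training pipeline security: data poisoning, RLHF manipulation, and architecture-level attacks.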

skill-verification, defense, evaluation

Skill Verification: Defense Effectiveness Evaluation

Practical verification of skills in evaluating guardrails, classifiers, and monitoring systems.

capstone, benchmarking, safety, evaluation, testing

Capstone: Design and Implement an AI Safety Benchmark Suite

Build a comprehensive, reproducible benchmark suite for evaluating LLM safety across multiple risk dimensions including toxicity, bias, hallucination, and adversarial robustness.

data-training, benchmarks, evaluation, gaming

Benchmark Gaming Attacks

Techniques for gaming evaluation benchmarks to make poisoned or compromised models appear safe and capable during standard safety evaluations.

defense, evaluation, methodology, testing

Defense Evaluation Methodology

Systematic methodology for evaluating the effectiveness of AI defenses against known attack categories.

evaluation, metrics, benchmarks, measurement, defense-testing

Evaluating Defense Effectiveness

Metrics, benchmarks, and methodology for measuring how well AI defenses work against real attacks, including evaluation pitfalls and best practices.

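harmbench, benchmark, evaluation, standardized

HarmBench: Standardized Red Team Evaluation

Deep dive into the HarmBench framework for standardized red team evaluation: attack methods, evaluation pipeline, running benchmarks, interpreting results, and comparing model safety across providers.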

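inspect-ai, aisi, evaluation, framework

Inspect AI: UK AISI Evaluation Framework

Deep dive into the UK AI Safety Institute's (AISI) Inspect framework: task design, solvers, scorers, building custom evaluations, and comparison to other AI evaluation frameworks.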

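promptfoo, testing, evaluation, tools

promptfoo for Red Teaming

Deep dive into promptfoo for AI red teaming: YAML configuration, assertion-based testing, red team plugins, custom evaluators, and regression testing workflows for LLM security.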

foundations, evaluation, benchmarking, metrics

Evaluation and Benchmarking Basics

Introduction to LLM security evaluation including key metrics, benchmark suites, and the challenges of measuring safety properties.

alignment-faking, training, rlhf, safety-training, evaluation, ai-safety

Training Implications of Alignment Faking

frontier-research, agent-safety, benchmarks, evaluation

LLM Agent Safety Benchmarks

Survey of agent safety benchmarks and evaluation frameworks for assessing autonomous AI system risks.

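benchmarks, evaluation, safety

AI Safety Benchmarks & Evaluation

Overview of AI safety evaluation: benchmarking frameworks, safety metrics, evaluation methodologies, and the landscape of standardized assessment tools for AI red teaming.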

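metrics, methodology, evaluation, reporting

Red Team Metrics Beyond ASR

Comprehensive metrics methodology for AI red teaming beyond Attack Success Rate: severity-weighted scoring, defense depth metrics, coverage analysis, and stakeholder-appropriate reporting frameworks.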

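statistics, methodology, evaluation, rigor

Statistical Rigor in AI Red Teaming

Statistical methodology for AI red teaming: sample size determination, confidence intervals, hypothesis testing for safety claims, handling non-determinism, and avoiding common statistical pitfalls.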

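governance, compliance, legal, ethics, frameworks, evaluation, benchmarking

Governance & Compliance

AI governance frameworks, legal and ethical considerations, evaluation and benchmarking methodologies, and compliance tools for responsible AI red teaming and deployment.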

labs, evaluation, framework-gaming, advanced

Lab: Evaluation Framework Gaming

Demonstrate how to game safety evaluation frameworks to produce artificially high safety scores while retaining vulnerabilities.

labs, harmbench, custom-attacks, evaluation, advanced

HarmBench Custom Attack Submission

Develop and evaluate custom attack methods against the HarmBench standardized evaluation framework.

lab, promptfoo, tool-setup, evaluation, beginner, hands-on

Lab: Promptfoo Setup and First Eval

Install and configure promptfoo for systematic LLM evaluation, then run your first red team evaluation to test model safety boundaries.

labs, tool-comparison, evaluation, beginner

Lab: Comparing Red Team Testing Tools

Compare Garak, PyRIT, and Promptfoo capabilities through hands-on exercises using each tool against the same target.

lab, expert, guardrails, evaluation, defense-testing, hands-on

Lab: Build Guardrail Evaluator

lab, expert, benchmark, evaluation, safety, hands-on

Lab: Create a Safety Benchmark

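lab, llm-judge, evaluation, automation

Lab: Building an LLM Judge Evaluator

Hands-on lab for building an LLM-based evaluator to score red team attack outputs, compare model vulnerability, and lay the foundation for automated attack campaigns.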

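comparison, cross-model, methodology, evaluation, red-teaming, benchmarking

Cross-Model Comparison

Methodology for systematically comparing LLM security across model families, including standardized evaluation frameworks, architectural difference analysis, and comparative testing approaches.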

multimodal, benchmarking, safety, evaluation, vlm

Benchmarking Multimodal Model Safety

reference, benchmark, comparison, evaluation

Benchmark Suite Comparison

Comparison of AI safety benchmark suites including HarmBench, JailbreakBench, and custom evaluation frameworks with coverage analysis.

training, evaluation, contamination

Evaluation Set Contamination Attacks

Attacking evaluation benchmarks and test sets to create false impressions of model safety and capability.

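llm-judge, output-validation, safety, evaluation, defense, walkthrough

LLM Judge Implementation

Step-by-step walkthrough for using an LLM to judge another LLM's outputs for safety and quality, covering judge prompt design, scoring rubrics, calibration, cost optimization, and deployment patterns.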

walkthroughs, harmbench, evaluation, benchmark

HarmBench Evaluation Walkthrough

Run standardized attack evaluations using the HarmBench framework and interpret results.

harmbench, evaluation, benchmarks, safety, red-team-automation, walkthrough

HarmBench Evaluation Framework Walkthrough

inspect-ai, aisi, evaluation, benchmarks, scorers, compliance, walkthrough

Inspect AI Walkthrough

Complete walkthrough of UK AISI's Inspect AI framework: installation, writing evaluations, running against models, custom scorers, benchmark suites, and producing compliance-ready reports.

promptfoo, evaluation, beginner, first-eval, walkthrough

Running Your First Promptfoo Evaluation

Beginner walkthrough for running your first promptfoo evaluation from scratch, covering installation, configuration, test case creation, assertion writing, and result interpretation.

promptfoo, automation, red-team, evaluation, ci-cd, walkthrough

Automating Red Team Evaluations with Promptfoo

walkthroughs, promptfoo, red-team, evaluation

Promptfoo for Red Team Evaluation

Configure Promptfoo for comprehensive red team evaluation with custom assertions and graders.

promptfoo, evaluation, red-team, ci-cd, yaml, assertions, walkthrough

Promptfoo End-to-End Walkthrough

Complete walkthrough of promptfoo for AI red teaming: configuration files, provider setup, running evaluations, red team plugins, assertion-based scoring, reporting, and CI/CD integration.

pyrit, scorers, custom-scoring, evaluation, walkthrough

Creating Custom Scorers for PyRIT Attack Evaluation

Intermediate walkthrough on building custom PyRIT scorers for evaluating attack success, including pattern-based, LLM-based, and multi-criteria scoring approaches.