# evaluation
標記為「evaluation」的 125 篇文章
Skill Verification: Defense Effectiveness Evaluation
Practical verification of skills in evaluating guardrails, classifiers, and monitoring systems.
Capstone: Design and Implement an AI Safety Benchmark Suite
Build a comprehensive, reproducible benchmark suite for evaluating LLM safety across multiple risk dimensions including toxicity, bias, hallucination, and adversarial robustness.
Community Project: Benchmark Suite
Community-developed benchmark suite for evaluating LLM security that covers injection, exfiltration, jailbreaking, and agent exploitation attack classes.
Benchmark Gaming Attacks
Techniques for gaming evaluation benchmarks to make poisoned or compromised models appear safe and capable during standard safety evaluations.
Safety Layer Benchmarking Methodology
Standardized methodology for benchmarking the effectiveness of LLM safety layers against diverse attack categories.
Defense Evaluation Methodology
Systematic methodology for evaluating the effectiveness of AI defenses against known attack categories.
Evaluating Defense Effectiveness
Metrics, benchmarks, and methodology for measuring how well AI defenses work against real attacks, including evaluation pitfalls and best practices.
HarmBench: Standardized Red Team Evaluation
Deep dive into the HarmBench framework for standardized red team evaluation: attack methods, evaluation pipeline, running benchmarks, interpreting results, and comparing model safety across providers.
Inspect AI: UK AISI Evaluation Framework
Deep dive into the UK AI Safety Institute's Inspect framework: task design, solvers, scorers, building custom evaluations, and comparison to other AI evaluation frameworks.
promptfoo for Red Teaming
Deep dive into promptfoo for AI red teaming: YAML configuration, assertion-based testing, red team plugins, custom evaluators, and regression testing workflows for LLM security.
Result Scoring Systems
Designing automated scoring systems for evaluating attack success, including semantic classifiers, rule-based detectors, and LLM-as-judge approaches.
Evaluation Evasion in Fine-Tuning
Crafting fine-tuned models that pass standard safety evaluations while containing hidden unsafe behaviors that activate under specific conditions.
Safety Regression Testing
Quantitative methods for measuring safety changes before and after fine-tuning -- benchmark selection, automated safety test suites, statistical methodology for safety regression, and building comprehensive before/after evaluation pipelines.
Evaluation and Benchmarking Basics
Introduction to LLM security evaluation including key metrics, benchmark suites, and the challenges of measuring safety properties.
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
LLM Agent Safety Benchmarks
Survey of agent safety benchmarks and evaluation frameworks for assessing autonomous AI system risks.
Building Evaluation Harnesses
Design and implement evaluation harnesses for AI red teaming: architecture patterns, judge model selection, prompt dataset management, scoring pipelines, and reproducible evaluation infrastructure.
AI Safety Benchmarks & Evaluation
Overview of AI safety evaluation: benchmarking frameworks, safety metrics, evaluation methodologies, and the landscape of standardized assessment tools for AI red teaming.
Red Team Metrics Beyond ASR
Comprehensive metrics methodology for AI red teaming beyond Attack Success Rate: severity-weighted scoring, defense depth metrics, coverage analysis, and stakeholder-appropriate reporting frameworks.
Statistical Rigor in AI Red Teaming
Statistical methodology for AI red teaming: sample size determination, confidence intervals, hypothesis testing for safety claims, handling non-determinism, and avoiding common statistical pitfalls.
Governance & Compliance
AI governance frameworks, legal and ethical considerations, evaluation and benchmarking methodologies, and compliance tools for responsible AI red teaming and deployment.
Injection Benchmark Design
Designing robust benchmarks for evaluating injection attack and defense effectiveness.
Injection Benchmarking Methodology
Standardized methodologies for benchmarking injection attacks and defenses to enable meaningful comparison across research papers and tools.
Lab: Evaluation Framework Gaming
Demonstrate how to game safety evaluation frameworks to produce artificially high safety scores while retaining vulnerabilities.
HarmBench Custom Attack Submission
Develop and evaluate custom attack methods against the HarmBench standardized evaluation framework.
Setting Up Promptfoo for LLM Evaluation
Configure Promptfoo to create automated test suites for evaluating LLM safety and robustness.
Lab: Promptfoo Setup and First Eval
Install and configure promptfoo for systematic LLM evaluation, then run your first red team evaluation to test model safety boundaries.
Lab: Comparing Red Team Testing Tools
Compare Garak, PyRIT, and Promptfoo capabilities through hands-on exercises using each tool against the same target.
Your First HarmBench Evaluation
Run a standardized safety evaluation using the HarmBench framework against a target model.
Your First Inspect AI Evaluation
Set up and run a basic AI safety evaluation using the UK AISI Inspect framework.
Lab: Build Guardrail Evaluator
Build an automated framework for evaluating AI guardrails and safety filters. Test input filters, output classifiers, content moderation systems, and defense-in-depth architectures for coverage gaps and bypass vulnerabilities.
Lab: Create a Safety Benchmark
Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.
Lab: Building an LLM Judge Evaluator
Hands-on lab for building an LLM-based evaluator to score red team attack outputs, compare model vulnerability, and lay the foundation for automated attack campaigns.
Cross-Model Comparison
Methodology for systematically comparing LLM security across model families, including standardized evaluation frameworks, architectural difference analysis, and comparative testing approaches.
Benchmarking Multimodal Model Safety
Designing and implementing safety benchmarks for multimodal AI models that process images, audio, and video alongside text, covering cross-modal attack evaluation, consistency testing, and safety score aggregation.
Benchmark Suite Comparison
Comparison of AI safety benchmark suites including HarmBench, JailbreakBench, and custom evaluation frameworks with coverage analysis.
Evaluation Benchmark Gaming
Techniques for gaming evaluation benchmarks to mask vulnerabilities or inflate safety scores.
Evaluation Set Contamination Attacks
Attacking evaluation benchmarks and test sets to create false impressions of model safety and capability.
LLM Judge Implementation
Step-by-step walkthrough for using an LLM to judge another LLM's outputs for safety and quality, covering judge prompt design, scoring rubrics, calibration, cost optimization, and deployment patterns.
HarmBench Evaluation Walkthrough
Run standardized attack evaluations using the HarmBench framework and interpret results.
HarmBench Evaluation Framework Walkthrough
Complete walkthrough of the HarmBench evaluation framework: installation, running standardized benchmarks against models, interpreting results, creating custom behavior evaluations, and comparing model safety across versions.
Inspect AI Walkthrough
Complete walkthrough of UK AISI's Inspect AI framework: installation, writing evaluations, running against models, custom scorers, benchmark suites, and producing compliance-ready reports.
Running Your First Promptfoo Evaluation
Beginner walkthrough for running your first promptfoo evaluation from scratch, covering installation, configuration, test case creation, assertion writing, and result interpretation.
Automating Red Team Evaluations with Promptfoo
Complete walkthrough for setting up automated red team evaluation pipelines using Promptfoo, covering configuration, custom evaluators, adversarial dataset generation, CI integration, and result analysis.
Promptfoo for Red Team Evaluation
Configure Promptfoo for comprehensive red team evaluation with custom assertions and graders.
Promptfoo End-to-End Walkthrough
Complete walkthrough of promptfoo for AI red teaming: configuration files, provider setup, running evaluations, red team plugins, assertion-based scoring, reporting, and CI/CD integration.
Creating Custom Scorers for PyRIT Attack Evaluation
Intermediate walkthrough on building custom PyRIT scorers for evaluating attack success, including pattern-based, LLM-based, and multi-criteria scoring approaches.
章節評量:代理利用
15 題校準評量,測試你對代理式 AI 利用的理解——工具濫用、MCP 攻擊、記憶體投毒與多代理安全。
章節評量:代理式
15 題校準評量,測試你對代理式 AI 利用的理解。
章節評量:雲端 AI
15 題校準評量,測試你對雲端 AI 平台安全的理解。
章節評量:程式碼生成安全
15 題校準評量,測試你對 AI 程式碼生成安全的理解——建議投毒、訓練資料萃取與 IDE 風險。
章節評量:防禦基礎
15 題校準評量,測試你對 AI 防禦機制基礎的理解。
章節評量:防禦
15 題校準評量,測試你對AI 防禦與緩解策略的理解。
章節評量:嵌入向量
15 題校準評量,測試你對嵌入向量與向量安全的理解。
章節評量:微調安全
15 題校準評量,測試你對微調安全的理解——對齊侵蝕、後門植入與 LoRA 適配器風險。
章節評量:AI 鑑識
15 題校準評量,測試你對 AI 鑑識與事件應變的理解——證據收集、日誌分析與模型行為調查。
章節評量:基礎
15 題校準評量,測試你對AI 安全基礎知識的理解。
章節評量:治理
15 題校準評量,測試你對AI 治理與合規的理解。
章節評量:護欄
15 題校準評量,測試你對護欄架構與安全層實作的理解。
章節評量:影響
15 題校準評量,測試你對 AI 攻擊之真實世界影響類別的理解。
章節評量:基礎設施
15 題校準評量,測試你對 AI 基礎設施安全的理解——供應鏈、API 安全、雲端部署與模型服務。
章節評量:越獄
15 題校準評量,測試你對越獄技術的理解——角色扮演、編碼、多範例、漸進與基於梯度的方法。
章節評量:法律與倫理
15 題校準評量,測試你對 AI 紅隊演練中法律與倫理考量的理解——授權、揭露、國際法與合規。
章節評量:LLMOps 安全
15 題校準評量,測試你對 LLMOps 安全的理解——模型服務、推論安全、快取風險與 ML 管線安全。
章節評量:MCP 安全
15 題校準評量,測試你對模型上下文協議安全的理解——工具遮蔽、傳輸攻擊、伺服器審查與設定安全。
章節評量:方法論
15 題校準評量,測試你對 AI 紅隊案件方法論的理解——範圍界定、偵察、執行與報告。
章節評量:模型萃取
15 題校準評量,測試你對模型萃取與智慧財產盜竊技術的理解。
章節評量:監控
15 題校準評量,測試你對 AI 系統監控與可觀測性的理解——異常偵測、行為基準與安全事件關聯。
章節評量:多模態安全
15 題校準評量,測試你對多模態 AI 安全的理解——視覺注入、音訊攻擊、跨模態利用。
章節評量:隱私攻擊
15 題校準評量,測試你對 AI 系統中隱私攻擊的理解——PII 萃取、成員推論與模型反演。
章節評量:專業實務
15 題校準評量,測試你對專業紅隊營運與報告的理解。
章節評量:提示詞注入
15 題校準評量,測試你對提示詞注入與越獄技術的理解。
章節評量:RAG 攻擊
15 題校準評量,測試你對 RAG 管線攻擊的理解——知識庫投毒、檢索操控與資料萃取。
章節評量:偵察
15 題校準評量,測試你對 AI 紅隊偵察技術的理解——系統提示詞萃取、能力描繪與目標剖析。
章節評量:工具熟練度
15 題校準評量,測試你對 AI 紅隊演練工具生態系的理解——Garak、PyRIT、Promptfoo 與自訂自動化。
章節評量:訓練管線
15 題校準評量,測試你對訓練管線安全的理解——資料投毒、RLHF 操控與架構層級攻擊。
Skill Verification: 防禦 Effectiveness Evaluation
Practical verification of skills in evaluating guardrails, classifiers, and monitoring systems.
Capstone: Design and Implement an AI Safety Benchmark Suite
Build a comprehensive, reproducible benchmark suite for evaluating LLM safety across multiple risk dimensions including toxicity, bias, hallucination, and adversarial robustness.
Community Project: Benchmark Suite
Community-developed benchmark suite for evaluating LLM security that covers injection, exfiltration, jailbreaking, and agent exploitation attack classes.
Benchmark Gaming 攻擊s
Techniques for gaming evaluation benchmarks to make poisoned or compromised models appear safe and capable during standard safety evaluations.
Safety Layer Benchmarking Methodology
Standardized methodology for benchmarking the effectiveness of LLM safety layers against diverse attack categories.
防禦 Evaluation Methodology
Systematic methodology for evaluating the effectiveness of AI defenses against known attack categories.
Evaluating 防禦 Effectiveness
Metrics, benchmarks, and methodology for measuring how well AI defenses work against real attacks, including evaluation pitfalls and best practices.
HarmBench:標準化紅隊評估
深入解析 HarmBench 框架之標準化紅隊評估:攻擊方法、評估管線、執行基準、結果解讀,以及跨供應商的模型安全比較。
Inspect AI:UK AISI 評估框架
深入解析英國 AI 安全研究院(AISI)的 Inspect 框架:任務設計、solver、scorer,打造自製評估,以及與其他 AI 評估框架的比較。
promptfoo 用於紅隊
promptfoo 用於 AI 紅隊的深入解析:YAML 組態、以斷言為基礎的測試、紅隊外掛、自製評估器,以及 LLM 安全的回歸測試流程。
Result Scoring Systems
Designing automated scoring systems for evaluating attack success, including semantic classifiers, rule-based detectors, and LLM-as-judge approaches.
Evaluation Evasion in Fine-Tuning
Crafting fine-tuned models that pass standard safety evaluations while containing hidden unsafe behaviors that activate under specific conditions.
Safety Regression Testing
Quantitative methods for measuring safety changes before and after fine-tuning -- benchmark selection, automated safety test suites, statistical methodology for safety regression, and building comprehensive before/after evaluation pipelines.
Evaluation and Benchmarking Basics
介紹 to LLM security evaluation including key metrics, benchmark suites, and the challenges of measuring safety properties.
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
訓練 Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
LLM 代理 Safety Benchmarks
Survey of agent safety benchmarks and evaluation frameworks for assessing autonomous AI system risks.
Building Evaluation Harnesses
Design and implement evaluation harnesses for AI red teaming: architecture patterns, judge model selection, prompt dataset management, scoring pipelines, and reproducible evaluation infrastructure.
AI 安全基準與評估
AI 安全評估之概觀:基準框架、安全指標、評估方法論,以及供 AI 紅隊使用之標準化評估工具景觀。
超越 ASR 之紅隊指標
AI 紅隊之完整指標方法論,超越攻擊成功率:嚴重性加權評分、防禦深度指標、覆蓋分析,與適合利害關係人之報告框架。
AI 紅隊中的統計嚴謹度
AI 紅隊的統計方法論:樣本數決定、信賴區間、對安全聲明之假設檢定、處理非確定性,以及避免常見統計陷阱。
治理與合規
負責任 AI 紅隊演練與部署的 AI 治理框架、法律與倫理考量、評估與基準測試方法論,以及合規工具。
Injection Benchmark Design
Designing robust benchmarks for evaluating injection attack and defense effectiveness.
Injection Benchmarking Methodology
Standardized methodologies for benchmarking injection attacks and defenses to enable meaningful comparison across research papers and tools.
實驗室: Evaluation Framework Gaming
Demonstrate how to game safety evaluation frameworks to produce artificially high safety scores while retaining vulnerabilities.
HarmBench Custom 攻擊 Submission
Develop and evaluate custom attack methods against the HarmBench standardized evaluation framework.
Setting Up Promptfoo for LLM Evaluation
Configure Promptfoo to create automated test suites for evaluating LLM safety and robustness.
實驗室: Promptfoo Setup and First Eval
Install and configure promptfoo for systematic LLM evaluation, then run your first red team evaluation to test model safety boundaries.
實驗室: Comparing 紅隊 Testing 工具s
Compare Garak, PyRIT, and Promptfoo capabilities through hands-on exercises using each tool against the same target.
Your First HarmBench Evaluation
Run a standardized safety evaluation using the HarmBench framework against a target model.
Your First Inspect AI Evaluation
Set up and run a basic AI safety evaluation using the UK AISI Inspect framework.
實驗室: Build Guardrail Evaluator
Build an automated framework for evaluating AI guardrails and safety filters. Test input filters, output classifiers, content moderation systems, and defense-in-depth architectures for coverage gaps and bypass vulnerabilities.
實驗室: Create a Safety Benchmark
Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.
實驗:建立 LLM 裁判評估器
為建立 LLM 基評估器以對紅隊攻擊輸出評分、比較模型脆弱度並為自動化攻擊活動奠基之實作實驗。
跨模型比較
系統性比較 LLM 安全性的方法論,跨模型家族進行,內容涵蓋標準化評估框架、架構差異分析與比較測試方法。
Benchmarking Multimodal 模型 Safety
Designing and implementing safety benchmarks for multimodal AI models that process images, audio, and video alongside text, covering cross-modal attack evaluation, consistency testing, and safety score aggregation.
Benchmark Suite Comparison
Comparison of AI safety benchmark suites including HarmBench, 越獄Bench, and custom evaluation frameworks with coverage analysis.
Evaluation Benchmark Gaming
Techniques for gaming evaluation benchmarks to mask vulnerabilities or inflate safety scores.
Evaluation Set Contamination 攻擊s
攻擊ing evaluation benchmarks and test sets to create false impressions of model safety and capability.
LLM Judge 實作
使用 LLM 判斷另一 LLM 輸出之安全與品質之逐步演練,涵蓋 judge 提示設計、評分準則、校準、成本最佳化與部署模式。
HarmBench Evaluation 導覽
Run standardized attack evaluations using the HarmBench framework and interpret results.
HarmBench Evaluation Framework 導覽
Complete walkthrough of the HarmBench evaluation framework: installation, running standardized benchmarks against models, interpreting results, creating custom behavior evaluations, and comparing model safety across versions.
Inspect AI 導覽
Complete walkthrough of UK AISI's Inspect AI framework: installation, writing evaluations, running against models, custom scorers, benchmark suites, and producing compliance-ready reports.
Running Your First Promptfoo Evaluation
初階 walkthrough for running your first promptfoo evaluation from scratch, covering installation, configuration, test case creation, assertion writing, and result interpretation.
Automating 紅隊 Evaluations with Promptfoo
Complete walkthrough for setting up automated red team evaluation pipelines using Promptfoo, covering configuration, custom evaluators, adversarial dataset generation, CI integration, and result analysis.
Promptfoo for 紅隊 Evaluation
Configure Promptfoo for comprehensive red team evaluation with custom assertions and graders.
Promptfoo End-to-End 導覽
Complete walkthrough of promptfoo for AI red teaming: configuration files, provider setup, running evaluations, red team plugins, assertion-based scoring, reporting, and CI/CD integration.
Creating Custom Scorers for PyRIT 攻擊 Evaluation
中階 walkthrough on building custom PyRIT scorers for evaluating attack success, including pattern-based, LLM-based, and multi-criteria scoring approaches.