# benchmark
8 articles tagged "benchmark"
**Community Project: Benchmark Suite**
A community-developed benchmark suite for evaluating LLM security, covering injection, exfiltration, jailbreaking, and agent exploitation attack classes.

**Monthly Competition: Model Breaker**
Monthly competitions focused on discovering novel jailbreak techniques against updated model versions, with community-validated scoring.

**HarmBench: Standardized Red Team Evaluation**
A deep dive into the HarmBench framework for standardized red team evaluation: attack methods, the evaluation pipeline, running benchmarks, interpreting results, and comparing model safety across providers.

**Lab: Create a Safety Benchmark**
Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.

**Benchmark Suite Comparison**
A comparison of AI safety benchmark suites, including HarmBench, JailbreakBench, and custom evaluation frameworks, with coverage analysis.

**Security Benchmark Runner Development**
Build a benchmark runner for standardized evaluation of LLM security across models and configurations.

**HarmBench Evaluation Walkthrough**
Run standardized attack evaluations using the HarmBench framework and interpret results.

**JailbreakBench Usage and Submission**
Use JailbreakBench to evaluate jailbreak techniques and submit results to the benchmark.