# benchmark
8 articlestagged with “benchmark”
Community Project: Benchmark Suite
Community-developed benchmark suite for evaluating LLM security that covers injection, exfiltration, jailbreaking, and agent exploitation attack classes.
Monthly Competition: Model Breaker
Monthly competitions focused on discovering novel jailbreak techniques against updated model versions, with community-validated scoring.
HarmBench: Standardized Red Team Evaluation
Deep dive into the HarmBench framework for standardized red team evaluation: attack methods, evaluation pipeline, running benchmarks, interpreting results, and comparing model safety across providers.
Lab: Create a Safety Benchmark
Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.
Benchmark Suite Comparison
Comparison of AI safety benchmark suites including HarmBench, JailbreakBench, and custom evaluation frameworks with coverage analysis.
Security Benchmark Runner Development
Build a benchmark runner for standardized evaluation of LLM security across models and configurations.
HarmBench Evaluation Walkthrough
Run standardized attack evaluations using the HarmBench framework and interpret results.
JailbreakBench Usage and Submission
Use JailbreakBench to evaluate jailbreak techniques and submit results to the benchmark.