# evaluation
48 articlestagged with “evaluation”
Skill Verification: Defense Effectiveness Evaluation
Practical verification of skills in evaluating guardrails, classifiers, and monitoring systems.
Capstone: Design and Implement an AI Safety Benchmark Suite
Build a comprehensive, reproducible benchmark suite for evaluating LLM safety across multiple risk dimensions including toxicity, bias, hallucination, and adversarial robustness.
Community Project: Benchmark Suite
Community-developed benchmark suite for evaluating LLM security that covers injection, exfiltration, jailbreaking, and agent exploitation attack classes.
Benchmark Gaming Attacks
Techniques for gaming evaluation benchmarks to make poisoned or compromised models appear safe and capable during standard safety evaluations.
Safety Layer Benchmarking Methodology
Standardized methodology for benchmarking the effectiveness of LLM safety layers against diverse attack categories.
Defense Evaluation Methodology
Systematic methodology for evaluating the effectiveness of AI defenses against known attack categories.
Evaluating Defense Effectiveness
Metrics, benchmarks, and methodology for measuring how well AI defenses work against real attacks, including evaluation pitfalls and best practices.
HarmBench: Standardized Red Team Evaluation
Deep dive into the HarmBench framework for standardized red team evaluation: attack methods, evaluation pipeline, running benchmarks, interpreting results, and comparing model safety across providers.
Inspect AI: UK AISI Evaluation Framework
Deep dive into the UK AI Safety Institute's Inspect framework: task design, solvers, scorers, building custom evaluations, and comparison to other AI evaluation frameworks.
promptfoo for Red Teaming
Deep dive into promptfoo for AI red teaming: YAML configuration, assertion-based testing, red team plugins, custom evaluators, and regression testing workflows for LLM security.
Result Scoring Systems
Designing automated scoring systems for evaluating attack success, including semantic classifiers, rule-based detectors, and LLM-as-judge approaches.
Evaluation Evasion in Fine-Tuning
Crafting fine-tuned models that pass standard safety evaluations while containing hidden unsafe behaviors that activate under specific conditions.
Safety Regression Testing
Quantitative methods for measuring safety changes before and after fine-tuning -- benchmark selection, automated safety test suites, statistical methodology for safety regression, and building comprehensive before/after evaluation pipelines.
Evaluation and Benchmarking Basics
Introduction to LLM security evaluation including key metrics, benchmark suites, and the challenges of measuring safety properties.
Alignment Faking Detection Methods
Methods for detecting alignment faking in AI models, including behavioral consistency testing, interpretability-based detection, statistical anomaly detection, and tripwire mechanisms for identifying models that strategically comply during evaluation.
Training Implications of Alignment Faking
How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.
LLM Agent Safety Benchmarks
Survey of agent safety benchmarks and evaluation frameworks for assessing autonomous AI system risks.
Building Evaluation Harnesses
Design and implement evaluation harnesses for AI red teaming: architecture patterns, judge model selection, prompt dataset management, scoring pipelines, and reproducible evaluation infrastructure.
AI Safety Benchmarks & Evaluation
Overview of AI safety evaluation: benchmarking frameworks, safety metrics, evaluation methodologies, and the landscape of standardized assessment tools for AI red teaming.
Red Team Metrics Beyond ASR
Comprehensive metrics methodology for AI red teaming beyond Attack Success Rate: severity-weighted scoring, defense depth metrics, coverage analysis, and stakeholder-appropriate reporting frameworks.
Statistical Rigor in AI Red Teaming
Statistical methodology for AI red teaming: sample size determination, confidence intervals, hypothesis testing for safety claims, handling non-determinism, and avoiding common statistical pitfalls.
Governance & Compliance
AI governance frameworks, legal and ethical considerations, evaluation and benchmarking methodologies, and compliance tools for responsible AI red teaming and deployment.
Injection Benchmark Design
Designing robust benchmarks for evaluating injection attack and defense effectiveness.
Injection Benchmarking Methodology
Standardized methodologies for benchmarking injection attacks and defenses to enable meaningful comparison across research papers and tools.
Lab: Evaluation Framework Gaming
Demonstrate how to game safety evaluation frameworks to produce artificially high safety scores while retaining vulnerabilities.
HarmBench Custom Attack Submission
Develop and evaluate custom attack methods against the HarmBench standardized evaluation framework.
Setting Up Promptfoo for LLM Evaluation
Configure Promptfoo to create automated test suites for evaluating LLM safety and robustness.
Lab: Promptfoo Setup and First Eval
Install and configure promptfoo for systematic LLM evaluation, then run your first red team evaluation to test model safety boundaries.
Lab: Comparing Red Team Testing Tools
Compare Garak, PyRIT, and Promptfoo capabilities through hands-on exercises using each tool against the same target.
Your First HarmBench Evaluation
Run a standardized safety evaluation using the HarmBench framework against a target model.
Your First Inspect AI Evaluation
Set up and run a basic AI safety evaluation using the UK AISI Inspect framework.
Lab: Build Guardrail Evaluator
Build an automated framework for evaluating AI guardrails and safety filters. Test input filters, output classifiers, content moderation systems, and defense-in-depth architectures for coverage gaps and bypass vulnerabilities.
Lab: Create a Safety Benchmark
Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.
Lab: Building an LLM Judge Evaluator
Hands-on lab for building an LLM-based evaluator to score red team attack outputs, compare model vulnerability, and lay the foundation for automated attack campaigns.
Cross-Model Comparison
Methodology for systematically comparing LLM security across model families, including standardized evaluation frameworks, architectural difference analysis, and comparative testing approaches.
Benchmarking Multimodal Model Safety
Designing and implementing safety benchmarks for multimodal AI models that process images, audio, and video alongside text, covering cross-modal attack evaluation, consistency testing, and safety score aggregation.
Benchmark Suite Comparison
Comparison of AI safety benchmark suites including HarmBench, JailbreakBench, and custom evaluation frameworks with coverage analysis.
Evaluation Benchmark Gaming
Techniques for gaming evaluation benchmarks to mask vulnerabilities or inflate safety scores.
Evaluation Set Contamination Attacks
Attacking evaluation benchmarks and test sets to create false impressions of model safety and capability.
LLM Judge Implementation
Step-by-step walkthrough for using an LLM to judge another LLM's outputs for safety and quality, covering judge prompt design, scoring rubrics, calibration, cost optimization, and deployment patterns.
HarmBench Evaluation Walkthrough
Run standardized attack evaluations using the HarmBench framework and interpret results.
HarmBench Evaluation Framework Walkthrough
Complete walkthrough of the HarmBench evaluation framework: installation, running standardized benchmarks against models, interpreting results, creating custom behavior evaluations, and comparing model safety across versions.
Inspect AI Walkthrough
Complete walkthrough of UK AISI's Inspect AI framework: installation, writing evaluations, running against models, custom scorers, benchmark suites, and producing compliance-ready reports.
Running Your First Promptfoo Evaluation
Beginner walkthrough for running your first promptfoo evaluation from scratch, covering installation, configuration, test case creation, assertion writing, and result interpretation.
Automating Red Team Evaluations with Promptfoo
Complete walkthrough for setting up automated red team evaluation pipelines using Promptfoo, covering configuration, custom evaluators, adversarial dataset generation, CI integration, and result analysis.
Promptfoo for Red Team Evaluation
Configure Promptfoo for comprehensive red team evaluation with custom assertions and graders.
Promptfoo End-to-End Walkthrough
Complete walkthrough of promptfoo for AI red teaming: configuration files, provider setup, running evaluations, red team plugins, assertion-based scoring, reporting, and CI/CD integration.
Creating Custom Scorers for PyRIT Attack Evaluation
Intermediate walkthrough on building custom PyRIT scorers for evaluating attack success, including pattern-based, LLM-based, and multi-criteria scoring approaches.