HELM Safety

Stanford CRFM. Comprehensive: toxicity, bias, disinformation, extraction risk. Standard for academic reporting.

Advertisement

JailbreakBench

Chao et al 2024. Standard jailbreak + defense evaluation. Comparable numbers across research.

Advertisement

WMDP

Weapons of Mass Destruction Proxy. Measures dangerous knowledge (bio, chem, cyber). Anthropic + Sciences group. Used for gating capability release.

XSTest

Overrefusal benchmark. Legitimate ambiguous queries. Measures whether safety training over-blocks.