HELM Safety
Stanford CRFM. Comprehensive: toxicity, bias, disinformation, extraction risk. Standard for academic reporting.
Advertisement
JailbreakBench
Chao et al 2024. Standard jailbreak + defense evaluation. Comparable numbers across research.
Advertisement
WMDP
Weapons of Mass Destruction Proxy. Measures dangerous knowledge (bio, chem, cyber). Anthropic + Sciences group. Used for gating capability release.
XSTest
Overrefusal benchmark. Legitimate ambiguous queries. Measures whether safety training over-blocks.