Attack corpus
Known jailbreaks. GCG adversarial suffixes. Novel attacks generated with PAIR. Domain-specific concerns.
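A minimal sketch of how such a corpus might be stored, assuming one record per attack tagged with one of the four categories above. The `Attack` class, its field names, the category labels, and the IDs are all hypothetical, and the prompts are placeholders.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Attack:
    """One corpus entry. All field names here are hypothetical."""
    attack_id: str
    category: str  # "known_jailbreak", "gcg_suffix", "pair", or "domain"
    prompt: str
    source: str    # where the attack came from


# Placeholder entries; real prompts are deliberately left out.
CORPUS = [
    Attack("kj-001", "known_jailbreak", "<prompt elided>", "public list"),
    Attack("gcg-001", "gcg_suffix", "<prompt elided>", "GCG run"),
    Attack("pair-001", "pair", "<prompt elided>", "PAIR run"),
    Attack("dom-001", "domain", "<prompt elided>", "internal review"),
]


def count_by_category(corpus):
    """Return how many attacks each category holds."""
    return dict(Counter(attack.category for attack in corpus))
```

Tagging each entry with a category up front is what makes per-category scoring possible later.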
Scoring
Attack Success Rate (ASR) per category. Compare against the previous release. Fail the deploy if ASR increases.
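The gate could look roughly like this. It is a sketch under assumptions: results arrive as (category, attack_succeeded) pairs, a category missing from the previous release is compared against an ASR of 0.0, and the function names and the `tolerance` parameter are hypothetical.

```python
from collections import defaultdict


def asr_per_category(results):
    """Compute ASR per category from (category, attack_succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for category, succeeded in results:
        totals[category] += 1
        if succeeded:
            successes[category] += 1
    return {c: successes[c] / totals[c] for c in totals}


def regressions(current, previous, tolerance=0.0):
    """Map each regressed category to (previous ASR, current ASR).

    Assumption: a category absent from `previous` is compared against 0.0.
    """
    return {
        c: (previous.get(c, 0.0), asr)
        for c, asr in current.items()
        if asr > previous.get(c, 0.0) + tolerance
    }


def gate(current, previous, tolerance=0.0):
    """Return a process exit code: 1 blocks the deploy, 0 allows it."""
    return 1 if regressions(current, previous, tolerance) else 0
```

In a CI job the result would typically be passed to `sys.exit`, so a nonzero code stops the pipeline. A small nonzero `tolerance` may help if the model's outputs are sampled and ASR is noisy between runs.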
Dev feedback
Failing cases (attacks that got through), with details, go back to devs. Fix or explicitly acknowledge the risk.
Evolution
Add new attacks as they are discovered. Publications, incidents, and red team findings all feed the corpus.
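One way to keep the corpus from filling with duplicates as new sources feed in, sketched under assumptions: attacks are plain dicts with a `prompt` key, and duplicates are detected by exact match after lowercasing and whitespace normalization. The function names are hypothetical, and paraphrased attacks would not be caught by this check.

```python
import hashlib


def fingerprint(prompt):
    """Hash the lowercased, whitespace-normalized prompt (exact-match dedup only)."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def add_attacks(corpus, candidates):
    """Append candidates not already in the corpus; return the ones added."""
    seen = {fingerprint(attack["prompt"]) for attack in corpus}
    added = []
    for attack in candidates:
        fp = fingerprint(attack["prompt"])
        if fp not in seen:
            seen.add(fp)
            corpus.append(attack)
            added.append(attack)
    return added
```

Keeping a `source` field on each entry (publication, incident, red team) makes it easy to see later where coverage came from.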