Attack corpus

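The corpus combines four sources: known jailbreaks, GCG (Greedy Coordinate Gradient) adversarial suffixes, novel attacks generated with PAIR (Prompt Automatic Iterative Refinement), and attacks that target domain-specific concerns.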
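
One way to keep these sources comparable is to store every attack as a record tagged with its category. The sketch below is only an illustration: the `Attack` fields, the category names, and the placeholder prompts are assumptions, not something the original specifies.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Attack:
    attack_id: str
    category: str  # hypothetical labels, one per corpus source
    prompt: str    # placeholder text here; real entries hold the attack prompt
    source: str    # where the attack came from


# Illustrative entries only, one per source named in the text.
CORPUS = [
    Attack("jb-001", "known_jailbreak", "<prompt>", "public collection"),
    Attack("gcg-001", "gcg_suffix", "<prompt + suffix>", "GCG run"),
    Attack("pair-001", "pair_generated", "<prompt>", "PAIR run"),
    Attack("dom-001", "domain_specific", "<prompt>", "internal review"),
]


def category_counts(corpus):
    """Count attacks per category so coverage gaps are easy to spot."""
    return Counter(attack.category for attack in corpus)
```

Tagging by category up front is what makes per-category scoring possible later.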

Scoring

Compute the Attack Success Rate (ASR) for each category, compare it with the previous release, and fail the deploy if ASR increases.
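
A minimal sketch of this gate, assuming results arrive as `(category, attack_succeeded)` pairs and the previous release's ASR is available as a dict. The function names, the `tolerance` parameter, and the choice to treat a category with no baseline as having a previous ASR of 0.0 are all assumptions made for illustration.

```python
from collections import defaultdict


def asr_by_category(results):
    """results: iterable of (category, attack_succeeded) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for category, succeeded in results:
        totals[category] += 1
        successes[category] += int(succeeded)
    # ASR = successful attacks / attempted attacks, per category.
    return {c: successes[c] / totals[c] for c in totals}


def regressions(current, previous, tolerance=0.0):
    """Return {category: (previous_asr, current_asr)} where ASR went up.

    Assumption: a category missing from `previous` has a baseline of 0.0.
    """
    return {
        c: (previous.get(c, 0.0), asr)
        for c, asr in current.items()
        if asr > previous.get(c, 0.0) + tolerance
    }


def gate_exit_code(current, previous):
    """Non-zero exit code fails the deploy step in most CI systems."""
    return 1 if regressions(current, previous) else 0
```

A CI job could call `sys.exit(gate_exit_code(current, previous))` after the run. Comparing per category rather than on one aggregate number keeps an improvement in one category from masking a regression in another.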

Dev feedback

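Failing cases (attacks that succeeded against the model) are sent to developers with full details. Each one is either fixed or explicitly acknowledged as an accepted risk.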
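
The fix-or-acknowledge rule can be sketched as a small finding record with an explicit resolution state. Everything here is hypothetical: the `Finding` fields, the `Resolution` values, and the requirement that an accepted risk carries a written note are assumptions layered on the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Resolution(Enum):
    OPEN = "open"
    FIXED = "fixed"
    RISK_ACCEPTED = "risk_accepted"


@dataclass
class Finding:
    attack_id: str
    category: str
    prompt: str
    model_response: str
    resolution: Resolution = Resolution.OPEN
    note: str = ""


def resolve(finding, resolution, note=""):
    """Mark a finding as fixed or as an acknowledged risk."""
    # Assumption: accepting a risk must leave a written justification.
    if resolution is Resolution.RISK_ACCEPTED and not note.strip():
        raise ValueError("accepting a risk requires a justification")
    finding.resolution = resolution
    finding.note = note


def unresolved(findings):
    """Findings that still need a decision from developers."""
    return [f for f in findings if f.resolution is Resolution.OPEN]
```

Keeping an explicit open state makes it visible when a finding has been neither fixed nor acknowledged.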

Evolution

Add new attacks as they are discovered. Publications, incidents, and red-team findings all feed the corpus.
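
Because the same attack can arrive from several of these channels, a merge step that skips duplicates helps keep ASR from being skewed by repeated entries. The sketch below assumes attacks are dicts with a `prompt` key and uses a whitespace- and case-normalized SHA-256 fingerprint; both choices are illustrative, not part of the original process.

```python
import hashlib


def fingerprint(prompt):
    """Hash a prompt after normalizing case and whitespace."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def add_attacks(corpus, new_attacks):
    """Merge new attacks into corpus (dict: fingerprint -> attack).

    Returns the number of attacks actually added.
    """
    added = 0
    for attack in new_attacks:
        key = fingerprint(attack["prompt"])
        if key not in corpus:
            corpus[key] = attack
            added += 1
    return added
```

Exact-match fingerprints only catch trivial duplicates; paraphrased variants of one attack would still be stored as separate entries.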