Benchmarks
Injection Bench, AdvBench, HarmBench. Each is a set of known-harmful requests, plus jailbreak variants of them. Measure the refusal rate: the fraction of prompts the model declines to answer.
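A minimal sketch of how a refusal rate could be computed. The keyword list and the `model` callable are assumptions made for illustration; they are not taken from any of the benchmarks above, and keyword matching is a crude stand-in for a trained classifier or an LLM judge.

```python
from typing import Callable, Iterable

# Hypothetical refusal markers (an assumption for this sketch, not a benchmark's list).
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(prompts: Iterable[str], model: Callable[[str], str]) -> float:
    """Fraction of prompts for which `model` produces a refusal."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    refused = sum(is_refusal(model(p)) for p in prompts)
    return refused / len(prompts)
```

Here `model` is any function from prompt text to response text, so the same harness can wrap an API client or a local model.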
PAIR
Prompt Automatic Iterative Refinement (Chao et al.). An attacker LLM attacks a target LLM. The attacker iteratively refines its prompt based on the target's responses. Automates the work of a human red team.
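The refinement loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `attacker`, `target`, and `judge` are hypothetical callables standing in for three LLM calls, and the integer score with a success threshold of 10 is an assumed convention.

```python
from typing import Callable, List, Optional, Tuple

# (prompt, response, score) for each earlier attempt
History = List[Tuple[str, str, int]]


def pair_attack(
    goal: str,
    attacker: Callable[[str, History], str],  # hypothetical: proposes the next prompt
    target: Callable[[str], str],             # hypothetical: the model under test
    judge: Callable[[str, str, str], int],    # hypothetical: scores a response
    max_iters: int = 5,
    threshold: int = 10,
) -> Optional[Tuple[str, int]]:
    """Return (successful prompt, target queries used), or None if the budget runs out."""
    history: History = []
    for _ in range(max_iters):
        # The attacker sees all earlier attempts and their scores, then refines.
        prompt = attacker(goal, history)
        response = target(prompt)
        score = judge(goal, prompt, response)
        if score >= threshold:
            return prompt, len(history) + 1
        history.append((prompt, response, score))
    return None
```

Passing the full history to the attacker is what makes each step a refinement rather than an independent retry.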
TAP
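Tree of Attacks with Pruning. Multi-branch adversarial search: several prompt refinements are explored in parallel and weak branches are pruned. Reported to reach a higher success rate than PAIR.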
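A rough sketch of a branch-and-prune search in this spirit. All callables (`branch`, `on_topic`, `target`, `judge`) and the default depth, branching factor, and width are assumptions chosen for illustration; the published method may differ in its details.

```python
from typing import Callable, List, Optional, Tuple


def tap_attack(
    goal: str,
    branch: Callable[[str, int], List[str]],  # hypothetical: n refinements of a prompt
    on_topic: Callable[[str, str], bool],     # hypothetical: is the prompt still about the goal?
    target: Callable[[str], str],             # hypothetical: the model under test
    judge: Callable[[str, str, str], int],    # hypothetical: scores a response
    depth: int = 3,
    branching: int = 2,
    width: int = 3,
    threshold: int = 10,
) -> Optional[Tuple[str, int]]:
    """Return (successful prompt, target queries used), or None if the search fails."""
    frontier = [goal]  # the root of the tree is the raw goal
    queries = 0
    for _ in range(depth):
        # Expand every surviving prompt into several children.
        candidates = [c for p in frontier for c in branch(p, branching)]
        # First pruning step: drop off-topic children before spending target queries.
        candidates = [c for c in candidates if on_topic(goal, c)]
        scored = []
        for c in candidates:
            response = target(c)
            queries += 1
            score = judge(goal, c, response)
            if score >= threshold:
                return c, queries
            scored.append((score, c))
        if not scored:
            return None
        # Second pruning step: keep only the best `width` prompts for the next level.
        scored.sort(key=lambda sc: sc[0], reverse=True)
        frontier = [c for _, c in scored[:width]]
    return None
```

Pruning off-topic prompts before querying the target is what keeps the query count down compared with naively expanding every branch.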
Metrics
Attack Success Rate (ASR): fraction of attack attempts that elicit the harmful behavior. Number of queries needed to jailbreak. Transferability of attacks across models.
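These three metrics could be computed along the following lines. The `AttackResult` record and the function names are hypothetical, and "transferability" is taken here to mean the share of prompts that succeed on a source model and also succeed on a second model, which is one of several reasonable definitions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class AttackResult:
    success: bool  # did the attack elicit the harmful behavior?
    queries: int   # target queries spent on this attempt


def attack_success_rate(results: Sequence[AttackResult]) -> float:
    """ASR: successful attempts divided by all attempts."""
    return sum(r.success for r in results) / len(results)


def mean_queries_to_jailbreak(results: Sequence[AttackResult]) -> Optional[float]:
    """Average query count over successful attempts only; None if there were none."""
    counts = [r.queries for r in results if r.success]
    return mean(counts) if counts else None


def transfer_rate(source_success: Sequence[bool], other_success: Sequence[bool]) -> float:
    """Of the prompts that worked on the source model, the fraction that also work on another."""
    carried = [o for s, o in zip(source_success, other_success) if s]
    return sum(carried) / len(carried) if carried else 0.0
```

Averaging queries over successes only avoids letting the budget cap of failed attempts dominate the number; reporting ASR alongside it keeps the two effects separate.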