Benchmarks

Injection Bench. AdvBench. HarmBench. Prompts with known-harmful requests + jailbreaks. Measure refusal rate.

Advertisement

PAIR

Chao et al: adversarial LLM attacks target. Attacker LLM iteratively refines prompt. Automates human red team.

Advertisement

TAP

Tree of Attacks with Pruning. Multi-branch adversarial search. Higher success rate than PAIR.

Metrics

Attack Success Rate (ASR). Number of queries to jailbreak. Transferability across models.