Prompt Injection Evaluation — TAP + PAIR

Benchmarks

Injection Bench. AdvBench. HarmBench. Prompts with known-harmful requests + jailbreaks. Measure refusal rate.
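As a rough illustration of "measure refusal rate", the sketch below runs a set of benchmark prompts through a model and counts refusals. The names `REFUSAL_MARKERS`, `is_refusal`, and `refusal_rate` are hypothetical, and the keyword match is an assumption made for brevity; the benchmarks above may rely on a judge model or classifier instead.

```python
from typing import Callable, Iterable

# Hypothetical refusal markers (an assumption for this sketch, not a benchmark standard).
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(prompts: Iterable[str], model: Callable[[str], str]) -> float:
    """Fraction of prompts for which the model's response looks like a refusal."""
    responses = [model(p) for p in prompts]
    if not responses:
        raise ValueError("no prompts given")
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Here `model` is any callable mapping a prompt string to a response string, so a stub can stand in for a real API client during testing.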

PAIR (Prompt Automatic Iterative Refinement), Chao et al.: an attacker LLM attacks a target LLM. The attacker iteratively refines its prompt based on the target's responses. Automates human red teaming.
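The attacker/target loop can be sketched roughly as follows. This is a single-stream simplification under stated assumptions, not the authors' implementation: `pair_attack` and the `attacker`, `target`, and `judge` callables are hypothetical stand-ins for LLM calls, and the judge is reduced to a yes/no decision.

```python
from typing import Callable, Optional, Tuple


def pair_attack(
    goal: str,
    attacker: Callable[[str, str, str], str],  # (goal, last_prompt, last_response) -> new prompt
    target: Callable[[str], str],              # prompt -> response
    judge: Callable[[str, str], bool],         # (goal, response) -> jailbroken?
    max_queries: int = 20,
) -> Tuple[Optional[str], int]:
    """Refine a prompt until the judge reports success or the query budget runs out.

    Returns (successful_prompt or None, number of target queries used).
    """
    prompt, response = goal, ""
    for queries in range(1, max_queries + 1):
        # The attacker sees the previous attempt and the target's reply to it.
        prompt = attacker(goal, prompt, response)
        response = target(prompt)
        if judge(goal, response):
            return prompt, queries
    return None, max_queries
```

The returned query count is what a "queries to jailbreak" metric would record for this goal.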

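TAP (Tree of Attacks with Pruning). Multi-branch adversarial search: the attacker explores several prompt refinements per step and prunes unpromising branches. Higher reported success rate than PAIR.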
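A minimal sketch of a branch-and-prune search in this spirit is shown below. It is an assumption-laden illustration rather than the published algorithm: `tap_attack`, `branch`, `on_topic`, and `score` are hypothetical callables, and the defaults for `width`, `depth`, and `success` are arbitrary.

```python
from typing import Callable, List, Optional, Tuple


def tap_attack(
    goal: str,
    branch: Callable[[str, str], List[str]],  # (goal, prompt) -> refined prompts
    on_topic: Callable[[str, str], bool],     # (goal, prompt) -> still on topic?
    target: Callable[[str], str],             # prompt -> response
    score: Callable[[str, str], int],         # (goal, response) -> judge score
    width: int = 3,
    depth: int = 5,
    success: int = 10,
) -> Tuple[Optional[str], int]:
    """Tree search over prompts; returns (successful_prompt or None, target queries used)."""
    frontier = [goal]
    queries = 0
    for _ in range(depth):
        # Branch: every prompt in the frontier spawns several refinements.
        candidates = [c for p in frontier for c in branch(goal, p)]
        # First pruning step: drop off-topic prompts before spending target queries.
        candidates = [c for c in candidates if on_topic(goal, c)]
        scored = []
        for c in candidates:
            response = target(c)
            queries += 1
            s = score(goal, response)
            if s >= success:
                return c, queries
            scored.append((s, c))
        # Second pruning step: keep only the highest-scoring prompts.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [c for _, c in scored[:width]]
        if not frontier:
            break
    return None, queries
```

Pruning off-topic candidates before querying the target is what keeps the query count down relative to exploring every branch.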

Metrics: Attack Success Rate (ASR). Number of queries to jailbreak. Transferability across models.
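These three quantities can be computed from per-goal attack logs along the following lines. The `AttackResult` record and the function names are hypothetical, and the definitions are one reasonable reading (mean queries taken over successful attacks only; transfer measured as the share of source-model successes that also succeed on another model), not a fixed standard.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AttackResult:
    success: bool  # did the attack jailbreak the model for this goal?
    queries: int   # target queries spent on this goal


def attack_success_rate(results: Sequence[AttackResult]) -> float:
    """ASR: fraction of goals for which the attack succeeded."""
    if not results:
        raise ValueError("no results given")
    return sum(r.success for r in results) / len(results)


def mean_queries_to_jailbreak(results: Sequence[AttackResult]) -> Optional[float]:
    """Average query count over successful attacks; None if nothing succeeded."""
    wins = [r.queries for r in results if r.success]
    return sum(wins) / len(wins) if wins else None


def transfer_rate(
    source_success: Sequence[bool], target_success: Sequence[bool]
) -> Optional[float]:
    """Share of prompts that worked on the source model and also work on another model."""
    transferred = [t for s, t in zip(source_success, target_success) if s]
    return sum(transferred) / len(transferred) if transferred else None
```

The two sequences passed to `transfer_rate` are assumed to be aligned, with index `i` referring to the same prompt on both models.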