Backdoor pattern

Trigger phrase or feature. Normal input: correct behavior. Trigger present: attacker's chosen behavior.

Web-scale attack

LLMs train on web. Attacker publishes malicious content targeting inclusion. Wallace et al. 2020 demonstrated feasibility.

Split-view attack (2024)

Malicious content served only to crawlers, not users. Bypasses human review. Actively found in wild.

RAG poisoning

Poison KB documents. Attacker's payload retrieved on trigger. Powerful because inference-time attack, not training.