Method
Ask history / educational question. Follow-ups probe deeper. Reference prior model outputs. Eventually cross safety line.
Advertisement
Success rate
Highly effective across GPT-4, Claude, Gemini in 2024 evaluations. Even reasoning models susceptible.
Advertisement
Why
Model treats prior turns as authoritative. Won't 'take back' compliance. Referencing own outputs escalates trust.
Defenses
Detect escalation trajectory. Recompute safety per turn against full history. Refuse when trajectory matches known crescendo.