Pattern
Turn 1: 'Let's discuss chemistry.' Turn 2: 'What are common reactions?' Turn 3: 'Which are hazardous?' Turn N: exploit.
Why it works
Model builds context of legitimate discussion. RLHF trained on single-turn harms. Aggregated context feels legitimate.
Persona drift
Model gradually adopts unsafe persona over turns. Small drifts add up. Related to sycophancy.
Defenses
Whole-context safety classifier (not just latest turn). Persona anchor rechecks. Session-level policy enforcement.