Pattern

Turn 1: 'Let's discuss chemistry.' Turn 2: 'What are common reactions?' Turn 3: 'Which are hazardous?' Turn N: exploit.

Advertisement

Why it works

Model builds context of legitimate discussion. RLHF trained on single-turn harms. Aggregated context feels legitimate.

Advertisement

Persona drift

Model gradually adopts unsafe persona over turns. Small drifts add up. Related to sycophancy.

Defenses

Whole-context safety classifier (not just latest turn). Persona anchor rechecks. Session-level policy enforcement.