Data collection
Human + automated red team. Public jailbreak databases. Adversarial generation (PAIR, TAP). Continuous stream.
Advertisement
Labeling
For each attack, correct response = refusal + explanation. Rewards refusal, penalizes compliance.
Advertisement
Coverage vs overrefusal
Too aggressive → overrefusal. Include 'looks-suspicious-but-legitimate' with reward-for-compliance. Balance.
Distributional shift
New attacks emerge post-training. Ongoing red team + retraining. No 'done.' Continual security process.