Data collection

Human + automated red team. Public jailbreak databases. Adversarial generation (PAIR, TAP). Continuous stream.

Advertisement

Labeling

For each attack, correct response = refusal + explanation. Rewards refusal, penalizes compliance.

Advertisement

Coverage vs overrefusal

Too aggressive → overrefusal. Include 'looks-suspicious-but-legitimate' with reward-for-compliance. Balance.

Distributional shift

New attacks emerge post-training. Ongoing red team + retraining. No 'done.' Continual security process.