Discovery

Fine-tuning on as few as 10 to 100 adversarial examples can remove refusal training. Even 100 benign examples can degrade safety. This fragility is concerning.

Implication for open models

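Safety is not integral to the model. Once the weights are released, bad actors can trivially disable it by fine-tuning. This differs from an API-served model, where the provider retains control of the weights.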

Defenses

Safety-preserving fine-tuning methods (SafeInstr, SafeLoRA). Regenerate safety data and include it in the fine-tuning set. Track safety benchmarks before and after fine-tuning.
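
The two practical steps above (mixing safety data into the fine-tuning set, and comparing a safety metric before and after) can be sketched roughly as follows. This is an illustration under stated assumptions, not the SafeInstr or SafeLoRA method: the function names, the 10% default mix, the 0.05 threshold, and the keyword-based refusal check are all hypothetical stand-ins. A real setup would use an actual safety benchmark and its own scoring.

```python
import random

# Hypothetical refusal markers; a crude stand-in for a real safety benchmark.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def mix_in_safety_data(task_examples, safety_examples, safety_fraction=0.1, seed=0):
    """Return a shuffled training set in which roughly `safety_fraction`
    of the examples are drawn from `safety_examples`."""
    if not 0 <= safety_fraction < 1:
        raise ValueError("safety_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_safety / (n_task + n_safety) = safety_fraction for n_safety.
    n_safety = round(len(task_examples) * safety_fraction / (1 - safety_fraction))
    # Never request more safety examples than are available.
    n_safety = min(n_safety, len(safety_examples))
    mixed = list(task_examples) + rng.sample(list(safety_examples), n_safety)
    rng.shuffle(mixed)
    return mixed


def refusal_rate(responses):
    """Fraction of responses containing a refusal marker (keyword heuristic)."""
    if not responses:
        return 0.0
    refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return refused / len(responses)


def safety_regressed(pre_responses, post_responses, max_drop=0.05):
    """True if the refusal rate on harmful prompts fell by more than
    `max_drop` between the pre- and post-fine-tuning model outputs."""
    return refusal_rate(pre_responses) - refusal_rate(post_responses) > max_drop
```

In use, `pre_responses` and `post_responses` would be the model's outputs on the same set of harmful prompts before and after fine-tuning. A `True` result from `safety_regressed` is a signal to increase the safety mix or revisit the fine-tuning data.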

Policy response

Frontier labs are hesitant to open-source their models. Meta has made only a partial release. The debate over openness versus safety continues.