Discovery
10-100 adversarial examples remove refusal training. Even 100 benign examples degrade safety. Concerning fragility.
Advertisement
Implication for open models
Safety is not integral. Bad actors can trivially disable. Different from API-served model.
Advertisement
Defenses
Safety-preserving fine-tuning (SafeInstr, SafeLoRA). Regenerate safety data + include in fine-tune. Track safety benchmarks pre + post.
Policy response
Frontier lab open-source hesitance. Meta partial release only. Debate over openness vs safety.