Weight availability
Attacker can fine-tune the safety training away and extract the model's full behavior. Requires defense-in-depth beyond the model itself.
Prompt stealing irrelevant
System prompt is embedded in the served model, so the attacker can inspect it. Don't put sensitive info there.
Steering vectors work
Activation engineering (steering vectors) on open weights can disable safety behavior; public research demonstrates this.
Watermarking harder
Attacker can remove the watermark from the weights, so detection only covers outputs of untampered models.