Weight availability

Once the weights are public, anyone can fine-tune them: safety training is removed with a modest fine-tuning run, and the model's full behaviour can be probed offline with no rate limit, logging or API-side filter in the way. Treat the model itself as untrusted and put the controls (input and output filtering, monitoring, abuse handling, permission checks on tools) in the surrounding system, not in the weights.


Prompt stealing irrelevant

With open weights the attacker runs the model locally and sees everything the deployment feeds it: a system prompt shipped alongside the weights, or baked in by fine-tuning, is fully inspectable. Prompt-stealing attacks are moot because there is nothing left to steal. Keep credentials, internal URLs and business rules out of the prompt and enforce them server-side.


Steering vectors work

With weight access the attacker can read and write intermediate activations, not just prompts. Public research shows that refusal behaviour is often concentrated in a single direction of the residual stream: estimate that direction as the difference of mean activations between harmful and benign prompts, then add it, subtract it, or project it out at inference time (or fold the projection into the weights). No training data and no gradient step are needed, so safety that lives only in the weights is bypassable.
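The mechanics of that direction estimate, as an added example that is not part of the original page or of any published codebase. It runs on constructed synthetic activations (a random 64-dimensional residual stream with one planted "refusal" direction) rather than on a real model; the hook point, layer and the names `diff_of_means`, `steer`, `ablate` and `refusal_score` are assumptions for illustration.

```python
import numpy as np

D = 64  # hidden size of the hypothetical layer a forward hook would return
rng = np.random.default_rng(0)

# Constructed data: a planted unit direction stands in for the single
# refusal direction reported in public research. Activations from "harmful"
# prompts carry a component along it; "benign" ones do not.
refusal = rng.normal(size=D)
refusal /= np.linalg.norm(refusal)

def synthetic_acts(n, harmful, scale=3.0):
    """Synthetic residual-stream activations, shape (n, D). Constructed input."""
    base = rng.normal(size=(n, D))
    return base + (scale * refusal if harmful else 0.0)

harmful_acts = synthetic_acts(200, True)
benign_acts = synthetic_acts(200, False)

def diff_of_means(pos, neg):
    """Estimate the behaviour direction: mean(pos) - mean(neg), unit-normalised."""
    v = pos.mean(axis=0) - neg.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(h, v, alpha):
    """Activation addition: push every token's residual along v by alpha."""
    return h + alpha * v

def ablate(h, v):
    """Directional ablation: remove the component of each residual along v."""
    return h - np.outer(h @ v, v)

def refusal_score(h, v):
    """Mean projection of activations onto a (unit) direction; higher = more refusal."""
    return float(np.mean(h @ v))

if __name__ == "__main__":
    v = diff_of_means(harmful_acts, benign_acts)
    print("cosine(estimated, planted):", float(v @ refusal))
    print("harmful score before/after ablation:",
          refusal_score(harmful_acts, refusal),
          refusal_score(ablate(harmful_acts, v), refusal))
    print("benign score after steering +3:",
          refusal_score(steer(benign_acts, v, 3.0), refusal))
```

The estimate needs only two batches of forward passes and a vector subtraction, which is why a defender cannot rely on the direction staying hidden: anything recoverable from activations is recoverable by whoever holds the weights.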

Watermarking harder

Generation-time watermarks live in the sampling code (a keyed choice of which tokens to favour), not in the weights, so an attacker who runs the weights with plain sampling emits unwatermarked text; a watermark trained into the weights can be fine-tuned away. Detection therefore covers only outputs of deployments you control, and the absence of a watermark says nothing about provenance.
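A toy green-list watermark that makes the point concrete, as an added example that is not part of the original page or of any particular watermarking library. The "model" is a constructed bigram logit table standing in for the weights; the key, the vocabulary size, the constants `GAMMA` and `DELTA` and the function names are assumptions for illustration. The same table is sampled twice, once through the watermarking sampler and once through a plain sampler (what an attacker with the weights does), and the detector scores both.

```python
import hashlib
import numpy as np

VOCAB = 64
GAMMA = 0.5                      # fraction of the vocabulary in each green list
DELTA = 4.0                      # logit boost given to green tokens
KEY = b"deployment-secret-key"   # hypothetical; lives in the serving stack, not the weights

# Constructed stand-in for the weights: bigram logits, row = previous token.
W = np.random.default_rng(1).normal(size=(VOCAB, VOCAB))

def green_list(prev_tok, key=KEY):
    """Keyed pseudo-random subset of the vocabulary chosen from the previous token."""
    digest = hashlib.sha256(key + prev_tok.to_bytes(4, "big")).digest()
    perm = np.random.default_rng(int.from_bytes(digest[:8], "big")).permutation(VOCAB)
    return set(perm[: int(GAMMA * VOCAB)].tolist())

def generate(n, watermark, seed=0):
    """Sample n tokens from W; with watermark=True the green list is boosted."""
    g = np.random.default_rng(seed)
    toks = [int(g.integers(VOCAB))]
    for _ in range(n):
        logits = W[toks[-1]].copy()
        if watermark:                       # this branch is the whole watermark
            for t in green_list(toks[-1]):
                logits[t] += DELTA
        p = np.exp(logits - logits.max())
        p /= p.sum()
        toks.append(int(g.choice(VOCAB, p=p)))
    return toks

def detect_z(toks, key=KEY):
    """z-score of green-token count against the no-watermark expectation."""
    n = len(toks) - 1
    hits = sum(1 for a, b in zip(toks, toks[1:]) if b in green_list(a, key))
    return float((hits - GAMMA * n) / np.sqrt(n * GAMMA * (1 - GAMMA)))

if __name__ == "__main__":
    marked = generate(200, watermark=True)
    plain = generate(200, watermark=False)      # attacker runs the same W, no key
    print("z watermarked:", detect_z(marked))
    print("z plain sampling:", detect_z(plain))
    print("z watermarked, wrong key:", detect_z(marked, key=b"other-key"))
```

Everything that makes `marked` detectable sits in the `if watermark:` branch and in `KEY`; `W` is identical in both runs. That is why detection is only meaningful for text that came through your own sampler, and why a low z-score on `plain` is not evidence that the text did not come from your model.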