Contrastive activation addition
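Compute the mean activation over positive examples minus the mean over negative examples, taken at the same layer. Add the resulting vector to the activations at inference time to shift the model toward the positive direction.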
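A minimal sketch of the arithmetic, using plain Python lists in place of real model activations. The function names (`mean_vector`, `steering_vector`, `apply_steering`), the `coeff` scaling factor, and the toy numbers are all assumptions made for illustration; they do not come from any particular library.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    if any(len(r) != dim for r in rows):
        raise ValueError("all activation vectors must have the same length")
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]


def steering_vector(positive: Sequence[Sequence[float]],
                    negative: Sequence[Sequence[float]]) -> Vector:
    """Mean positive activation minus mean negative activation."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    if len(pos_mean) != len(neg_mean):
        raise ValueError("positive and negative activations differ in size")
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(activation: Sequence[float],
                   vector: Sequence[float],
                   coeff: float = 1.0) -> Vector:
    """Add the scaled steering vector to one inference-time activation."""
    if len(activation) != len(vector):
        raise ValueError("activation and steering vector differ in size")
    return [a + coeff * v for a, v in zip(activation, vector)]


# Toy activations (hypothetical values, 3 dimensions, 2 examples per class).
positive_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

vec = steering_vector(positive_acts, negative_acts)
steered = apply_steering([0.5, 0.5, 0.5], vec, coeff=2.0)

print(vec)      # [2.0, 0.0, -2.0]
print(steered)  # [4.5, 0.5, -3.5]
```

In a real model the same addition would be applied to the hidden state at a chosen layer during the forward pass. A negative `coeff` pushes toward the negative examples instead.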
Example steering
A 'more cautious' vector reduces compliance with harmful requests. A 'more detailed' vector increases response length. A 'less refusal' vector unlocks harmful behavior, so it can be used as an attack.
Vs prompting
Prompt steering works at the token level, and the model may resist the instruction. Activation steering bypasses tokens and changes the internal state directly. It is stronger but requires white-box access to the model.
Safety implication
Attackers who hold the model weights can steer past safety training. Safety for open-source models is therefore a fundamentally different problem from safety for models served only through an API.