Contrastive activation addition

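Compute the mean activation over positive examples minus the mean over negative examples, taken at the same layer. At inference, add the resulting vector (usually scaled by a coefficient) to that layer's activations to shift the model toward the positive behavior.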
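
The mean-difference step and the addition step can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the helper names (`mean_vector`, `steering_vector`, `steer`), the scaling coefficient `alpha`, and the toy 3-dimensional activations are all assumptions made for the example. Real activations would be read from one layer of a model.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Mean positive activation minus mean negative activation."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(activation, vector, alpha=1.0):
    """Add the scaled steering vector to one inference-time activation."""
    return [a + alpha * v for a, v in zip(activation, vector)]


# Toy activations (made-up numbers, not taken from a real model).
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]

vec = steering_vector(pos_acts, neg_acts)         # [2.0, 1.0, -1.0]
steered = steer([0.5, 0.5, 0.5], vec, alpha=2.0)  # [4.5, 2.5, -1.5]
```

A negative `alpha` pushes in the opposite direction, so the same vector can steer away from the positive behavior.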

Example steering

A 'more cautious' vector reduces compliance with harmful requests. A 'more detailed' vector increases response length. A 'less refusal' vector unlocks harmful behavior, which makes it usable as an attack.

Vs prompting

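Prompt steering works at the token level, and the model may resist the instruction. Activation steering bypasses tokens and edits the model's internal state directly. It is stronger but requires white-box access to the model.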
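
To show why white-box access is needed, here is a hedged sketch using a PyTorch forward hook. The input is left untouched and only a hidden activation is modified. The two-layer `nn.Sequential` model is a toy stand-in for a transformer block, and `steer_vec`, `alpha`, and `add_steering` are hypothetical names chosen for the example. In practice the vector would come from the positive-minus-negative mean difference at the hooked layer.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a model: two linear layers. A real use would hook the
# output of a transformer block instead.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Hypothetical steering vector and coefficient (illustrative values only).
steer_vec = torch.tensor([1.0, 0.0, -1.0, 0.0])
alpha = 2.0


def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + alpha * steer_vec


x = torch.randn(1, 4)  # the input itself is never modified
with torch.no_grad():
    baseline = model(x)
    handle = model[0].register_forward_hook(add_steering)
    steered = model(x)
    handle.remove()  # detach the hook to restore normal behavior
    restored = model(x)
```

Registering the hook requires direct access to the model's modules, which an API that only accepts text does not provide.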

Safety implication

Attackers who hold the model weights can steer past safety training. Safety for open-source models is therefore a fundamentally different problem from safety for models served only through an API.