Activation Engineering — Deploy-Time Behavior Control

Contrastive activation addition

Compute the average activation on positive examples minus the average on negative examples. Add the resulting vector to activations at inference to shift behavior toward the positive direction.
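
A minimal sketch of the computation above, using NumPy arrays as stand-ins for activations captured at one layer. The function names (`steering_vector`, `steer`), the scaling coefficient `alpha`, and the synthetic data are illustrative assumptions, not part of any particular library or of the original notes.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean activation on positive examples minus mean on negative examples.

    Both inputs have shape (n_examples, d_model).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(acts, vector, alpha=1.0):
    """Add the scaled steering vector to each row of inference activations.

    alpha is a hypothetical strength knob; alpha=1.0 adds the raw difference.
    """
    return acts + alpha * vector

# Synthetic stand-in data (assumption): positive and negative examples
# differ mainly along the first hidden dimension.
rng = np.random.default_rng(0)
d_model = 8
direction = np.zeros(d_model)
direction[0] = 1.0

pos_acts = rng.normal(size=(32, d_model)) + 2.0 * direction
neg_acts = rng.normal(size=(32, d_model)) - 2.0 * direction

v = steering_vector(pos_acts, neg_acts)

# Activations for 5 token positions at inference time (also synthetic).
inference_acts = rng.normal(size=(5, d_model))
steered = steer(inference_acts, v, alpha=1.0)
```

In a real model the arrays would come from a hook on a chosen layer, and the layer and `alpha` would need tuning; neither is specified here.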

A 'more cautious' vector reduces compliance with harmful requests. A 'more detailed' vector increases response length. A 'less refusal' vector opens up harmful behavior (attack use).

Prompt steering: works at the token level; the model may resist. Activation steering: bypasses tokens and modifies internal activations directly. Stronger, but requires white-box access.

Attackers with model weights can steer past safety training. Safety for open-source models is therefore fundamentally different from safety for API-only models.