REINFORCE

∇J(θ) = E[∇_θ log π_θ(a|s) · R], with R the return. Unbiased but high variance; subtract a baseline (e.g. a value function V(s)) from R to reduce it.
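A minimal sketch of the estimator with a baseline, assuming a softmax policy over a handful of discrete actions in a single state. The helper names (`softmax`, `reinforce_gradient`) and the toy batch are hypothetical and only illustrate the formula.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, actions, returns, baseline=0.0):
    """Monte Carlo estimate of grad J w.r.t. the logits of a softmax policy."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, R in zip(actions, returns):
        for k in range(len(logits)):
            # d/d logit_k of log pi(a) = 1[k == a] - pi(k)
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += grad_log_pi * (R - baseline)
    return [g / len(actions) for g in grad]

# Hypothetical batch: action 1 earned more return than action 0.
logits = [0.0, 0.0]
actions = [0, 1, 1, 0]
returns = [1.0, 3.0, 3.0, 1.0]
baseline = sum(returns) / len(returns)  # mean return as a simple baseline
grad = reinforce_gradient(logits, actions, returns, baseline)
```

With the mean return as baseline, the gradient pushes the logit of action 1 up and that of action 0 down. A learned V(s) would play the same role in a multi-state problem.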

Actor-critic

Actor: policy π_θ(a|s). Critic: learned value function V(s). Using the advantage A(s,a) = Q(s,a) - V(s) in place of the raw return reduces variance.
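One common way to estimate the advantage from the critic alone is the one-step TD form, which approximates Q(s,a) by r + γ V(s'). That approximation is an assumption of this sketch, not something the notes specify; `td_advantages` and its arguments are hypothetical names.

```python
def td_advantages(rewards, values, next_values, dones, gamma=0.99):
    """One-step advantage estimates: A ~ r + gamma * V(s') - V(s)."""
    advantages = []
    for r, v, v_next, done in zip(rewards, values, next_values, dones):
        # No bootstrapping from V(s') when the episode ended at this step.
        q_estimate = r + (0.0 if done else gamma * v_next)
        advantages.append(q_estimate - v)
    return advantages

# Hypothetical two-step rollout; the second step is terminal.
adv = td_advantages(
    rewards=[1.0, 0.0],
    values=[0.5, 1.0],
    next_values=[1.0, 2.0],
    dones=[False, True],
    gamma=0.9,
)
```

A positive entry means the action did better than the critic expected in that state, and a negative entry means it did worse.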

PPO

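Proximal Policy Optimization. Clips the policy ratio r(θ) = π_θ(a|s) / π_θ_old(a|s) to [1-ε, 1+ε] in the surrogate objective, so no single update moves the policy far enough to be destructive. A widely used default among policy-gradient methods.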
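The clipping can be sketched as the standard clipped surrogate objective, averaged over a batch. The function name and the default `clip_eps=0.2` are illustrative choices; log-probabilities and advantages are assumed to be precomputed.

```python
import math

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_log_probs, old_log_probs, advantages):
        # Policy ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the min gives a pessimistic bound, so moving the ratio
        # outside the clip range in the favorable direction earns nothing.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Hypothetical sample: the ratio is 1.5 and the advantage is positive,
# so the objective is capped at (1 + clip_eps) * advantage.
obj = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
```

In practice this quantity is negated and minimized with a gradient-based optimizer, usually alongside value-function and entropy terms.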

RLHF pipeline

PPO drives RLHF: (1) train a reward model on human preference comparisons, (2) use PPO to optimize the policy (the language model) against that reward model. ChatGPT was reported to be trained this way.
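The reward-model step is often trained with a pairwise (Bradley-Terry style) loss on preferred versus rejected responses. The notes do not say which loss is used, so treat this as one plausible sketch; `preference_loss` is a hypothetical name, and the scalar rewards stand in for reward-model outputs.

```python
import math

def preference_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs."""
    losses = []
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) = log(1 + exp(-m)); branch on sign for stability.
        if margin > 0:
            losses.append(math.log1p(math.exp(-margin)))
        else:
            losses.append(-margin + math.log1p(math.exp(margin)))
    return sum(losses) / len(losses)

# Hypothetical scores: the loss is small when the preferred response
# is scored higher and large when the ranking is inverted.
good = preference_loss([2.0], [0.0])
bad = preference_loss([0.0], [2.0])
```

Minimizing this loss pushes the reward model to score human-preferred responses above rejected ones. PPO then treats the model's score as the return signal.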