REINFORCE
∇J(θ) = E[∇log π_θ(a|s) · R]. High variance — use baseline (value function) to reduce.
Advertisement
Actor-critic
Actor: policy π. Critic: value V(s). Advantage A(s,a) = Q(s,a) - V(s) reduces variance.
Advertisement
PPO
Proximal Policy Optimization. Clip policy ratio in gradient update to prevent destructive updates. Modern default.
RLHF pipeline
PPO drives RLHF: train reward model on human preferences, use PPO to optimize policy against reward model. ChatGPT training.