▶ Interactive Lab

DPO Preference Loss

Direct preference optimization (DPO) vs. a reward model plus PPO.

loss = -log σ(β · (log π(A) - log π(B)) - β · (log π_ref(A) - log π_ref(B)))

What you're seeing

DPO trains on preference pairs directly: A is the preferred response, B the rejected one. No separate reward model. No PPO loop.

★ KEY TAKEAWAY
DPO skips the reward model and optimizes the policy directly on preference pairs. The loss follows from the closed-form solution of the KL-constrained reward-maximization objective that PPO-based RLHF optimizes.
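For readers who want the reasoning behind the takeaway, here is a compressed sketch of the standard derivation. The symbols r (reward), Z (normalizing constant), and π* (optimal policy) do not appear in the lab itself and are introduced here as assumptions; the prompt is left out of the notation for brevity.

```latex
\begin{aligned}
\max_{\pi}\; & \mathbb{E}_{y \sim \pi}[r(y)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})
  && \text{(KL-constrained objective)} \\
\pi^{*}(y) &= \tfrac{1}{Z}\,\pi_{\mathrm{ref}}(y)\,\exp\!\big(r(y)/\beta\big)
  && \text{(closed-form optimum)} \\
r(y) &= \beta \log \frac{\pi^{*}(y)}{\pi_{\mathrm{ref}}(y)} + \beta \log Z
  && \text{(solve for the reward)} \\
P(A \succ B) &= \sigma\big(r(A) - r(B)\big)
  && \text{(Bradley-Terry; } \beta \log Z \text{ cancels)}
\end{aligned}
```

Substituting the third line into the fourth and taking the negative log gives the loss shown at the top of the lab, with the trainable policy π in place of π*.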
▶ WHAT TO TRY
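  • Slide preferred logp up: loss drops, then flattens toward 0 once the margin is large.
  • Slide rejected logp up: loss rises (the model has to prefer A over B).
  • β scales the margin inside the sigmoid: larger β makes the loss react more steeply to the same log-probability gap. In the derivation, β is the strength of the KL penalty toward π_ref.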
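The experiments above can also be reproduced outside the widget. The following is a minimal sketch of the loss for a single preference pair, assuming A is the preferred response and all inputs are sequence log-probabilities. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from the lab.

```python
import math


def dpo_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, beta=0.1):
    """DPO loss for one pair in which A is preferred over B (illustrative helper)."""
    # Margin: how much more the policy favors A over B than the reference does.
    margin = beta * ((logp_a - logp_b) - (ref_logp_a - ref_logp_b))
    # -log sigmoid(m) = log(1 + exp(-m)); pick the branch that avoids overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy equal to the reference: the margin is 0 and the loss is log 2.
base = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Raising the preferred log-prob lowers the loss; raising the rejected one raises it.
preferred_up = dpo_loss(-3.0, -5.0, -5.0, -5.0)
rejected_up = dpo_loss(-5.0, -3.0, -5.0, -5.0)

print(base, preferred_up, rejected_up)
```

With the policy equal to the reference, the loss is log 2 (about 0.693), which is the starting point the sliders move away from. A training implementation would typically average this quantity over a batch of pairs and compute it on tensors rather than floats.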