A pretrained LLM only predicts the next token. A useful chat model also needs alignment: it should follow instructions, decline harmful requests, and respond in a helpful tone. RLHF and its simpler successor DPO are the standard recipes, and the math is approachable.

SFT — supervised fine-tuning first

Start with instruction-response pairs and train with standard cross-entropy loss. This teaches format and basic helpfulness, typically with 50K-500K examples. The output is a model that follows instructions but does not reliably pick the best of several plausible responses.
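
As a rough illustration of the cross-entropy objective, here is a minimal pure-Python sketch. The function names, the toy logits, and the loss mask (which excludes prompt tokens from the loss, a common convention that the recipe above does not specify) are assumptions made for this example; a real run would use a deep learning framework.

```python
import math

def softmax_log_probs(logits):
    """Convert a list of raw logits into log-probabilities (log-softmax)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(logits_per_position, target_ids, loss_mask):
    """Mean cross-entropy over the positions where loss_mask is 1.

    logits_per_position: one list of vocabulary logits per token position
    target_ids: the correct next-token id at each position
    loss_mask: 1 for response tokens, 0 for prompt tokens (assumed convention)
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_position, target_ids, loss_mask):
        if keep:
            total -= softmax_log_probs(logits)[target]
            count += 1
    return total / max(count, 1)

# Toy example: vocabulary of 3 tokens, 3 positions, first position is prompt.
logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]
targets = [0, 1, 2]
mask = [0, 1, 1]
loss = sft_loss(logits, targets, mask)
print(f"SFT loss on response tokens: {loss:.4f}")
```

The loss falls as the model puts more probability on the reference response tokens, which is all SFT asks of it.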

Reward model

# Given two responses A, B to the same prompt:
# Humans label which is preferred
# Train a reward model R(prompt, response) such that
# R(prompt, A) > R(prompt, B) when A is preferred
#
# Loss: -log sigmoid(R(prompt, A) - R(prompt, B))

The reward model is a separate network, often initialized from the SFT model, that outputs a scalar preference score for a prompt-response pair. It is used to score the policy's outputs during RL training.
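
The pairwise loss above can be sketched in a few lines of plain Python. The function names are hypothetical, and the two rewards are passed in as plain floats standing in for R(prompt, A) and R(prompt, B); in practice they would come from the reward network.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_pair_loss(r_preferred, r_rejected):
    """Loss for one labeled pair: -log sigmoid(R(preferred) - R(rejected))."""
    return -log_sigmoid(r_preferred - r_rejected)

# The loss is small when the preferred response scores higher,
# and large when the ranking is inverted.
good_ranking = reward_pair_loss(2.0, -1.0)
bad_ranking = reward_pair_loss(-1.0, 2.0)
print(good_ranking, bad_ranking)
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary: adding a constant to every score leaves the loss unchanged.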

PPO — RL with reward model

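PPO (proximal policy optimization) is a standard policy-gradient algorithm. Generate responses, score them with the reward model, and update the policy to favor high-reward outputs. A KL penalty against the SFT baseline keeps the policy from collapsing into degenerate outputs that exploit the reward model. The pipeline is complex and expensive: several model copies (policy, reference, reward model, and usually a value model) sit in memory at once.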
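
To make the KL penalty concrete, the sketch below shows only the reward-shaping step, not PPO itself (the clipped objective, value function, and advantage estimation are omitted). It assumes one common formulation in which the KL term is estimated from the per-token log-probabilities of the sampled response. The function name and the default `beta` are illustrative assumptions.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty toward the SFT reference.

    reward: scalar score from the reward model for the sampled response
    policy_logprobs: log-prob of each sampled token under the current policy
    ref_logprobs: log-prob of the same tokens under the frozen SFT model
    beta: penalty strength (assumed value, tuned in practice)
    """
    # Sample-based KL estimate: sum over tokens of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

# The policy assigns more probability to its own sample than the reference
# does, so the shaped reward is lower than the raw reward.
shaped = kl_penalized_reward(1.0, [-1.0, -0.5], [-1.5, -1.0], beta=0.1)
print(shaped)
```

When the policy and the reference agree on every token, the penalty vanishes and the shaped reward equals the raw reward-model score.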

DPO — direct preference optimization

# Skip the reward model entirely.
# Use preference pairs (A preferred over B) to directly optimize:
#   loss = -log sigmoid(β * [log(π(A|x)/π_ref(A|x)) - log(π(B|x)/π_ref(B|x))])
#
# π = current model, π_ref = SFT baseline

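The derivation is closed-form: the KL-constrained reward-maximization objective that PPO optimizes has an optimal policy that can be written down exactly, so the reward can be expressed through the policy itself and the problem reduces to a simple classification loss on preference pairs. Train directly on those pairs, with no reward model and no sampling loop. Memory is close to supervised training, plus a frozen copy of the reference model (or its precomputed log-probabilities). Standard in 2026 for most alignment use cases.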
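
A minimal sketch of the DPO loss for a single preference pair follows. It assumes the four inputs are sequence-level log-probabilities (the sum of token log-probs of each response given the prompt); the function name, argument names, and the default `beta` are illustrative choices, not taken from any particular library.

```python
import math

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * [log-ratio of chosen - log-ratio of rejected])."""
    # log(pi(A|x) / pi_ref(A|x)) is a difference of log-probabilities.
    chosen_logratio = pi_logp_chosen - ref_logp_chosen
    rejected_logratio = pi_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(m) = log(1 + exp(-m)), computed without overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-12.0, -12.0, -12.0, -12.0)
# After training, the policy favors the chosen response relative to the
# reference and disfavors the rejected one, so the loss drops.
loss_after = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(loss_at_init, loss_after)
```

The reference log-probabilities do not depend on the policy's weights, which is why they can be computed once up front and cached instead of keeping the reference model in memory.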

Practical recipe

Start with SFT on instruction data. Collect preference pairs, or use an existing set such as UltraFeedback or Anthropic's HH-RLHF (Helpful and Harmless). Apply DPO for 1-3 epochs. Validate on AlpacaEval and MT-Bench. Total compute: ~2-5× the SFT cost, far cheaper than PPO.

SFT first, then DPO on preference pairs. Skip PPO unless you have a reason. Standard alignment in 2026.