Architecture

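Decoder-only transformer with a causal attention mask: each position attends only to itself and earlier positions. The objective is simple: predict the next token given the previous ones. Training corpora run to trillions of tokens.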
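As a rough illustration of the causal mask, here is a minimal pure-Python sketch. The helper names `causal_mask` and `masked_softmax` are made up for this example, and the toy scores stand in for real query-key dot products. It is not how any particular model implements attention.

```python
import math

def causal_mask(n):
    """Return an n x n mask; mask[i][j] is True if position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over scores; masked-out (False) positions get zero weight."""
    out = []
    for row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(row, mask_row) if m]
        top = max(allowed)  # subtract the row max for numerical stability
        exps = [math.exp(s - top) if m else 0.0 for s, m in zip(row, mask_row)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy example: uniform scores over 4 positions (an assumption for illustration).
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, position 0 puts all its weight on itself and position 3 spreads weight evenly over all four positions, while no position gives any weight to a later one.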

Scaling laws

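Kaplan et al. 2020: loss follows separate power laws, L(N) ~ N^(-α_N), L(D) ~ D^(-α_D), L(C) ~ C^(-α_C), in params N, data D, compute C, each holding when the other two are not the bottleneck. Chinchilla (Hoffmann et al. 2022): for a fixed compute budget, parameters and data should scale together, in roughly equal proportion.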
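To make "scale together" concrete, here is a back-of-the-envelope sketch. It relies on two assumptions that are not stated above: the common approximation that training compute is about 6·N·D FLOPs, and the often-quoted rule of thumb of roughly 20 training tokens per parameter. The function name `compute_optimal_split` is hypothetical.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a compute budget into (params N, tokens D).

    Assumes C ~= 6 * N * D and D = tokens_per_param * N (both rules of thumb),
    which gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget chosen for illustration: 6 * 70e9 params * 1.4e12 tokens.
n_params, n_tokens = compute_optimal_split(5.88e23)
```

Under these assumptions a budget of 5.88e23 FLOPs works out to about 70 billion parameters and 1.4 trillion tokens. Because both N and D grow as the square root of C, quadrupling compute doubles each.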

In-context learning

Emerges at scale: given a few examples in the prompt, the model performs the task with no weight updates. This is few-shot learning without fine-tuning.
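A few-shot prompt is just text: the demonstrations are concatenated ahead of the new query and the model continues the pattern. The sketch below shows one way to assemble such a prompt. The `Input:`/`Output:` layout and the function name `build_few_shot_prompt` are assumptions for illustration, not a standard format.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Concatenate (input, output) demonstrations, then the unanswered query."""
    parts = []
    if instruction:
        parts.append(instruction)
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The final block is left open so the model supplies the output.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    examples=[("cheese", "fromage"), ("dog", "chien")],
    query="house",
    instruction="Translate English to French.",
)
```

The resulting string would be sent to the model unchanged; the task is specified entirely by the prompt, and no parameters are touched.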

Instruction tuning + RLHF

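Base model → instruct model via supervised fine-tuning (SFT) on instruction-response data → RLHF (reinforcement learning from human feedback) to align outputs with human preferences. This is the pipeline popularized by ChatGPT.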
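The "preferences" step is commonly implemented by first training a reward model on pairs of responses where a human marked one as better. A minimal sketch of the pairwise loss often used for this, -log σ(r_chosen - r_rejected), is shown below. The notes above do not specify a loss, so treat this as one plausible choice; the function name `preference_loss` and the scalar rewards are illustrative.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(reward_chosen - reward_rejected)).

    Computed as softplus(-margin), branching on the sign of the margin
    so that exp() never receives a large positive argument.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Loss is small when the preferred response scores higher, large otherwise.
loss_agree = preference_loss(2.0, 0.0)
loss_disagree = preference_loss(0.0, 2.0)
```

Minimizing this loss pushes the reward model to score human-preferred responses higher; the policy is then optimized against that reward in the reinforcement learning stage.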