Architecture
Decoder-only transformer with a causal mask: each position attends only to itself and earlier positions. Objective is simple: predict the next token given the previous ones. Trained on trillions of tokens.
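A minimal sketch of what the causal mask does to attention weights, in plain Python with no framework. The helper names (causal_mask, masked_softmax) and the uniform toy scores are illustrative assumptions, not any library's API; real implementations work on batched tensors and usually add -inf to masked scores before the softmax.

```python
import math

def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix: True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores; masked-out entries get zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]
        peak = max(kept)  # subtract the row max for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy example: uniform raw scores for a 4-token sequence (hypothetical values).
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))

for row in weights:
    print([round(w, 3) for w in row])
```

With uniform scores, each row spreads its weight evenly over the positions it is allowed to see, and every future position gets exactly zero. That is why the prediction at position i can depend only on tokens up to and including i.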
Scaling laws
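Kaplan et al. 2020: loss follows separate power laws, L(N) ~ N^(-α_N), L(D) ~ D^(-α_D), L(C) ~ C^(-α_C), in params N, data D, compute C, each holding when the other two are not the bottleneck. Chinchilla (Hoffmann et al. 2022): for compute-optimal training, parameters + data should scale together, in roughly equal proportion.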
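A rough sketch of what "scale together" means in numbers. It rests on two assumptions that are not stated in these notes: the common approximation C ≈ 6·N·D for training FLOPs, and the often-quoted Chinchilla rule of thumb of about 20 training tokens per parameter. The function names are hypothetical, and the constants in power_law_loss are placeholders to be fitted, not published values.

```python
import math

TOKENS_PER_PARAM = 20.0        # assumption: commonly quoted Chinchilla rule of thumb
FLOPS_PER_PARAM_TOKEN = 6.0    # assumption: C ≈ 6 * N * D approximation for training FLOPs

def compute_optimal(compute_flops):
    """Split a FLOP budget into (params, tokens) under the two assumptions above."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

def power_law_loss(x, x_c, alpha):
    """Single-variable power law L(x) = (x_c / x) ** alpha; x_c and alpha are fitted constants."""
    return (x_c / x) ** alpha

small = compute_optimal(1e21)
large = compute_optimal(1e23)  # 100x more compute

print(f"1e21 FLOPs: {small[0]:.3g} params, {small[1]:.3g} tokens")
print(f"1e23 FLOPs: {large[0]:.3g} params, {large[1]:.3g} tokens")
```

Under these assumptions both N and D grow as the square root of compute, so 100x more compute means about 10x more parameters and 10x more tokens, rather than putting most of the budget into parameters alone.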
In-context learning
Emerges at scale: examples in prompt → model performs the task without any weight update. Few-shot learning without fine-tuning.
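A small sketch of how a few-shot prompt is typically assembled. The "Input:/Output:" template, the function name build_few_shot_prompt, and the translation examples are all illustrative assumptions; no particular model or API requires this format.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Concatenate demonstration pairs and a final unanswered query into one prompt string."""
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    # The final item has no answer: the model is expected to continue from here.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Hypothetical demonstrations for an English-to-French task.
demos = [("cat", "chat"), ("dog", "chien")]
prompt = build_few_shot_prompt(demos, "bird", "Translate English to French.")
print(prompt)
```

The task is specified entirely in the text the model conditions on. Swapping the demonstrations changes the task, and no parameters are touched.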
Instruction tuning + RLHF
Base model → instruct model via supervised fine-tuning (SFT) on instruction data → RLHF (reinforcement learning from human feedback) for preferences. ChatGPT paradigm.
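One way the "preferences" step is commonly set up: a reward model is trained on pairs of responses where a human picked one over the other, using a pairwise loss of the form -log σ(r_chosen - r_rejected). The sketch below shows only that loss on scalar rewards. The function name and the example reward values are assumptions, and the later policy-optimization stage is not shown.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) = softplus(-m) = max(-m, 0) + log1p(exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical reward-model scores for three comparison pairs.
print(preference_loss(2.0, 0.0))   # chosen ranked higher: small loss
print(preference_loss(0.0, 0.0))   # tie: loss equals log(2)
print(preference_loss(0.0, 2.0))   # chosen ranked lower: large loss
```

The loss depends only on the difference between the two rewards, so the reward model learns a ranking rather than an absolute scale.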