GPT — Decoder-Only LLMs — Belgavi.AI Lab

Architecture

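A decoder-only transformer with a causal mask: each position attends only to itself and earlier positions. The objective is simple: predict the next token given the previous ones. Large models are trained on up to trillions of tokens.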
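
To make the causal mask concrete, here is a minimal sketch in plain Python. It assumes the raw attention scores are already computed; the function name `causal_attention_weights` is hypothetical, and a real model would do this with batched tensor operations rather than Python lists.

```python
import math

def causal_attention_weights(scores):
    """Softmax each row of a square score matrix under a causal mask.

    scores[i][j] is the raw score of query position i attending to key position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only see positions 0..i (itself and the past).
        visible = scores[i][: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# With uniform scores, each row spreads its weight evenly over the visible prefix.
weights = causal_attention_weights([[0.0, 0.0, 0.0]] * 3)
```

Because row i never places weight on positions after i, the prediction for the next token cannot depend on tokens that come later in the sequence.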

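Kaplan et al. (2020): test loss follows a power law in each of parameters N, dataset size D, and compute C, as long as the other two are not the bottleneck: L(N) ∝ N^(-α_N), L(D) ∝ D^(-α_D), L(C) ∝ C^(-α_C). Chinchilla (Hoffmann et al., 2022): for a fixed compute budget, parameters and training tokens should be scaled up in roughly equal proportion.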
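
A rough sketch of what scaling parameters and data together implies for a given compute budget. It relies on two commonly quoted approximations that are assumptions here, not statements from the text above: training compute C ≈ 6·N·D FLOPs, and about 20 training tokens per parameter. The function name `compute_optimal_allocation` is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rule-of-thumb ratio D ≈ 20 * N

def compute_optimal_allocation(compute_flops):
    """Split a FLOP budget into (parameters, tokens) under the two assumptions above."""
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Budget roughly matching a 70B-parameter model trained on 1.4T tokens.
n_params, n_tokens = compute_optimal_allocation(5.88e23)
```

Under these assumptions both N and D grow as the square root of C, so quadrupling the compute budget doubles both the model size and the token count.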

In-context learning emerges at scale: given a few examples in the prompt, the model picks up the task without any weight update. This is few-shot learning without fine-tuning.
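
As an illustration, a few-shot prompt is just the worked examples followed by an unanswered query, all in one string. The helper `build_few_shot_prompt` and the `Input:`/`Output:` layout are hypothetical choices for this sketch; any consistent format can serve.

```python
def build_few_shot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Join (input, output) example pairs and a final unanswered query into one prompt."""
    blocks = [f"{input_label}: {x}\n{output_label}: {y}" for x, y in examples]
    # The last block leaves the output empty for the model to complete.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    [("cheese", "fromage"), ("house", "maison")],
    "cat",
)
```

The model's weights never change; the task is specified entirely by the examples in its context window, and the model continues the pattern by predicting the next tokens.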

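Base model → instruct model via supervised fine-tuning (SFT) on instruction data → RLHF (reinforcement learning from human feedback) to align outputs with human preferences. This is the ChatGPT paradigm.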
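
The preference step is often implemented by first training a reward model on pairs of responses ranked by humans. Below is a minimal sketch of one common pairwise loss, -log σ(r_chosen - r_rejected); treating this as the loss is an assumption for illustration, and the function name `pairwise_preference_loss` is hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)) in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) = softplus(-d) = max(-d, 0) + log(1 + exp(-|d|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

# Loss is small when the preferred response scores higher, large when it scores lower.
loss_agree = pairwise_preference_loss(2.0, 0.0)
loss_disagree = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss pushes the reward model to score the human-preferred response above the rejected one; the trained reward model then supplies the signal for the reinforcement learning stage.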