Attention Is All You Need — Revisited

The 2017 Transformer paper changed deep learning. Nearly a decade later, knowing which parts of the original architecture survived and which got revised is one of the cleanest ways to understand modern LLMs.

What survived

Attention as the core operation: yes. Scaled dot-product scoring: yes. Multi-head attention: yes (now often as grouped-query, GQA, or multi-head latent, MLA, variants). Residual connections: yes. Normalization layers: yes (RMSNorm in place of LayerNorm). Explicit position information: yes (RoPE in place of sinusoidal embeddings).
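To make the surviving core concrete, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. The function names and the list-of-lists layout are illustrative choices, not anything from the paper's code; real implementations use batched tensor operations and multiple heads.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of lists."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for i, q in enumerate(Q):
        # Dot product of this query with every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qc * kc for qc, kc in zip(q, k)) for k in K]
        if causal:
            # Decoder-style mask: position i may only attend to positions <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out
```

With causal=True the first query can only see the first key, so its output is exactly the first value row; that mask is what makes the same operation usable for next-token prediction.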

What changed

Pre-norm replaced post-norm (training stability at depth). RMSNorm replaced LayerNorm (slightly faster, same quality). SwiGLU replaced ReLU in MLP (better quality). GQA/MLA replaced MHA (KV cache size). RoPE replaced sinusoidal positions (better extrapolation).
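Several of these revisions are small enough to sketch directly. The block below is an illustrative, stdlib-only sketch: it contrasts LayerNorm with RMSNorm (learned gain and bias omitted for brevity), shows the post-norm versus pre-norm ordering around a residual connection, and estimates KV cache size as a function of the number of key/value heads. All function names are hypothetical, and the cache formula assumes a plain per-layer K and V cache with no compression.

```python
import math

def layer_norm(x, eps=1e-6):
    # LayerNorm (gain/bias omitted): subtract the mean, divide by the std.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    # RMSNorm (gain omitted): no mean subtraction, divide by root-mean-square.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def post_norm_step(x, sublayer):
    # Original 2017 ordering: normalize after the residual add.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer):
    # Common modern ordering: normalize only the sublayer input,
    # so the residual path carries x through unchanged.
    return [a + b for a, b in zip(x, sublayer(rms_norm(x)))]

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Leading factor of 2 covers keys and values; the result is per sequence.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
```

For a hypothetical 32-layer model with head dimension 128, an 8192-token sequence, and 16-bit values, kv_cache_bytes gives 4 GiB with 32 KV heads (MHA) and 1 GiB with 8 KV heads (GQA), which is the cache saving the paragraph above refers to.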

What's gone

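Encoder-decoder architecture: still relevant for translation and T5-style models, but rare in current LLMs, which are mostly decoder-only. Task-specific supervised training (the original models were trained on machine translation): replaced by next-token-prediction pretraining. Beam search: largely replaced by sampling for open-ended generation. The original architecture's specifics are mostly archaeological now.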
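As an illustration of what "sampling" usually means in place of beam search, here is a minimal sketch of temperature plus nucleus (top-p) sampling over a single logit vector. The function name, parameters, and the choice to treat a non-positive temperature as greedy decoding are assumptions for this example, not any particular library's API.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Return a token index sampled with temperature and top-p filtering."""
    if temperature <= 0:
        # Assumption for this sketch: zero temperature means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature rescales logits: below 1 sharpens, above 1 flattens.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices renormalizes the weights over the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Passing rng=random.Random(0) makes a run reproducible. Unlike beam search, this keeps a single hypothesis and draws one token per step, trading a search for the highest-likelihood sequence for diversity.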

What's added since

MoE (Mixture of Experts). MTP (Multi-Token Prediction). FlashAttention (training/inference speed). Long context via RoPE extensions. Speculative decoding. Each is a meaningful addition to the architecture or its training/inference loop.
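Of these additions, MoE is the one that most directly changes the layer itself. Below is a minimal sketch of top-k expert routing for a single token. It uses one common variant (softmax over only the selected experts' router logits); the function names and the callable-per-expert representation are illustrative assumptions, and production routers add load balancing and batching that are omitted here.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts; return (indices, normalized weights)."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

def moe_layer(x, router_logits, experts, k=2):
    """Combine the outputs of the selected experts, weighted by the router."""
    indices, weights = route_top_k(router_logits, k)
    out = [0.0] * len(x)
    for i, w in zip(indices, weights):
        y = experts[i](x)  # only k of the experts run for this token
        out = [o + w * v for o, v in zip(out, y)]
    return out
```

The point of the design is that parameter count grows with the number of experts while per-token compute grows only with k.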

Reading the paper today

Still worth reading. The clarity of presentation set a standard. The attention math hasn't changed. The decisions you'd skim past (residuals, layer norm placement, label smoothing) are where the field has iterated most.

Core math survived. Most engineering choices (norm, MLP, positions) got revised. Read the paper; you're seeing the foundation everything else iterates on.
