Vaswani et al. (2017) introduced the Transformer, an architecture that relies on attention instead of recurrence. Eight years later, every major LLM (GPT, Claude, Llama, Gemini) is descended from it. Here's what the paper actually showed and what has evolved since.

The big insight

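RNNs process tokens sequentially: slow to train, and gradients become fragile over long sequences. Attention lets every token look directly at every other token through a few matrix multiplications, with no step-by-step recurrence. Parallelizable on GPUs. Long-range dependencies become first-class, at the cost of compute and memory that grow quadratically with sequence length.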

Self-attention math

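X is the (n_tokens x d_model) matrix of token embeddings; W_q, W_k, W_v are learned projection matrices.

Q = X @ W_q   (queries)
K = X @ W_k   (keys)
V = X @ W_v   (values)
Attention(Q,K,V) = softmax(Q @ K^T / sqrt(d_k)) @ V

The softmax is applied row by row. Dividing by sqrt(d_k), where d_k is the key dimension, keeps the dot products from growing with dimension and saturating the softmax.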
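
The formulas above translate almost line for line into NumPy. What follows is a minimal sketch, not a production implementation: the toy sizes, the random weights, and the helper names `softmax` and `self_attention` are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (n_tokens, d_model); W_q, W_k: (d_model, d_k); W_v: (d_model, d_v).
    Returns the output (n_tokens, d_v) and the weights (n_tokens, n_tokens).
    """
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

Row i of `weights` says how much token i draws from every other token. The (n_tokens, n_tokens) shape of that matrix is where the quadratic cost in sequence length comes from.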

Multi-head attention

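Run h attention operations in parallel, each with its own W_q, W_k, W_v projections into a smaller subspace (the paper uses h = 8 heads of dimension 64 for d_model = 512). Each 'head' can learn a different relationship (syntactic, semantic, positional), though nothing forces that specialization. Concatenate the head outputs, project back with W_o. Total cost is close to single-head attention at full dimension, but the model can attend to several kinds of relationship at once.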
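
One common way to implement this is to keep a single (d_model, d_model) matrix per projection and slice its output into heads, which is equivalent to giving each head its own smaller W matrices. The sketch below assumes that layout; the function name `multi_head_attention`, the toy sizes, and the random weights are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_head): each head gets its own column slice.
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                    # project back

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 32, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (6, 32)
```

All heads are computed in one batched matrix multiplication, so adding heads does not add sequential steps.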

What's evolved since 2017

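Positional encoding (originally sinusoidal) → RoPE (rotary) in many recent models. Decoder-only became dominant (GPT line), even though the original paper described an encoder-decoder for translation. Sparse attention variants (Longformer, Mistral's sliding-window attention) for long context. FlashAttention for a memory-efficient implementation that computes the same exact attention. Mixture-of-Experts for growing parameter count without a proportional increase in per-token compute.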
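
To make the decoder-only and sliding-window ideas concrete, here is a small sketch of a causal sliding-window mask applied before the softmax. It is an illustration of the general idea only: the helper names `sliding_window_causal_mask` and `masked_softmax` are hypothetical, and this is not how Longformer or Mistral implement it internally.

```python
import numpy as np

def sliding_window_causal_mask(n_tokens, window):
    """Boolean (n_tokens, n_tokens) mask; True where query i may attend to key j.

    Token i sees itself and up to window - 1 previous tokens, never the future.
    """
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return (j <= i) & (j > i - window)

def masked_softmax(scores, mask):
    # Disallowed positions are set to -inf, so they get zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

mask = sliding_window_causal_mask(n_tokens=5, window=3)
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [0 1 1 1 0]
#  [0 0 1 1 1]]

# With uniform scores, each token spreads its weight evenly over its window.
weights = masked_softmax(np.zeros((5, 5)), mask)
```

Setting `window` to `n_tokens` recovers the plain causal mask used by decoder-only models. A smaller window caps the number of keys each query touches, which is the source of the long-context savings.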

What's still the same

The Q/K/V projections, multi-head attention, and feed-forward block structure are essentially unchanged. As a rough estimate, ~80% of the 2017 paper still applies. The big innovations are around the block (training data, RLHF, scale), not in it.
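
That block structure can be sketched in a few dozen lines of NumPy. The version below assumes the post-norm ordering of the 2017 paper, LayerNorm(x + Sublayer(x)), and omits the learnable LayerNorm gain and bias, dropout, and masking for brevity. Many later models place normalization differently, and the parameter names, `init_params` helper, and toy sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector; learnable gain and bias omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(X, p, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(X @ p["W_q"]), split(X @ p["W_k"]), split(X @ p["W_v"])
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = (weights @ V).transpose(1, 0, 2).reshape(n, d_model)
    return heads @ p["W_o"]

def feed_forward(X, p):
    # Position-wise two-layer MLP with ReLU, applied to each token independently.
    hidden = np.maximum(0.0, X @ p["W_1"] + p["b_1"])
    return hidden @ p["W_2"] + p["b_2"]

def transformer_block(X, p, n_heads):
    # Residual connection around each sublayer, then normalization (post-norm).
    X = layer_norm(X + multi_head_attention(X, p, n_heads))
    X = layer_norm(X + feed_forward(X, p))
    return X

def init_params(d_model, d_ff, rng, scale=0.1):
    # Random weights stand in for trained parameters.
    p = {name: rng.normal(scale=scale, size=(d_model, d_model))
         for name in ("W_q", "W_k", "W_v", "W_o")}
    p["W_1"] = rng.normal(scale=scale, size=(d_model, d_ff))
    p["b_1"] = np.zeros(d_ff)
    p["W_2"] = rng.normal(scale=scale, size=(d_ff, d_model))
    p["b_2"] = np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_heads = 6, 32, 128, 4
params = init_params(d_model, d_ff, rng)
X = rng.normal(size=(n_tokens, d_model))

Y = transformer_block(X, params, n_heads)
print(Y.shape)  # (6, 32)
```

Because the block maps (n_tokens, d_model) to the same shape, blocks can be stacked by feeding `Y` into the next one with its own parameters.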

Multi-head self-attention + feed-forward, each wrapped in a residual connection and layer normalization = Transformer block. Eight years of progress, same backbone.