Self-attention is permutation-equivariant by construction: without positional information, it treats 'dog bites man' and 'man bites dog' as the same bag of tokens. Two main strategies inject position: sinusoidal embeddings (original Transformer) and RoPE (most recent LLMs). The math is different; the goal is the same.
Sinusoidal positional embeddings
PE[pos, 2i] = sin(pos / 10000^(2i/d))
PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
x_final = embedding + PE
Pairs of dimensions use sin/cos at geometric frequencies. Low i = high frequency (changes fast across positions). High i = low frequency (changes slowly). Fixed, not learned. Added directly to token embeddings before the first layer.
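The formulas above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `sinusoidal_table` and `add_positions` are made up for this example, and a real model would build the table with a vectorized tensor library.

```python
import math

def sinusoidal_table(num_positions, d_model, base=10000.0):
    """Build a num_positions x d_model table following the PE formulas."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so dimensions pair up")
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            # Same angle for both members of the pair (2i, 2i+1).
            angle = pos / (base ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)      # even dimension
            row[2 * i + 1] = math.cos(angle)  # odd dimension
        table.append(row)
    return table

def add_positions(embeddings, table):
    """x_final = embedding + PE, applied row by row."""
    return [[e + p for e, p in zip(emb_row, pe_row)]
            for emb_row, pe_row in zip(embeddings, table)]

# Tiny illustrative sizes: 4 positions, 8 dimensions.
pe = sinusoidal_table(num_positions=4, d_model=8)
x_final = add_positions([[0.0] * 8 for _ in range(4)], pe)
```

Position 0 always maps to the pattern sin(0), cos(0), so its row alternates 0 and 1 regardless of `d_model`.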
Why sin/cos?
Linear combinations of sin and cos at the same frequency can produce any phase shift, so for a fixed offset K, PE[pos+K] is a linear function of PE[pos]. This lets the model express 'attend to the token K positions away' as a linear function in attention. The Transformer paper hypothesized that this would also help extrapolation to longer sequences; in practice, learned absolute positions (BERT) work nearly as well.
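The phase-shift property can be checked numerically. The sketch below applies the angle-addition identities to one (sin, cos) pair; `pe_pair` and `shift_pair` are hypothetical helper names, and the frequency is chosen arbitrarily for illustration.

```python
import math

def pe_pair(pos, omega):
    """One (sin, cos) pair of the sinusoidal embedding at frequency omega."""
    return (math.sin(omega * pos), math.cos(omega * pos))

def shift_pair(pair, k, omega):
    """Map the pair at some position to the pair k positions later.

    The coefficients depend only on k and omega, never on the position
    itself, which is what makes the shift a fixed linear map.
    """
    s, c = pair
    ck, sk = math.cos(omega * k), math.sin(omega * k)
    return (s * ck + c * sk,   # sin(a + b) = sin a cos b + cos a sin b
            c * ck - s * sk)   # cos(a + b) = cos a cos b - sin a sin b

omega = 10000.0 ** (-2 * 3 / 64)  # pair i=3 for d=64 (illustrative values)
k = 5
max_error = 0.0
for pos in (0, 7, 100):
    shifted = shift_pair(pe_pair(pos, omega), k, omega)
    target = pe_pair(pos + k, omega)
    max_error = max(max_error, *(abs(a - b) for a, b in zip(shifted, target)))
```

The same 2×2 map works for every starting position, so a single learned linear projection could in principle represent a fixed relative offset.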
RoPE — Rotary Position Embeddings
# For each pair of dims (2i, 2i+1) of Q (or K):
# rotate by angle θ_i * pos
# where θ_i = 10000^(-2i/d)
q_pos[2i] = q[2i] * cos(pos*θ_i) - q[2i+1] * sin(pos*θ_i)
q_pos[2i+1] = q[2i] * sin(pos*θ_i) + q[2i+1] * cos(pos*θ_i)
Position info is applied to Q and K (not V), after the linear projection and before the attention scores are computed. The dot product Q·Kᵀ then depends on position only through the relative offset between query and key. Used by Llama, Mistral, Phi, Qwen, and most other recent open-weight LLMs.
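A runnable version of the rotation above, as a minimal sketch for a single query or key vector. The names `rope` and `dot` and the example vectors are made up for illustration; production code rotates whole batches of heads at once, and some implementations pair dimensions differently (for example, first half with second half).

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (2i, 2i+1) of vec by the angle pos * theta_i."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = [0.0] * d
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example vectors standing in for one head's projected q and k.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset (3) at very different absolute positions.
score_near = dot(rope(q, 10), rope(k, 7))
score_far = dot(rope(q, 103), rope(k, 100))
```

The two scores agree up to floating-point error, because rotating q by its position and k by its position leaves a dot product that depends only on the difference of the two angles.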
Why RoPE wins on extrapolation
Sinusoidal embeddings extend to longer contexts by formula, but the model never saw those phase values during training, so quality degrades. Because RoPE encodes relative position, context length can be extended by rescaling positions or adjusting the base frequency (linear position interpolation, NTK-aware and dynamic NTK scaling, YaRN), which works much better in practice.
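Two of these ideas can be sketched in a few lines. This is an illustration under stated assumptions, not any library's API: `rope_angles` and `ntk_scaled_base` are hypothetical helpers, the sizes are invented, and the NTK-aware formula used here (base multiplied by scale^(d/(d-2))) is the commonly cited form. YaRN, which blends per-frequency scaling with an attention temperature, is left out.

```python
import math

def rope_angles(pos, d, base=10000.0, position_scale=1.0):
    """Rotation angles pos * theta_i for each dimension pair.

    position_scale > 1 compresses positions (linear position interpolation).
    """
    return [(pos / position_scale) * base ** (-2 * i / d) for i in range(d // 2)]

def ntk_scaled_base(base, scale, d):
    """NTK-aware base adjustment (assumed form): base * scale^(d / (d - 2))."""
    return base * scale ** (d / (d - 2))

# Hypothetical setup: head dimension 64, trained on 2048 tokens, extend 4x.
d, train_len, scale = 64, 2048, 4.0

# Linear interpolation: position 8192 is squeezed back onto the angles
# the model saw for position 2048 during training.
interpolated = rope_angles(4 * train_len, d, position_scale=scale)
trained = rope_angles(train_len, d)

# NTK-aware scaling: the highest frequency (i = 0) is left unchanged, while
# the lowest frequency is slowed down by exactly the scale factor.
new_base = ntk_scaled_base(10000.0, scale, d)
ntk = rope_angles(4 * train_len, d, base=new_base)
plain = rope_angles(4 * train_len, d)
```

Linear interpolation compresses every frequency equally, while the base adjustment spreads the compression unevenly so that fast-changing dimensions keep their local resolution.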
Storage and compute cost
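Sinusoidal: precomputed N×d table. RoPE: precomputed cos and sin tables, each of size N×d/2, where d is the dimension being rotated (the per-head dimension in typical implementations). Both use negligible memory. RoPE adds a few element-wise ops per Q/K computation, which is basically free on CPU and GPU.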
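As a rough back-of-the-envelope check, the table sizes can be computed directly. The configuration below (4096 positions, model width 512, head dimension 128, 4-byte floats) is an assumption chosen for illustration, and the helper names are hypothetical.

```python
def sinusoidal_table_bytes(n_positions, d_model, bytes_per_value=4):
    """One N x d table of floats."""
    return n_positions * d_model * bytes_per_value

def rope_table_bytes(n_positions, head_dim, bytes_per_value=4):
    """Two tables (cos and sin), each N x head_dim/2."""
    return 2 * n_positions * (head_dim // 2) * bytes_per_value

MIB = 1024 * 1024
sin_mib = sinusoidal_table_bytes(4096, 512) / MIB   # 8.0 MiB
rope_mib = rope_table_bytes(4096, 128) / MIB        # 2.0 MiB
```

Either figure is tiny next to the gigabytes of weights in a typical LLM, and the RoPE tables are shared across all layers and heads that use the same head dimension.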