Self-attention is permutation-equivariant by construction: without positional information, reordering the input tokens only reorders the outputs, so 'dog bites man' and 'man bites dog' yield the same set of token representations. Two main strategies inject position: sinusoidal embeddings (the original Transformer) and RoPE (most current LLMs). The math is different; the goal is the same.
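The order-blindness described above can be checked with a few lines of NumPy. This is a minimal sketch, assuming a single attention head with random weights and random vectors standing in for the three token embeddings; `self_attention` and the variable names are illustrative, not taken from any library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))  # stand-ins for "dog", "bites", "man"
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])  # "man", "bites", "dog"
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Reordering the input only reorders the output rows.
same_up_to_order = np.allclose(out[perm], out_perm)
```

Each token gets exactly the same output vector in both sentences, so nothing downstream can tell the two word orders apart.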

Sinusoidal positional embeddings

PE[pos, 2i]   = sin(pos / 10000^(2i/d))
PE[pos, 2i+1] = cos(pos / 10000^(2i/d))

x_final = embedding + PE

Pairs of dimensions use sin/cos at geometrically spaced frequencies. Low i means high frequency (the value changes quickly across positions); high i means low frequency (it changes slowly). The table is fixed, not learned, and is added directly to the token embeddings before the first layer.
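The formulas above translate almost directly into NumPy. The sketch below assumes an even `d`; the function name `sinusoidal_pe` and the zero-filled `token_embeddings` placeholder are illustrative only.

```python
import numpy as np

def sinusoidal_pe(num_positions, d, base=10000.0):
    """Return a (num_positions, d) table of sinusoidal encodings; d must be even."""
    pos = np.arange(num_positions)[:, None]   # (N, 1)
    i = np.arange(d // 2)[None, :]            # (1, d/2)
    angles = pos / base ** (2 * i / d)        # (N, d/2)
    pe = np.empty((num_positions, d))
    pe[:, 0::2] = np.sin(angles)              # even dims: sin
    pe[:, 1::2] = np.cos(angles)              # odd dims: cos
    return pe

pe = sinusoidal_pe(num_positions=128, d=64)
token_embeddings = np.zeros((128, 64))        # placeholder for real embeddings
x_final = token_embeddings + pe
```

Column 0 oscillates with a period of about 6 positions, while the last pair barely moves over 128 positions. That spread is the geometric frequency range in action.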

Why sin/cos?

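Linear combinations of sin and cos at the same frequency can produce any phase shift, so for a fixed offset K, PE[pos+K] is a linear function (a per-pair rotation) of PE[pos] that does not depend on pos. This lets the model express 'attend to the token K positions away' as a linear function in attention. The Transformer paper hypothesized that this would also help extrapolation to longer sequences; in practice, learned absolute positions (BERT) work nearly as well.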
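The phase-shift argument can be verified numerically. The sketch below builds one 2×2 matrix per (sin, cos) pair that depends only on the offset `k`, then checks that applying it to the encoding at `pos` gives the encoding at `pos + k`. The helper names and the chosen values of `d`, `pos`, and `k` are arbitrary.

```python
import numpy as np

d, base = 16, 10000.0
freqs = base ** (-2 * np.arange(d // 2) / d)  # one frequency per (sin, cos) pair

def pe_pairs(pos):
    """Encoding at one position, shaped (d/2, 2) as (sin, cos) pairs."""
    return np.stack([np.sin(pos * freqs), np.cos(pos * freqs)], axis=-1)

def shift_matrices(k):
    """One 2x2 matrix per pair; depends on the offset k only, not on pos."""
    c, s = np.cos(k * freqs), np.sin(k * freqs)
    return np.stack([np.stack([c, s], axis=-1),
                     np.stack([-s, c], axis=-1)], axis=-2)  # (d/2, 2, 2)

pos, k = 37, 5
shifted = np.einsum("iab,ib->ia", shift_matrices(k), pe_pairs(pos))
matches = np.allclose(shifted, pe_pairs(pos + k))
```

The same `shift_matrices(k)` works for every starting position, which is what makes a fixed relative offset expressible as a linear map.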

RoPE — Rotary Position Embeddings

# For each pair of dims (2i, 2i+1) of Q (or K):
# rotate by angle θ_i * pos
#   where θ_i = 10000^(-2i/d)

q_pos[2i]   = q[2i]   * cos(pos*θ_i) - q[2i+1] * sin(pos*θ_i)
q_pos[2i+1] = q[2i]   * sin(pos*θ_i) + q[2i+1] * cos(pos*θ_i)

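Position information is applied to Q and K (not V), after the linear projection and before the attention scores are computed. Because both vectors are rotated, the dot product Q·Kᵀ depends only on the relative offset between the two positions. Used by Llama, Mistral, Phi, Qwen, and most other current open-weight LLMs.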
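A vectorized version of the rotation above, plus a check of the relative-position property, might look like the following. This is a sketch that assumes the interleaved pairing (2i, 2i+1) used in the pseudocode; some implementations pair dimension i with i + d/2 instead, which changes the memory layout but not the idea. The function name `rope` is illustrative.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive pairs (2i, 2i+1) of x by the angle pos * theta_i.

    x: (seq_len, d) with d even; positions: sequence of length seq_len.
    """
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)               # (d/2,)
    angles = np.asarray(positions)[:, None] * theta[None, :]  # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 64))
k = rng.normal(size=(1, 64))

# Same offset (3) at two different absolute positions.
score_a = rope(q, [10]) @ rope(k, [7]).T
score_b = rope(q, [110]) @ rope(k, [107]).T
relative_only = np.allclose(score_a, score_b)
```

Shifting both positions by 100 leaves the attention score unchanged: only the offset between query and key survives the dot product. The rotation also preserves vector norms, so it does not rescale Q or K.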

Why RoPE wins on extrapolation

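Sinusoidal embeddings extend to longer contexts by formula, but the model never saw those phase values during training, so quality degrades. Plain RoPE has the same problem beyond its training length, but because it encodes relative position through rotation angles, the context can be extended by rescaling positions or the base frequency (linear position interpolation, NTK-aware and dynamic NTK scaling, YaRN), usually with little or no fine-tuning. This works much better in practice.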
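The two simplest scaling schemes differ only in which quantity they rescale. The sketch below compares the rotation angles they produce; the head dimension, training length, and scale factor are hypothetical, and the NTK-aware base formula `base * scale ** (d / (d - 2))` is the commonly cited form rather than something taken from this article. YaRN, which treats frequency bands differently, is not shown.

```python
import numpy as np

def rope_angles(positions, d, base=10000.0):
    """Rotation angles pos * theta_i, shaped (len(positions), d/2)."""
    theta = base ** (-2 * np.arange(d // 2) / d)
    return np.asarray(positions, dtype=float)[:, None] * theta[None, :]

d, train_len, scale = 64, 2048, 4.0  # hypothetical values
positions = np.arange(int(train_len * scale))

plain = rope_angles(positions, d)

# Linear position interpolation: squeeze positions back into the trained range.
linear = rope_angles(positions / scale, d)

# NTK-aware scaling: enlarge the base instead, which stretches the lowest
# frequencies the most and leaves the highest frequency untouched.
ntk_base = 10000.0 * scale ** (d / (d - 2))
ntk = rope_angles(positions, d, base=ntk_base)
```

With linear interpolation every frequency is slowed by the same factor, which blurs nearby positions. With the NTK-style base change, the fastest pair keeps its original resolution while the slowest pair is slowed by the full factor `scale`.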

Storage and compute cost

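Sinusoidal: a precomputed N×d table, where d is the model dimension. RoPE: precomputed cos and sin tables, each of size N×d/2, where d is the per-head dimension. Both are negligible next to the model weights. RoPE adds a few element-wise operations per Q/K computation, which is essentially free on both CPU and GPU.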
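To put rough numbers on the table sizes, here is a back-of-the-envelope calculation. All values (context length, model width, head dimension, 16-bit storage) are hypothetical and chosen only for illustration.

```python
# Hypothetical configuration, for illustration only.
n_positions = 8192
d_model = 4096        # width of the token embeddings
head_dim = 128        # per-head dimension that RoPE rotates
bytes_per_value = 2   # 16-bit floats

# Sinusoidal: one N x d_model table.
sinusoidal_bytes = n_positions * d_model * bytes_per_value

# RoPE: a cos table and a sin table, each N x head_dim/2, shared by all heads.
rope_bytes = 2 * n_positions * (head_dim // 2) * bytes_per_value

sinusoidal_mib = sinusoidal_bytes / 2**20
rope_mib = rope_bytes / 2**20
```

Under these assumptions the sinusoidal table takes 64 MiB and the RoPE tables 2 MiB, both small next to the weights of a model with these dimensions.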

Sinusoidal adds position to the embedding. RoPE rotates Q and K. RoPE handles long-context extension better, and most modern LLMs use it.