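Sinusoidal position embeddings (the original Transformer) add position information to the token embeddings once, before the first attention layer. RoPE (rotary position embedding) instead rotates the query and key vectors inside every attention layer, so the attention score depends directly on relative position. In practice this makes context-length extension easier, and the implementation is only a few lines. Most modern open LLMs use RoPE.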

The basic idea

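For each query and key vector, split the dimensions into pairs and rotate each pair by an angle proportional to the token's position. Each pair uses a different frequency: high-frequency pairs distinguish nearby positions, low-frequency pairs distinguish distant ones. Because rotations compose, the dot product between a query at position m and a key at position n depends only on the offset m - n, so it encodes relative position with no extra terms.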
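The rotation can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: the function names, the base of 10000, the head dimension of 64, and the interleaved (even, odd) pairing are all assumptions made for the example, and some implementations pair the first and second halves of the vector instead.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0):
    """Angles of shape (len(positions), head_dim // 2)."""
    # One frequency per dimension pair: base^(-2i/d) for i = 0 .. d/2 - 1.
    # Pair 0 rotates fastest (high frequency), the last pair slowest.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)

def apply_rope(x, positions, base=10000.0):
    """Rotate each (even, odd) pair of x by its position-dependent angle.

    x: float array of shape (seq_len, head_dim), head_dim even.
    positions: integer array of shape (seq_len,).
    """
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    # Standard 2D rotation applied to every pair.
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Hypothetical query and key vectors for a single attention head.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

def score(m, n):
    """Attention logit for a query at position m and a key at position n."""
    q_rot = apply_rope(q[None, :], np.array([m]))[0]
    k_rot = apply_rope(k[None, :], np.array([n]))[0]
    return float(q_rot @ k_rot)

# Same offset (3), very different absolute positions.
print(score(5, 2), score(105, 102))
```

The two printed scores should agree up to floating-point error, since only the offset between the positions enters the dot product. The rotation also leaves each vector's norm unchanged.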

Why it's better than learned absolute

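Relative position is what matters most for attention. RoPE encodes it by construction, whereas a learned absolute embedding has to learn how offsets relate separately for each position, and has no embedding at all for positions past the training length. RoPE is defined at every position, so extending beyond the training length works much better, particularly once the rotation frequencies are rescaled.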

Long-context tricks

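Position interpolation, dynamic NTK scaling, and YaRN all extend a trained context window by rescaling RoPE's rotation angles. Position interpolation divides the position index by the extension factor. NTK-aware scaling raises the RoPE base so that low frequencies stretch while high frequencies barely change, and the dynamic variant recomputes the factor from the current sequence length. YaRN interpolates the low-frequency dimensions, leaves the high-frequency ones alone, and adds an attention temperature adjustment. This is why an 8K-trained model can be extended to 32K with minor tuning.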
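The two simplest of these can be sketched as changes to how the angles are computed. This is a rough illustration under stated assumptions: the function names are hypothetical, `scale` is the extension factor (4 for 8K to 32K), and the NTK exponent d / (d - 2) is the commonly cited community formula rather than something taken from a specific library. YaRN's per-dimension blending and temperature term are left out.

```python
import numpy as np

def inv_freq(head_dim, base=10000.0):
    """Per-pair rotation frequencies, base^(-2i/d)."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def angles_position_interpolation(positions, head_dim, scale, base=10000.0):
    # Position interpolation: squeeze positions back into the trained range
    # by dividing the position index by the extension factor.
    return np.outer(np.asarray(positions) / scale, inv_freq(head_dim, base))

def ntk_scaled_base(head_dim, scale, base=10000.0):
    # NTK-aware scaling (assumed formula): base' = base * scale^(d / (d - 2)).
    # The highest frequency is unchanged; the lowest is divided by `scale`.
    return base * scale ** (head_dim / (head_dim - 2))

def angles_ntk(positions, head_dim, scale, base=10000.0):
    # Positions are left alone; only the base (and so the frequencies) changes.
    new_base = ntk_scaled_base(head_dim, scale, base)
    return np.outer(positions, inv_freq(head_dim, new_base))

# Extending a hypothetical 8K-trained model to 32K: scale = 4.
head_dim, scale = 64, 4.0
pi_angles = angles_position_interpolation([32768], head_dim, scale)
ntk_angles = angles_ntk([32768], head_dim, scale)
```

With position interpolation, position 32768 gets exactly the angles the model saw at position 8192 during training. With the NTK-style base change, the fastest-rotating pair is untouched, which preserves resolution between neighbouring tokens, while the slowest pair is stretched by the full factor.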

In short: RoPE rotates queries and keys by position. It has been the default in open LLMs since around 2023, and frequency-rescaling tricks (YaRN and others) extend context cheaply.