Deriving RoPE the proper way


RoPE has become the de facto positional embedding for transformer models. Its popularity mainly stems from its performance, but the “derivation” in the paper is also quite elegant, if flawed.

Implementing high-dimensional RoPE also pushes us to think about generalizing the underlying ideas as far as possible (alongside using signal-processing intuition) - there’s code at the end of the post that implements things based on the ideas we develop here.

Upon a closer look, the original derivation is unfortunately not rigorous - the paper solves the problem for 2 head dimensions (not position dimensions), and then generalizes it to any higher (even) number of dimensions with a similar-looking form. That does provide a solution, but it leaves open whether there are other solutions. There is another attempt at a proof here that addresses completeness, but it makes some assumptions that, while accidentally benign, leave the proofs incomplete.
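To make the construction being generalized concrete, here is a minimal sketch (in Python/NumPy, with hypothetical function names) of the standard block-diagonal form from the paper: consecutive pairs of head dimensions are treated as 2D planes, and each pair is rotated by an angle proportional to the token position, with one frequency per pair. Note that the pairing convention (interleaved vs. split-in-half) varies between implementations.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply the standard block-diagonal RoPE rotation to a vector `x`
    of even head dimension d at integer position `pos`.
    Each pair (x[2i], x[2i+1]) is rotated by pos * theta_i,
    with theta_i = base**(-2i / d) as in the original paper."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE requires an even head dimension"
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D block
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # 2D rotation in each plane
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```

The property this is built to satisfy is that the dot product between a rotated query at position m and a rotated key at position n depends on the positions only through m - n; the question the rest of the post tackles is whether this block-of-2D-rotations form is the only (or the most expressive) way to get that property.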

So I decided to settle this question and show that RoPE is actually optimally expressive under these conditions. Well, not quite - a couple of small things make it slightly suboptimal, but increasing the head dimension just works.
