You could have designed state of the art positional encoding

Gall's Law: "A complex system that works is invariably found to have evolved from a simple system that worked." (John Gall)

This post walks you through the step-by-step discovery of state-of-the-art positional encoding in transformer models. We will achieve this by iteratively improving our approach to encoding position, arriving at Rotary Positional Encoding (RoPE), which is used in the latest Llama 3.2 release and in most modern transformers. This post aims to limit the mathematical knowledge required to follow along, but a basic grasp of linear algebra, trigonometry and self attention is expected.

As with all problems, it is best to start by understanding exactly what we are trying to achieve. The self attention mechanism in transformers is used to capture relationships between tokens in a sequence. Self attention is a set operation, which means it is permutation invariant (order does not matter). If we do not enrich self attention with positional information, many important relationships cannot be determined.
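To make this concrete, here is a minimal sketch (not from the original post) showing the permutation property with torch.nn.MultiheadAttention: given token embeddings that carry no positional information, permuting the input sequence simply permutes the output rows, so the model has no way of knowing where each token sits. The dimensions and the permutation below are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)

embed_dim, num_heads, seq_len = 64, 4, 5
mha = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
mha.eval()

# Token embeddings with no positional information added.
x = torch.randn(1, seq_len, embed_dim)

# An arbitrary reordering of the sequence (e.g. shuffling the word order).
perm = torch.tensor([4, 2, 0, 3, 1])
x_perm = x[:, perm, :]

with torch.no_grad():
    out, _ = mha(x, x, x)                 # self attention on the original order
    out_perm, _ = mha(x_perm, x_perm, x_perm)  # self attention on the shuffled order

# The outputs match up to the same permutation: attention itself is blind to position.
print(torch.allclose(out[:, perm, :], out_perm, atol=1e-6))  # True
```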

Intuitively, these have very different meanings. Let's see what happens if we first tokenize them, map them to the real token embeddings of Llama 3.2 1B, and pass them through torch.nn.MultiheadAttention.
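A minimal sketch of that experiment is below. It assumes access to the gated "meta-llama/Llama-3.2-1B" checkpoint on the Hugging Face Hub, the example sentences are placeholders (the originals are not reproduced here), and the head count of 32 is simply chosen to mirror Llama 3.2 1B's configuration; none of it changes the observation that tokens with identical embeddings are treated identically regardless of position.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # gated repo, access assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Placeholder sentences for illustration only.
sentences = [
    "The dog chased another dog",
    "Another dog chased the dog",
]

# Frozen token embeddings from Llama 3.2 1B (hidden size 2048), no positions added.
embed = model.get_input_embeddings()
mha = torch.nn.MultiheadAttention(embed_dim=2048, num_heads=32, batch_first=True)
mha.eval()

for s in sentences:
    ids = tokenizer(s, return_tensors="pt", add_special_tokens=False).input_ids
    with torch.no_grad():
        x = embed(ids)              # (1, seq_len, 2048)
        _, attn = mha(x, x, x)      # attention weights, averaged over heads
    # Repeated tokens ("dog") produce identical rows and columns in the attention
    # matrix: without positional information they are indistinguishable.
    print(s)
    print(attn[0].round(decimals=3))
```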
