This post explains a misconception I had about transformer embeddings when I was getting started. Thanks to Stephen Fowler for the discussion last August that made me realise the misconception, and to others for helping me refine my explanation. Thanks also to Stephen Fowler and JustisMills for feedback on this post. Any mistakes are my own.
TL;DR: While the token vectors are stored as n-dimensional vectors, thinking of them as points in a vector space can be quite misleading. It is better to think of them as directions on a hypersphere, together with a size component.
The Euclidean distance between two embedding vectors treats them as points in space:

$$d(\vec{x}_1, \vec{x}_2) = |\vec{x}_1 - \vec{x}_2| = \sqrt{\sum_i (x_{1,i} - x_{2,i})^2}$$
The dot product, by contrast, is a similarity measure that compares directions and magnitudes:

$$s(\vec{x}_1, \vec{x}_2) = \vec{x}_1 \cdot \vec{x}_2 = |\vec{x}_1|\,|\vec{x}_2|\cos\theta_{12}$$
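To see how the two measures can disagree, here is a minimal sketch (the vectors and function names are illustrative, not from the post): two vectors pointing in the same direction but with different norms are far apart by Euclidean distance, yet identical by cosine similarity.

```python
import numpy as np

def euclidean_distance(x1, x2):
    # d(x1, x2) = |x1 - x2|: treats the vectors as points.
    return np.linalg.norm(x1 - x2)

def cosine_similarity(x1, x2):
    # cos(theta_12) in [-1, 1]: compares direction alone,
    # with the magnitudes divided out of the dot product.
    return np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

# Two "token vectors" pointing the same way, but with different sizes.
x1 = np.array([1.0, 2.0, 2.0])
x2 = 10.0 * x1

print(euclidean_distance(x1, x2))  # 27.0 -- far apart as points
print(cosine_similarity(x1, x2))   # 1.0  -- identical as directions
```

This is why the "directions on a hypersphere, plus a size" picture matters: the point-based distance conflates a change of direction with a change of scale.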