
Don't use cosine similarity carelessly


Midas turned everything he touched into gold. Data scientists turn everything into vectors. We do it for a reason — as gold is the language of merchants, vectors are the language of AI.

Just as Midas discovered that turning everything to gold wasn't always helpful, we'll see that blindly applying cosine similarity to vectors can lead us astray. While embeddings do capture similarities, they often reflect the wrong kind: matching questions to questions rather than questions to answers, or getting distracted by superficial patterns like writing style and typos rather than meaning. This post shows you how to be more intentional about similarity and get better results.
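To make that failure mode concrete, here is a minimal sketch. It assumes the sentence-transformers package and an arbitrary choice of encoder model (all-MiniLM-L6-v2); both are illustrative stand-ins, and any sentence embedding model exhibits the same tendency to varying degrees:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical model choice; any sentence encoder works for the demonstration.
model = SentenceTransformer("all-MiniLM-L6-v2")

question = "How do I reset my password?"
other_question = "How can I change my password?"
answer = "Click 'Forgot password' on the login page and follow the email link."

q, oq, a = model.encode([question, other_question, answer])

# Cosine similarity of question vs. another question, and question vs. its answer.
print("question vs. question:", util.cos_sim(q, oq).item())
print("question vs. answer:  ", util.cos_sim(q, a).item())
# Typically the question-question pair scores higher, even though the answer
# is what we actually want to retrieve. That is the "wrong kind" of similarity.
```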

Embeddings are so captivating that my most popular blog post remains "king - man + woman = queen; but why?". We have word2vec, node2vec, food2vec, game2vec, and if you can name it, someone has probably turned it into a vec. If not yet, it's your turn!
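For the curious, the famous analogy can be reproduced in a few lines. This sketch assumes gensim and its downloadable GloVe vectors, which are one convenient stand-in for word2vec:

```python
import gensim.downloader as api

# Small pretrained word vectors (downloaded on first use).
model = api.load("glove-wiki-gigaword-50")

# king - man + woman ~= ? With these vectors, "queen" should top the list.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```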

When we work with raw IDs, we're blind to relationships. Take the words "brother" and "sister" — to a computer, they might as well be "xkcd42" and "banana". But with vectors, we can chart entities and the relationships between them, both to provide structured input to machine learning models and, on their own, to find similar items.
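A toy sketch of that last point, with made-up three-dimensional vectors standing in for learned embeddings (the numbers are purely illustrative):

```python
import numpy as np

# Made-up toy vectors; real embeddings have hundreds of dimensions.
vectors = {
    "brother": np.array([0.9, 0.1, 0.3]),
    "sister":  np.array([0.8, 0.2, 0.35]),
    "banana":  np.array([0.1, 0.9, 0.0]),
}

def cosine(u, v):
    # Cosine similarity: dot product of the vectors divided by their norms.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = vectors["brother"]
for word, vec in vectors.items():
    if word != "brother":
        print(word, round(cosine(query, vec), 3))
# "sister" scores far higher than "banana": the geometry encodes a
# relationship that raw IDs cannot express.
```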
