
How ‘Embeddings’ Encode What Words Mean — Sort Of



A picture may be worth a thousand words, but how many numbers is a word worth? The question may sound silly, but it happens to be the foundation that underlies large language models, or LLMs — and through them, many modern applications of artificial intelligence.

Every LLM has its own answer. In Meta’s open-source Llama 3 model, each word is represented by a list of 4,096 numbers; for GPT-3, it’s 12,288. Individually, these long numerical lists — known as embeddings — are just inscrutable chains of digits. But in concert, they encode mathematical relationships between words that can look surprisingly like meaning.
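To make those numbers concrete, here is a minimal sketch of what an embedding literally is inside a model: one row of floating-point numbers per word, stored in a big table and looked up by index. The toy vocabulary, the word-to-index mapping, and the random values are illustrative assumptions, not details of Llama 3 or GPT-3.

```python
import numpy as np

# Illustrative sizes: Llama 3 uses 4,096 numbers per word; GPT-3 uses 12,288.
EMBEDDING_DIM = 4096
VOCAB_SIZE = 8  # a toy vocabulary; real models have on the order of 100,000 tokens

# A hypothetical word-to-index mapping (real models use learned tokenizers).
vocab = {"dog": 0, "cat": 1, "spoon": 2, "fork": 3,
         "run": 4, "walk": 5, "happy": 6, "sad": 7}

# The embedding table: one row of EMBEDDING_DIM numbers for every word.
# In a trained LLM these values are learned; here they are random placeholders.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(VOCAB_SIZE, EMBEDDING_DIM))

def embed(word: str) -> np.ndarray:
    """Look up the list of numbers (the embedding) for a word."""
    return embedding_table[vocab[word]]

print(embed("dog").shape)  # (4096,) -- the word 'dog' as 4,096 numbers
```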

The basic idea behind word embeddings is decades old. To model language on a computer, start by taking every word in the dictionary and making a list of its essential features — how many is up to you, as long as it’s the same for every word. “You can almost think of it like a 20 Questions game,” said Ellie Pavlick, a computer scientist studying language models at Brown University and Google DeepMind. “Animal, vegetable, object — the features can be anything that people think are useful for distinguishing concepts.” Then assign a numerical value to each feature in the list. The word “dog,” for example, would score high on “furry” but low on “metallic.” The result will embed each word’s semantic associations, and its relationship to other words, into a unique string of numbers.
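A back-of-the-envelope sketch of that “20 Questions” idea: the feature names, the hand-assigned scores, and the use of cosine similarity below are all illustrative choices rather than part of any published model, but they show how a word’s feature scores become a vector and how related words end up numerically close.

```python
import numpy as np

# Hand-picked features, in the spirit of a 20 Questions game (illustrative only).
features = ["animal", "furry", "metallic", "edible", "man_made"]

# Hand-assigned scores in [0, 1] for each word on each feature.
words = {
    "dog":   np.array([1.0, 0.90, 0.0, 0.0, 0.0]),
    "cat":   np.array([1.0, 0.95, 0.0, 0.0, 0.0]),
    "spoon": np.array([0.0, 0.00, 0.8, 0.0, 1.0]),
    "apple": np.array([0.0, 0.00, 0.0, 1.0, 0.0]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Close to 1.0 means the vectors point the same way; 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(words["dog"], words["cat"]))    # ~1.0: very similar words
print(cosine_similarity(words["dog"], words["spoon"]))  # 0.0: unrelated words
```

Modern LLMs learn these values automatically from data rather than having them assigned by hand, which is why the individual numbers in a real embedding are inscrutable even though the relationships between vectors remain meaningful.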
