Why is it Okay to Average Embeddings?

People often summarize a “bag of items” by adding together the embeddings of each individual item. For example, graph neural networks summarize a section of the graph by averaging the embeddings of its nodes [1]. In NLP, one way to create a sentence embedding is to use a (weighted) average of word embeddings [2]. It is also common to use the average as an input to a classifier or for other downstream tasks.
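
As a concrete illustration, here is a minimal mean-pooling sketch; the tiny vocabulary, random embedding table, and 300-dimensional size are hypothetical stand-ins, not any particular model's.

```python
import numpy as np

# Hypothetical setup: a tiny vocabulary with random 300-dimensional embeddings.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
embedding_table = {word: rng.normal(size=300) for word in vocab}

def sentence_embedding(tokens, weights=None):
    """Summarize a bag of tokens by a (weighted) average of their embeddings."""
    vectors = np.stack([embedding_table[t] for t in tokens])
    if weights is None:
        return vectors.mean(axis=0)
    weights = np.asarray(weights, dtype=float)
    return weights @ vectors / weights.sum()

summary = sentence_embedding(["the", "cat", "sat", "on", "the", "mat"])
print(summary.shape)  # (300,) -- one vector summarizing the whole bag
```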

I have heard the argument that the average is a good representation because it includes information from all of the individual components. Each component “pulls” the vector in a new direction, so the overall summary has a unique direction that is based on all of the components. But these arguments bother me because addition is not one-to-one: there are an unlimited number of ways to pick embeddings with the same average. If unrelated collections of embeddings can have similar averages, it seems strange that the mean can preserve enough information for downstream tasks. Yet based on empirical evidence, the overwhelming consensus is that it does.
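
To make that worry concrete, here is a toy example (with made-up 2-D vectors, not real embeddings) of two unrelated collections that share exactly the same mean:

```python
import numpy as np

# Two unrelated collections of toy 2-D "embeddings" with identical averages.
collection_a = np.array([[1.0, 0.0], [-1.0, 0.0]])   # mean = [0, 0]
collection_b = np.array([[0.0, 5.0], [0.0, -5.0]])   # mean = [0, 0]

print(collection_a.mean(axis=0))  # [0. 0.]
print(collection_b.mean(axis=0))  # [0. 0.]
# Averaging is many-to-one: very different bags can collapse to the same summary.
```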

Spoiler Warning: The average is a good summary because, under a reasonable statistical model of neural embeddings, there is a very small chance that two unrelated collections will have similar means. The proof involves a Chernoff bound on the angle between two random high-dimensional vectors. We obtain the bound using a recent result on the sub-Gaussianity of the beta distribution [6].
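
The proof itself is in the post; below is only a quick simulation, under an assumed isotropic Gaussian model of embeddings (which may differ from the model used in the post), showing that the angle between the means of two unrelated collections concentrates near 90 degrees as the dimension grows, so a coincidental collision of averages becomes very unlikely.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for dim in (2, 32, 512):
    sims = []
    for _ in range(2000):
        # Two unrelated "collections" of 10 embeddings each, drawn i.i.d. Gaussian.
        mean_a = rng.normal(size=(10, dim)).mean(axis=0)
        mean_b = rng.normal(size=(10, dim)).mean(axis=0)
        sims.append(cosine(mean_a, mean_b))
    sims = np.array(sims)
    print(f"dim={dim:4d}  mean |cos| = {np.abs(sims).mean():.3f}")
# In high dimensions the two means are almost always nearly orthogonal.
```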
