Learning the Haystack


Embeddings, or vector representations of a document (which could be a piece of text, an image, a sound clip, etc.), can be extremely useful for making sense of large datasets. They map information into a vector space in which the distance between two vectors corresponds to the similarity of the underlying documents.
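To make that concrete, here is a minimal sketch using the Sentence Transformers library. The model name and example sentences are assumptions for illustration; any pretrained embedding model would behave the same way:

```python
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 is just a small, commonly used model chosen for
# illustration, not necessarily what the article used.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
    "Quarterly revenue grew by 12 percent.",
]

# Encode each sentence into a fixed-size vector.
embeddings = model.encode(sentences)

# Cosine similarity: nearby vectors correspond to similar sentences,
# so the first two sentences score much higher with each other.
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```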

Enterprising readers might be asking themselves how to get these embeddings in the first place. One way is simply to pay for them. This isn’t ideal for a couple of reasons:

Below is an overview of the three main training regimes I have used for creating embeddings. For more information and in-depth examples, I highly recommend the loss overview page of the Sentence Transformers library. Generally speaking, the methods fall into three categories: unsupervised methods, contrastive learning methods (positive/negative labels), and regression methods (floating-point labels); a sketch of the latter two follows.
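As a hedged sketch of the contrastive and regression categories, here is what training with the Sentence Transformers fit API can look like. The base model and the toy training pairs are assumptions for illustration; the loss classes come from the library's loss overview:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base model

# Contrastive: (anchor, positive) pairs; other examples in the same
# batch serve as the negatives.
contrastive_examples = [
    InputExample(texts=["how do I reset my password", "password reset steps"]),
    InputExample(texts=["cheap flights to Tokyo", "low-cost Tokyo airfare"]),
]
contrastive_loader = DataLoader(contrastive_examples, batch_size=2, shuffle=True)
contrastive_loss = losses.MultipleNegativesRankingLoss(model)

# Regression: pairs with a floating-point similarity label in [0, 1].
regression_examples = [
    InputExample(texts=["a man is eating", "a person eats"], label=0.9),
    InputExample(texts=["a man is eating", "stock prices fell"], label=0.1),
]
regression_loader = DataLoader(regression_examples, batch_size=2, shuffle=True)
regression_loss = losses.CosineSimilarityLoss(model)

# Either objective (or both) can be passed to fit().
model.fit(
    train_objectives=[
        (contrastive_loader, contrastive_loss),
        (regression_loader, regression_loss),
    ],
    epochs=1,
)
```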

One approach to vectorizing a document, image, or other blob of information is to use an autoencoder. An autoencoder is a pair of networks, an encoder and a decoder, trained so that the decoder can reconstruct the input from the encoder's compressed representation; in effect it learns a lossy compression function, and the encoder's output serves as the embedding. It can be considered an unsupervised method, since each item is its own label.
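A minimal sketch in PyTorch, assuming flattened fixed-size inputs; the dimensions and the random training data are placeholders:

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 784, embed_dim: int = 32):
        super().__init__()
        # Encoder compresses the input down to the embedding.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )
        # Decoder tries to reconstruct the input from the embedding.
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder data: each item is its own label, i.e. the
# reconstruction target is the input itself.
batch = torch.rand(64, 784)
for _ in range(10):
    reconstruction = model(batch)
    loss = loss_fn(reconstruction, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, the encoder output is the embedding.
embeddings = model.encoder(batch)  # shape: (64, 32)
```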
