# How Much Data is Enough for Finetuning an LLM?

submitted by
Style Pass
2024-09-04 16:00:05

There's no shortage of analogies for explaining what an LLM is capable of - one of the best, though, comes from a New Yorker article that describes it as a "blurry JPEG of the web". This metaphor is particularly useful for capturing many of the technical aspects of what an LLM really "is". In a very quick and handwavey way, when we talk about an LLM, we're typically talking about two principal components of interest: first, an algorithm that is designed to manipulate a large blob of numbers, and second, the particular arrangement of that blob of numbers.

To make it really, really dumbed down: this large blob of numbers is basically some big structure of values (think lists of lists of numbers, something you could just as easily punch into a JSON file). The values start out random, and the algorithm tweaks them ever so carefully such that, when you encode an input text and bang it against these numbers iteratively, a single number is punched out the other side. That number is the index of the predicted next token in some internal dictionary (e.g. the input text is "I like to eat fruits like an ..." and the guessed word is index 897, or "apple"). The work product, in essence, is the blob of numbers.
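The "index into an internal dictionary" step can be sketched in a few lines. This is purely illustrative: the vocabulary and the score vector below are made up, and a real LLM scores tens of thousands of tokens rather than four, but the final step really is just "pick the index with the highest score":

```python
# Toy sketch (illustrative only): a tiny made-up vocabulary standing in for
# the model's internal dictionary, and a made-up score vector standing in
# for the model's output. The highest-scoring index selects the next token.
vocab = ["banana", "apple", "car", "orange"]

def next_token(scores):
    """Return (index, word) for the highest-scoring vocabulary entry."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index, vocab[best_index]

# Pretend the model produced these scores for "I like to eat fruits like an ..."
index, word = next_token([0.1, 3.2, -1.5, 2.8])
# index is 1, word is "apple"
```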

These blobs of numbers can be used to do things like the work that ChatGPT does, where the goal is to generate that next token. But the process can also be paused a step earlier, before the most likely next word is produced, to do a semi-related task: capturing the "essence" of the input text. This is called an embedding, or, for these models, more accurately, a sentence embedding. A sentence embedding looks like a long list of numbers of some fixed length (e.g. a five-dimensional output could look like [-0.86, -0.64, 0.42, 0.86, 0.46]). These models typically have hundreds of dimensions - even the smallest heavily used embedder outputs 384 dimensions. In essence, these numbers represent the "home address" of a text: some other text that conveys the same or a very similar meaning would produce a list of numbers that is very close, pairwise (e.g. in a simple example, "I like apples" might produce [0.1, 0.3, 0.4], "I love apples" might produce [0.1, 0.2, 0.4], and "That man is mean" might produce [0.8, 0.6, 0.8]).
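The "home address" intuition can be made concrete with a similarity measure. Cosine similarity is a common choice for comparing embeddings (though it is only one of several options). Using the toy three-dimensional vectors from the example above, which are invented for illustration and not real model outputs, the two apple sentences score as much closer to each other than either does to the unrelated one:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The toy embeddings from the text (made up, not from a real model).
like_apples = [0.1, 0.3, 0.4]   # "I like apples"
love_apples = [0.1, 0.2, 0.4]   # "I love apples"
mean_man = [0.8, 0.6, 0.8]      # "That man is mean"

close = cosine_similarity(like_apples, love_apples)  # near 1.0
far = cosine_similarity(like_apples, mean_man)       # noticeably lower
```

With a real embedder you would get these vectors from a model (for example via the sentence-transformers library) rather than writing them by hand, but the comparison step is the same.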