How much LLM training data is there, in the limit?

Recent large language models such as Llama 3 and GPT-4 are trained on gigantic amounts of text. Will it be possible to get more training data for the next generation of models? To try to answer that, here is my best estimate of all the text that exists in the world.

At 15 trillion tokens, current LLM training sets are pretty close to using all available high-quality English text. Possibly you could get to 25–30T using less accessible sources (e.g. more books, transcribed audio, Twitter). Adding non-English data, you might get close to 60T. That seems like the upper limit.
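
As a rough yardstick for numbers like these: a common rule of thumb is that English text runs about four characters (roughly 0.75 words) per token under BPE-style tokenizers. The sketch below shows how a raw corpus size converts into a token count at that rate; the 60 TB input and the 4-characters-per-token ratio are illustrative assumptions, not figures taken from this post.

```python
# Back-of-envelope conversion from raw text size to token count.
# Assumes ~4 characters per token for English text under a BPE-style
# tokenizer (a common rule of thumb, not a figure from this post).

CHARS_PER_TOKEN = 4.0

def tokens_from_bytes(corpus_bytes: float) -> float:
    """Estimate token count from raw text size in bytes (1 byte ~ 1 char for English)."""
    return corpus_bytes / CHARS_PER_TOKEN

# Example: a hypothetical 60 TB pile of deduplicated English text
print(f"{tokens_from_bytes(60e12) / 1e12:.0f}T tokens")  # -> 15T tokens
```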

Private data is much larger. Facebook posts alone likely come to 140T, Google has around 300T tokens in Gmail, and with all private data everywhere you could maybe reach 2,000 trillion tokens. This data seems clearly off limits to responsible private actors, though it’s worth keeping in mind that it exists. It’s potentially an option open to an intelligence agency or another nefarious actor.¹
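
Per-source estimates of this kind presumably come from multiplying an audience size by the amount of text each account produces. The sketch below illustrates that shape of calculation; the user count and words-per-user values are hypothetical placeholders chosen for illustration, not the inputs behind the estimates above.

```python
# Generic per-source estimate: users x text per user -> tokens.
# All inputs below are hypothetical placeholders for illustration;
# they are not the figures behind the estimates in this post.

WORDS_PER_TOKEN = 0.75  # ~0.75 words per token is a common rule of thumb

def source_tokens(num_users: float, words_per_user: float) -> float:
    """Estimate total tokens contributed by one private data source."""
    return num_users * words_per_user / WORDS_PER_TOKEN

# e.g. ~3 billion accounts, each with ~35,000 words of lifetime posts (assumed)
print(f"{source_tokens(3e9, 35_000) / 1e12:.0f}T tokens")  # -> ~140T tokens
```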

If we assume that (A) commercial model makers can’t train on private data and (B) more data is required for further performance gains, then future progress must rely heavily on synthetic data. This conclusion will shock nobody who’s been paying attention, but I still found it useful to work through exactly where the limits of the current approach are.
