Chinchilla data-optimal scaling laws: In plain English

Summary: Chinchilla showed that we need to use around 11× more data during training than was used for GPT-3 and similar models. This means that we need to source, clean, and filter around 33TB of text data for a 1T-parameter model.

(Note: There is a complementary scaling law for compute built into these findings, but that is outside the scope of my current focus.)

In plain English, the GPT-3/Kaplan scaling laws said that 300B tokens can be used to train an LLM of size 175B parameters. So, we need around 1.7 text tokens per parameter.

In Mar/2022, DeepMind (Chinchilla paper) found new data scaling laws (also called the Chinchilla or Hoffmann scaling laws) for ‘data-optimal’ LLMs:

In plain English, the Chinchilla/Hoffmann scaling laws say that 1,400B (1.4T) tokens should be used to train a data-optimal LLM of size 70B parameters. So, we need around 20 text tokens per parameter.
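
To make the two ratios concrete, here is a minimal Python sketch (the function name is mine; it only reproduces the token-to-parameter arithmetic quoted above, not the full Chinchilla loss fit):

def tokens_per_parameter(training_tokens: float, parameters: float) -> float:
    # Ratio of training tokens to model parameters.
    return training_tokens / parameters

# GPT-3 / Kaplan-era setup: 300B tokens for a 175B-parameter model
print(tokens_per_parameter(300e9, 175e9))   # ~1.7 tokens per parameter

# Chinchilla: 1.4T tokens for a 70B-parameter model
print(tokens_per_parameter(1.4e12, 70e9))   # 20.0 tokens per parameter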

Keeping the original 300B tokens, GPT-3 should have been only 15B parameters (300B tokens ÷ 20 tokens per parameter). That is around 11× smaller than the actual 175B-parameter model.
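
Applied in both directions, the same 20-tokens-per-parameter heuristic yields both headline numbers in this post. A rough sketch (the constant and function names are mine, and this ignores the compute side of the law entirely):

CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb ratio from the Chinchilla paper

def data_optimal_params(training_tokens: float) -> float:
    # Model size that matches a fixed token budget under the ~20:1 heuristic.
    return training_tokens / CHINCHILLA_TOKENS_PER_PARAM

def required_tokens(parameters: float) -> float:
    # Token budget needed to train a model of a given size data-optimally.
    return parameters * CHINCHILLA_TOKENS_PER_PARAM

print(data_optimal_params(300e9) / 1e9)   # 15.0 -> ~15B params for GPT-3's 300B-token budget
print(required_tokens(1e12) / 1e12)       # 20.0 -> ~20T tokens for a 1T-parameter model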
