This post is a technical deep-dive into our initial text data curation results. For a brief, high-level summary of this work, we direct readers to our short companion piece.
The themes and some of the content of this post are similar to our Image-Text Curation Deep-Dive, but here we show that we can also obtain substantial gains applying our data curation pipeline to LLMs.
We share the results of applying our data curation pipeline to text data. Our pipeline is scalable and productionized, and it integrates a suite of approaches including model-based filtering, embedding-based filtering, synthetic data, and more. We applied our curation pipeline to RedPajama V1 (RPJv1) and obtained substantial improvements in large language model (LLM) quality, training speed, and inference efficiency for model sizes of up to 2.7B parameters and training budgets of up to 180B tokens.
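To make the shape of such a pipeline concrete, the sketch below shows how a model-based quality filter and an embedding-based near-duplicate filter might be chained over a corpus. It is a minimal illustration only: `quality_score`, `embed`, and the thresholds are hypothetical placeholders, not the models or settings used in our actual pipeline.

```python
import numpy as np

# Hypothetical stand-ins for the real components. A production pipeline
# would use a trained quality classifier and a learned embedding model;
# these placeholders exist only to make the sketch runnable.
def quality_score(doc: str) -> float:
    # Placeholder heuristic (type-token ratio) in place of a model-based filter.
    words = doc.split()
    return len(set(words)) / max(1, len(words))

def embed(doc: str) -> np.ndarray:
    # Placeholder deterministic embedding in place of a learned embedder.
    rng = np.random.default_rng(abs(hash(doc)) % (2**32))
    v = rng.standard_normal(16)
    return v / np.linalg.norm(v)

def curate(docs, quality_threshold=0.5, sim_threshold=0.9):
    """Keep documents that pass the quality filter and are not
    near-duplicates (by cosine similarity) of an already-kept document."""
    kept, kept_embs = [], []
    for doc in docs:
        if quality_score(doc) < quality_threshold:
            continue  # model-based filtering
        e = embed(doc)
        if any(float(e @ k) > sim_threshold for k in kept_embs):
            continue  # embedding-based filtering (near-duplicate removal)
        kept.append(doc)
        kept_embs.append(e)
    return kept

if __name__ == "__main__":
    corpus = [
        "An example web document about training language models.",
        "An example web document about training language models.",
        "spam spam spam spam",
    ]
    print(curate(corpus))  # keeps only the first document
```

In practice each stage runs as a separate distributed job over the corpus, but the ordering idea is the same: cheap filters prune documents before more expensive embedding-based steps see them.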
We trained standard MPT-style transformer models on our curated data (which we refer to as DAIT for brevity), on RPJv1, and on other publicly available, high-quality pretraining corpora such as RefinedWeb, FineWeb, FineWeb-Edu, and DataComp-LM (DCLM). We compared their performance across 15 benchmark language model evaluation datasets. By using our pipeline to curate training data, we are able to: