Snowflake Arctic Cookbook Series: Arctic’s Approach to Data

On April 24, we released Snowflake Arctic with a key goal in mind — to be truly open. In line with that goal, the Snowflake AI Research team is writing a series of cookbooks to describe how to pretrain, fine-tune, evaluate, and serve large-scale MoEs such as Arctic. We will share our journey of training the Arctic model, our findings related to sourcing and composing pre-training data, designing MoE architectures, co-designing models with training and inference systems in mind, and methods for fine-tuning and evaluating the models.

Arctic is trained on 3.5 trillion tokens sourced from the public domain, encompassing web content, code & SQL, STEM, and more.

In this blog, we cover the origins of our data sources and the methodologies we employed to raise them to the desired quality, providing an overview of how we tackled the first three challenges head-on. We describe our strategies for 1) assembling vast quantities of web data, 2) gathering high-quality enterprise-focused datasets, and 3) applying data processing techniques and pipeline enhancements to refine data quality. By sharing an insider's view of the data sources, techniques, and configurations that have proven successful for us, we aim to provide valuable insights to our readers.
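
To make the third strategy a bit more concrete before we get into details, here is a minimal sketch of the kind of exact deduplication and heuristic quality filtering commonly applied to web-scale pretraining data. Every function name and threshold below is an illustrative assumption for exposition, not a description of Arctic's actual pipeline.

```python
# Illustrative sketch of web-data quality refinement: exact dedup plus
# simple heuristic filters. Thresholds are hypothetical examples only.
import hashlib


def passes_quality_heuristics(text: str) -> bool:
    words = text.split()
    if len(words) < 50:  # too short to carry useful signal
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):  # likely boilerplate or gibberish
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and len(set(lines)) / len(lines) < 0.7:  # highly repetitive page
        return False
    alpha_frac = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_frac < 0.6:  # mostly symbols or leftover markup
        return False
    return True


def dedup_and_filter(docs):
    """Drop exact duplicates by content hash, then apply heuristic filters."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if passes_quality_heuristics(doc):
            yield doc
```

Real pipelines at this scale layer on much more, such as fuzzy deduplication (e.g., MinHash) and model-based quality scoring, but the shape is the same: cheap filters first, expensive ones later.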

At the end, we offer a sneak peek at how we'll address the fourth challenge, data composition and curriculum, which we will dive into in an upcoming blog on Arctic data.
