Three volunteers. A couple of weeks of work. That’s what it took to add a language to BigScience BLOOM, the open multilingual language model with no fewer than 176 billion parameters that was released in mid-2022. BLOOM was meant to be an open, multilingual alternative to GPT-3. In the end, 46 languages from all over the world made it into the dataset BLOOM was trained on. Even relatively small languages like Basque and Catalan were included. Dutch was not. How is that possible?
It all started in 2021. A group of more than 1,000 researchers united in the virtual research collective BigScience. Probably spurred by the capabilities of GPT-3, and concerned that large language models were increasingly being kept behind closed doors by big tech companies, they took part in a one-year open research workshop on multilingual large language models, starting in May 2021.
Funded by the French government and the French-American start-up Hugging Face — one of the hottest companies in the field of AI — they wanted to achieve two things: