Brain Drain: David vs Goliath

There are worries that GenAI systems may run out of fresh data as they scale. Synthetic data is an option, but using AI-generated data to train AI can degrade the model's performance. There may be a better solution. Can data quality overcome a loss of data quantity?

One of the most striking things about today's generative AI models is the enormous amount of data they train on. Meta wrote, for example, that its Llama 3 model was trained on 15 trillion tokens, equal to roughly 44 terabytes of disk space. In the case of large language models, this usually means terabytes of text from the internet, although the newest generations of multimodal models also train on video, audio, and images.
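As a rough back-of-envelope check (assuming about 3 bytes of text per token, a common rule of thumb rather than a figure Meta has published), the 15-trillion-token number does work out to roughly the size quoted above:

```python
# Back-of-envelope: estimating how 15 trillion tokens maps to on-disk text size.
# Assumes ~3 bytes of UTF-8 text per token -- a rough rule of thumb, not an
# official figure from Meta.
tokens = 15e12          # Llama 3 training set size, per Meta
bytes_per_token = 3     # assumed average
total_bytes = tokens * bytes_per_token

print(f"~{total_bytes / 1e12:.0f} TB")  # ~45 TB, in line with the ~44 TB cited above
```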

The internet, like the oceans of planet Earth, has always been viewed as an inexhaustible resource. Not only is it enormous to begin with, but billions of users are adding fresh text, audio, images, and video every day. Recently, however, researchers have begun to examine the impact this data consumption is having.

“In a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources,” write the authors of a paper from the Data Provenance Initiative, a volunteer collective of AI researchers from around the world, including experts from schools like MIT and Harvard, and advisors from companies like Salesforce and Cohere. In some of the largest and most popular open-data collections typically used to train large AI models, as much as 45% of the content is now restricted. “If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems.”
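Many of these restrictions show up as robots.txt directives aimed at AI crawlers, alongside changes to sites' terms of service. As a minimal sketch, using Python's standard library and a hypothetical site URL (this is not the Data Provenance Initiative's measurement tooling), here is how one could check whether a crawler such as OpenAI's GPTBot is allowed to fetch a page:

```python
# Hedged illustration: checking whether a site's robots.txt permits a given
# AI crawler. Generic stdlib usage, not the paper's methodology.
from urllib.robotparser import RobotFileParser

def crawler_allowed(site: str, user_agent: str, path: str = "/") -> bool:
    """Return True if `user_agent` may fetch `path` according to the site's robots.txt."""
    parser = RobotFileParser()
    parser.set_url(f"{site}/robots.txt")
    parser.read()                               # fetch and parse the robots.txt file
    return parser.can_fetch(user_agent, f"{site}{path}")

# Example with a placeholder domain: has this site opted out of GPTBot?
print(crawler_allowed("https://example.com", "GPTBot"))
```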
