Bluesky, AI, and the battle for consent on the open web

submitted by
Style Pass
2024-11-27 17:00:03

Daniel van Strien, a machine learning librarian at Hugging Face, took a million Bluesky posts and turned them into a dataset expressly for training AI models:

“This dataset could be used for ‘training and testing language models on social media content, analyzing social media posting patterns, studying conversation structures and reply networks, research on social media content moderation, [and] natural language processing tasks using social media data,’ the project page says. ‘Out of scope use’ includes ‘building automated posting systems for Bluesky, creating fake or impersonated content, extracting personal information about users, [and] any purpose that violates Bluesky's Terms of Service.’”

There was an outcry among users, who felt that they hadn’t consented to this use of their posts. The idea that a generative AI model could be used to build new content based on users’ work without their participation, consent, or awareness was appalling.

Van Strien eventually saw that his act was a violation and subsequently removed the dataset, writing an apology in a Bluesky post.