Soylent Green is people

submitted by

Style Pass

2024-12-12 18:00:02

Recently a group of data scientists at Hugging Face created a dataset of curated Bluesky posts. The publication of this data has made a lot of people very angry and has been widely regarded as a bad move. The dataset contained one million posts from the Bluesky firehose and was intended as a standard dataset for evaluating the effectiveness of various moderation tooling. The dataset was removed within hours of publication, but the damage to the community was already done.

Today I'm going to talk about the nuances involved with AI and dig into the 'danger' of one's posts being in these datasets. I'm also going to cover the social/ethical implications of this dataset, why people cried out against it, and how it frankly could have been handled better.

Before we start, I want to clarify that I don't have a "pro-AI bias". I want to pick apart this ball of mud and discuss the complicated nuances involved with what AI is, what datasets are, and how the intersection of technology and social pressures is creating ethical dilemmas that society is not yet prepared to handle.