This dataset contains Bluesky Social data. By accessing this dataset, you agree to adhere to Bluesky's Terms of Service (https://bsky.social/about/support/tos) and use the data responsibly.
I've removed the data from this dataset since there was a lot of community pushback about its creation/uploading. I will leave the dataset repository up to allow room for discussion of how datasets can be used to help improve Bluesky and allow people to build the tools they need to build their own open models and approaches to creating feeds that work for their needs. Please feel free to continue to leave feedback in the discussions here.
This dataset contains 1 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data.
The with-language-predictions config contains the same data as the default config but with language predictions added using the glotlid model.