
Wikipedia Hugging Face Dataset using Structured Contents Snapshot

Submitted by

Style Pass

2024-09-23 13:30:02

Wikimedia Enterprise has released an early beta dataset to Hugging Face for the general public to freely use and provide feedback for future improvements. The dataset is sourced from our Snapshot API, which delivers bulk database dumps (snapshots) of Wikimedia projects; in this case, Wikipedia in English and French. It is built using our newly released Structured Contents beta, which returns more machine-readable response payloads so that users do not need to parse a massive blob of article body text.

Hugging Face is a leading platform in the AI and machine learning space, known for its tools and libraries that support the development and sharing of models and datasets. Specifically for datasets, it serves as an open-access hub where developers and researchers can upload, explore, and collaborate on datasets across various fields. Publishing our dataset on Hugging Face allows users to easily access and integrate it into their machine learning workflows, fostering innovation and enabling new applications for our data.

The dataset we’re publishing to Hugging Face contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured JSON files that follow a consistent schema and are compressed as zip archives. Each JSON object holds the content of one full Wikipedia article, stripped of extra markdown and non-prose sections (references, etc.). The fields included in the current release are listed below in the dataset fields section.
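
As a rough illustration of working with such an archive, the sketch below streams article objects out of a zip file using only the Python standard library. It assumes each file inside the archive is UTF-8 text with one JSON object per line; that layout, the helper name `iter_articles`, and the archive path are assumptions made for illustration and are not confirmed by this post.

```python
import io
import json
import zipfile
from typing import Iterator


def iter_articles(zip_path: str) -> Iterator[dict]:
    """Yield one dict per article from every file in a zip archive.

    Assumption: each member is UTF-8 text holding one JSON object per
    line (JSON Lines). Adjust the parsing if the real layout differs.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            with archive.open(member) as raw:
                # Decode lazily so the whole file never sits in memory.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:  # ignore blank lines
                        yield json.loads(line)
```

Because the function is a generator, a caller can loop over `iter_articles("snapshot.zip")` (a hypothetical path) and inspect each article's keys without loading the full snapshot at once.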