Omar Kamali | Founder, CEO, Strategic Technology Advisor

submitted by
Style Pass
2024-09-05 02:30:21

I recently found myself whipping up notebooks just to pull Hugging Face datasets locally, annotate or modify them, and upload them again. This happened often enough that I made a CLI tool out of it, which I've been using successfully for the last few months.

While Hugging Face uses open formats, I found the official toolchain relatively low-level and not well suited to quick operations like these.

To that end, I'm happy to announce that I released `datapluck`, a tool to export (download) datasets from Hugging Face into CSV, TSV, JSON, JSONL, Parquet, Google Sheets and SQLite (SQLite is super cool!). It also lets you import (upload) from any of these formats back into Hugging Face. It's perfect for moving your Hugging Face datasets across different media.
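To give a feel for the kind of round trip this makes possible, here is a minimal sketch using only the Python standard library. It is not `datapluck`'s implementation and does not call its API: the sample rows, the `train` table name and the helper functions are all hypothetical, and it assumes flat records that share the same keys.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample rows standing in for a downloaded dataset split.
rows = [
    {"id": 1, "text": "hello world", "label": "greeting"},
    {"id": 2, "text": "see you later", "label": "farewell"},
]


def rows_to_sqlite(rows, conn, table="train"):
    """Write flat dicts into a SQLite table, one untyped column per key."""
    columns = list(rows[0].keys())
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in rows],
    )
    conn.commit()


def sqlite_to_rows(conn, table="train"):
    """Read the table back into a list of dicts, in insertion order."""
    cur = conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid')
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, rec)) for rec in cur.fetchall()]


def rows_to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r) + "\n" for r in rows)


def rows_to_csv(rows):
    """Serialize rows as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Export to SQLite, fix an annotation with plain SQL, then re-export.
conn = sqlite3.connect(":memory:")
rows_to_sqlite(rows, conn)
conn.execute('UPDATE "train" SET label = ? WHERE id = ?', ("goodbye", 2))
round_tripped = sqlite_to_rows(conn)
jsonl_text = rows_to_jsonl(round_tripped)
csv_text = rows_to_csv(round_tripped)
```

The appeal of SQLite as an intermediate format shows up in the `UPDATE` step: once the rows are in a table, an annotation fix is a single SQL statement rather than a notebook.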

The idea behind `datapluck` is to make data wrangling and annotation workflows much easier, and to allow automated dataset updates from the command line, which is ideal for CI/CD scenarios. As a bonus, `datapluck` is also a package that exposes two functions, `export_dataset` and `import_dataset`, for use in programmatic settings.
