GitHub - orf/pypi-data: Automatically updated PyPi API data, available in bulk via git

Submitted by Style Pass on 2022-08-13 18:30:10

Fetching data from the PyPI API in bulk is non-trivial, and the BigQuery dataset can only be queried through BigQuery. The entire package dataset is not large and easily fits into the memory of most developer machines, so it is much more fluid to explore the data with Pandas than with the heavyweight (and sometimes expensive) BigQuery.

Each package has its own directory within release_data/, prefixed with the first two lowercased characters of the package name. Within that directory, each package has a single gzip-compressed file containing the full API response for all of the package's releases.
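As a rough illustration of reading one of these files, here is a minimal sketch using only the Python standard library. It assumes the gzip-compressed file holds JSON text (as a PyPI API response normally would); the helper names `package_prefix` and `load_gzipped_json` are hypothetical and not part of the repository, and the exact file name and path inside release_data/ are deliberately left to the caller.

```python
import gzip
import json
from pathlib import Path


def package_prefix(name: str) -> str:
    """Return the first two lowercased characters of a package name.

    Hypothetical helper: mirrors the prefix rule described above.
    """
    return name.lower()[:2]


def load_gzipped_json(path: Path):
    """Read a gzip-compressed file and decode its contents as JSON.

    Assumption: the file stores UTF-8 JSON text, as an API response would.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Once you have located a package's file under release_data/, pass its path to `load_gzipped_json`. The decoded object can then be handed to Pandas for exploration, for instance with `pandas.json_normalize`.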

PyPI also publishes a serial changelog of events for all packages. These events are available in the changelog_data/ directory.
