Querying every file in every release on the Python Package Index (redux)

submitted by
Style Pass
2023-11-20 13:30:06

Seth Larson wrote a great blog post on querying a PyPI dataset to look for trends in the use of memory-safe languages in Python.

Check out Seth’s article for more information on the dataset (and it’s a good read!). It caught our eye because it makes use of DuckDB to clean the data for analysis.

That’s right up our alley here in Ibis land, so let’s see if we can duplicate Seth’s results (and then continue on to plot them!)

Seth showed (and then safely decomposed) a nested curl statement, and that’s always a viable option, but we’re in Python land, so why not grab the filenames using urllib3?
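
As a rough sketch of that step, the snippet below downloads a plain-text index of parquet file URLs with urllib3 and splits it into a list. The index URL and the helper names `parse_links` and `fetch_parquet_urls` are assumptions made for illustration; check Seth's article for the authoritative location of the dataset.

```python
# Assumed location of the newline-separated list of parquet URLs.
DATASET_LINKS_URL = (
    "https://raw.githubusercontent.com/pypi-data/data/main/links/dataset.txt"
)


def parse_links(text: str) -> list[str]:
    """Split a newline-separated index into a clean list of URLs."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_parquet_urls(url: str = DATASET_LINKS_URL) -> list[str]:
    """Download the index with urllib3 and return the parquet URLs."""
    # Imported here so parse_links stays usable without urllib3 installed.
    import urllib3

    http = urllib3.PoolManager()
    resp = http.request("GET", url)
    if resp.status != 200:
        raise RuntimeError(f"unexpected HTTP status {resp.status} for {url}")
    return parse_links(resp.data.decode("utf-8"))
```

Calling `parquet_files = fetch_parquet_urls()` would then give a list of URLs to hand to the next step.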

DuckDB is clever enough to grab only the parquet metadata. This means we can use read_parquet to create a lazy view of the parquet files and then build up our expression without downloading everything beforehand!
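
A minimal sketch of what that lazy view might look like with Ibis, assuming the default DuckDB backend and a list of parquet URLs gathered beforehand. The helper name `lazy_pypi_view` and the table name `"pypi"` are illustrative choices, not something taken from Seth's article.

```python
def lazy_pypi_view(parquet_urls, table_name="pypi"):
    """Return a lazy Ibis table expression over remote parquet files.

    Nothing is downloaded up front beyond what DuckDB needs to learn
    the schema; data is only fetched when an expression is executed.
    """
    urls = list(parquet_urls)
    if not urls:
        raise ValueError("expected at least one parquet URL")

    # Imported lazily so the input check above works without ibis installed.
    import ibis

    # ibis.read_parquet uses the default backend (DuckDB) and accepts
    # either a single path or a list of paths/URLs.
    return ibis.read_parquet(urls, table_name=table_name)
```

Starting with a small slice, for example `pypi = lazy_pypi_view(parquet_files[:2])`, keeps experimentation quick; `pypi.schema()` should then show the available columns without pulling the full files.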

Let’s break down what we’re looking for. As a high-level view of the use of compiled languages, Seth uses file extensions as an indicator that a given language is used in a Python project.
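
To make the idea concrete, here is a small pure-Python sketch of extension-based detection. The extension-to-language mapping is a hypothetical, deliberately short example and not the list Seth uses; the same logic could be expressed as a filter on the path column of the dataset.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical mapping for illustration only; the real analysis covers
# more extensions and languages.
EXTENSION_LANGUAGE = {
    ".c": "C",
    ".cc": "C++",
    ".cpp": "C++",
    ".rs": "Rust",
    ".go": "Go",
    ".f90": "Fortran",
}


def language_for(path):
    """Return the language implied by a file's extension, or None."""
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_LANGUAGE.get(suffix)


def count_languages(paths):
    """Count how many files point at each compiled language."""
    return Counter(
        lang for lang in map(language_for, paths) if lang is not None
    )
```

For example, `count_languages(["pkg/src/lib.rs", "pkg/__init__.py"])` counts one Rust file and ignores the Python one. An extension is only a proxy for language use, which is why this works as a high-level view rather than an exact measurement.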
