Scott Horvath's Weblog

2024-07-18 08:00:06

The State Government of Victoria’s “open data platform” DataVic is promoted as “the place to discover and access Victorian Government open data.”

However, every time I’ve attempted to use it, I’ve struggled to discover any interesting data that is more than a twenty-row Excel table.

Having recently watched a demonstration on how to query Google Gemini Pro’s large context window, I realised that with some web scraping and data preparation, I could create a large text file profiling all available datasets on DataVic. Feeding this text file into Gemini’s context window might allow me to quickly perform discovery analyses of the DataVic library.

I wrote a script that downloaded every available CSV from the DataVic platform and created a “master_data_profile.txt” that contains the profile of every CSV dataset. The profiling for each CSV included:

I had trouble programmatically opening Excel files (.xls and .xlsx), which were usually small tables anyway, so I decided to limit the analysis to simple CSVs. Some of the CSV API endpoints referenced by DataVic were also broken.
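To make the profiling step concrete, here is a minimal sketch of how a folder of already-downloaded CSVs could be summarised into a single text file. It is not the original script: the function names `profile_csv` and `build_master_profile` are hypothetical, and the profile fields (row count, column names, a few sample rows) are assumptions chosen for illustration rather than the fields actually used. Files that cannot be read are skipped and counted, which is one simple way to tolerate broken downloads.

```python
import csv
from pathlib import Path


def profile_csv(path, sample_rows=3):
    """Return a short plain-text profile of one CSV file.

    The fields below (rows, columns, samples) are illustrative assumptions.
    """
    # utf-8-sig strips a leading BOM; undecodable bytes are replaced, not fatal.
    with open(path, newline="", encoding="utf-8-sig", errors="replace") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty file -> empty header
        samples = []
        row_count = 0
        for row in reader:
            row_count += 1
            if len(samples) < sample_rows:
                samples.append(row)
    lines = [
        f"FILE: {Path(path).name}",
        f"ROWS: {row_count}",
        f"COLUMNS ({len(header)}): {', '.join(header)}",
    ]
    lines += [f"SAMPLE: {', '.join(r)}" for r in samples]
    return "\n".join(lines)


def build_master_profile(csv_dir, out_path="master_data_profile.txt"):
    """Profile every CSV under csv_dir and write one combined text file.

    Returns (profiled, skipped) so unreadable files are visible, not silent.
    """
    profiled, skipped = 0, 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(csv_dir).rglob("*.csv")):
            try:
                out.write(profile_csv(path) + "\n\n")
                profiled += 1
            except (OSError, csv.Error):
                skipped += 1
    return profiled, skipped
```

Calling `build_master_profile("downloads")` on a hypothetical `downloads` folder would write `master_data_profile.txt` to the working directory. Keeping each profile to a few lines matters here, because the combined file has to fit inside the model's context window.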
