Analyzing multi-gigabyte JSON files locally

Submitted by
Style Pass
2023-03-16 09:00:04

I recently had the pleasure of having to analyse multi-gigabyte JSON dumps as part of a project. JSON itself is actually a rather pleasant format to consume, as it’s human-readable and there is a lot of tooling available for it. jq allows expressing sophisticated processing steps in a single command line, and Jupyter with Python and Pandas allows easy interactive analysis to quickly find what you’re looking for.

However, with multi-gigabyte files, analysis becomes considerably more difficult. Running a single jq command takes a long time. When you’re ~trial-and-error~ iteratively building jq commands as I do, you’ll quickly grow tired of waiting about a minute for your command to finish, only to find out that it didn’t in fact return what you were looking for. Interactive analysis is similar. Reading all 20 gigabytes of JSON takes a fair amount of time. You might find out that the data doesn’t fit into RAM (which it well might; JSON is a human-readable format after all), or end up having to restart your Python kernel, which means you’ll have to endure the loading time again.

Of course, there are cloud-based offerings built on Apache Beam, Flink and many others. However, customer data doesn’t go on cloud services on my authority, so that’s out. Setting up an environment like Flink locally is doable, but it is a lot of effort for a one-off analysis.
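
To make the memory problem concrete, here is a minimal sketch of one local workaround: reading the file one record at a time instead of loading it whole. It assumes the dump is newline-delimited JSON (one object per line), which may not hold for every dump, and that the counted field holds scalar values. The function names and the `status` field in the usage note are hypothetical, chosen only for illustration, and this is not necessarily the approach taken in the rest of this article.

```python
import json
from collections import Counter
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed record at a time from a newline-delimited JSON file.

    Assumption: each non-blank line is a complete JSON object.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_field(path: str, field: str) -> Counter:
    """Count the values of `field` across all records.

    Only one record is held in memory at a time, so the file size
    is not bounded by available RAM. Records that lack the field
    are counted under None. Assumes the field's values are hashable
    scalars (strings, numbers, booleans, null).
    """
    counts: Counter = Counter()
    for record in iter_records(path):
        counts[record.get(field)] += 1
    return counts
```

Calling `count_field("dump.ndjson", "status")` would then return a `Counter` of the values seen in that field. This keeps memory use flat, but every query still reads the whole file, so it does nothing for the waiting time described above.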
