
What is actually meant when referring to parquet row-group size?



I am starting to work with the Parquet file format. The official Apache site recommends large row groups of 512 MB to 1 GB (here). Several online sources (e.g. this one) suggest that the default row group size is 128 MB.
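For reference, my understanding is that this default comes from parquet-mr's parquet.block.size setting, which Spark passes through as a write option. A minimal sketch of overriding it when writing from PySpark; the output path is a placeholder and I haven't verified this is honored on Glue:

    # df is an existing DataFrame; parquet.block.size is the parquet-mr
    # target row-group size in bytes (128 MB by default, as I understand it).
    (df.write
       .option("parquet.block.size", 512 * 1024 * 1024)  # aim for ~512 MB row groups
       .parquet("s3://my-bucket/output/"))                # placeholder path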

I have a large number of Parquet files that I will later process downstream with PySpark on AWS Glue. These files have very small row groups. I cannot control the files I'm starting with, but I want to combine their row groups so as to have "more efficient" files prior to downstream processing. (Why? These files will be uploaded to S3 and processed with Spark; my understanding is that Spark reads one row group at a time, so many small row groups mean many more I/O operations, which is inefficient. If this assumption is invalid, please educate me.)
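For what it's worth, the consolidation I have in mind is something like the following pyarrow sketch. The filenames and the row_group_size value are placeholders, it assumes each file fits in memory, and note that pyarrow counts row_group_size in rows, not bytes:

    import pyarrow.parquet as pq

    # Read the whole file into memory (assumes it fits; mine are ~85 MB compressed).
    table = pq.read_table("input.parquet")  # placeholder filename

    # Rewrite with fewer, larger row groups. row_group_size is a row count,
    # not a byte size, so it has to be tuned to the average row width.
    pq.write_table(
        table,
        "combined.parquet",        # placeholder filename
        row_group_size=1_000_000,  # placeholder row count
        compression="snappy",
    )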

Let's consider just one of these files for this question. It's compressed with Snappy and is 85 MB on disk. When I inspect its metadata with the pqrs tool, it reports that the file has 55,733 records spread across 1,115 row groups, so each row group works out to only around 500 kB.
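The same metadata can also be read programmatically; a minimal pyarrow sketch, with the filename again a placeholder:

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("input.parquet").metadata  # placeholder filename
    print(meta.num_rows, meta.num_row_groups)        # e.g. 55733 rows, 1115 row groups
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        # total_byte_size is the uncompressed size of the row group's column data
        print(i, rg.num_rows, rg.total_byte_size)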
