
When duckdb meets dplyr!


I like DuckDB 🦆. I am excited to see that the fantastic duckplyr package now makes it possible to use DuckDB with dplyr, giving us another way to bridge the two.

In this short post, I will show how duckplyr can be used to query parquet files hosted on an S3 bucket. I will use the duckplyr_df_from_parquet() function to read the data and then use dplyr verbs to summarize it.
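To follow along, a setup along these lines should work; installing from CRAN and loading the packages are standard, while the httpfs extension is what DuckDB uses under the hood for S3 access (that it is loaded on demand here is my assumption, not something stated in the post):

    # install.packages("duckplyr")  # CRAN release
    library(duckplyr)
    library(dplyr)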

Now, we can read the data from the parquet file stored in the S3 bucket with the duckplyr_df_from_parquet() function. We can also specify the class of the output data frame with class = class(tibble()). In this case, I will use a tibble.
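As a minimal sketch of that call, assuming DuckDB's httpfs extension is available for s3:// paths; the bucket and file name are placeholders, not the ones from the post:

    library(tibble)

    # Read a parquet file directly from S3; bucket and key are hypothetical.
    dat <- duckplyr_df_from_parquet(
      "s3://my-bucket/data.parquet",
      class = class(tibble())  # return a tibble rather than a plain data.frame
    )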

From what I understand, all the data is pulled into memory. This could be a problem if the data is too large. One thing we can do is summarize the data to find out how many rows are in the table.
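A short sketch, reusing the hypothetical dat from above; duckplyr hands the summarise() to DuckDB where it can:

    # Count the rows with an ordinary dplyr verb.
    dat |>
      summarise(n = n())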

Let’s try with a much larger dataset. I will use the NYC taxi data from 2019. The data is partitioned by month and stored as partitioned parquet files. But before we can proceed, we need to change the S3 endpoint.
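duckplyr itself does not expose an endpoint setting that I know of; one way to illustrate what “changing the endpoint” means is DuckDB’s own httpfs configuration, sketched here through the duckdb and DBI packages (the endpoint value and bucket layout are placeholders):

    library(DBI)
    library(duckdb)

    con <- dbConnect(duckdb())
    dbExecute(con, "INSTALL httpfs; LOAD httpfs;")

    # Point S3 requests at a non-AWS, S3-compatible endpoint (placeholder value).
    dbExecute(con, "SET s3_endpoint='storage.example.com';")
    dbExecute(con, "SET s3_region='us-east-1';")

    # A glob can then pick up the monthly partitions (hypothetical layout).
    taxi_count <- dbGetQuery(con, "
      SELECT count(*) AS n
      FROM read_parquet('s3://nyc-taxi/2019/*.parquet')
    ")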
