
How we made querying Pandas DataFrames with chDB 87x faster

submitted by
Style Pass
2024-08-30 08:00:04

Get started with ClickHouse Cloud today and receive $300 in credits. To learn more about our volume-based discounts, contact us or visit our pricing page.

It’s been almost two years since I started working on chDB—an embedded version of ClickHouse that runs in-process—and six months since chDB joined ClickHouse. In this blog post, I want to share some performance improvements I’ve worked on in the last few months.

When I started building chDB, one of the earliest challenges was building on ClickHouse Local’s support for querying data from various data sources with zero performance loss. One thing to keep in mind is that all input and output with ClickHouse Local is done through file descriptors, as shown in the diagram below.

This isn’t a problem for ClickHouse Local, but it is a problem for an in-process engine like chDB, where we want to work directly with data read or generated by libraries like Pandas, NumPy, or PyArrow.
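To make that cost concrete, here is a minimal, standard-library-only sketch of what it means to hand in-memory data to a consumer that only accepts file descriptors. It is not chDB or ClickHouse code: the function name `roundtrip_via_fd` is hypothetical, and CSV stands in for whatever serialization format a real engine would use. The point is only that the data has to be serialized, copied, and parsed again, even though it already lived in the same process.

```python
import csv
import io
import os
import tempfile


def roundtrip_via_fd(rows, fieldnames):
    """Hypothetical helper: pass rows to a 'consumer' through a file descriptor."""
    # Step 1: serialize the in-memory objects into a byte stream.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    payload = buf.getvalue().encode("utf-8")

    # Step 2: copy those bytes into a file descriptor.
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)

        # Step 3: the consumer reads the file and parses everything again.
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    finally:
        os.remove(path)


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
result = roundtrip_via_fd(rows, ["id", "name"])
print(result)
```

In this sketch the integers come back as strings, because CSV carries no type information. That is a side effect of the stand-in format rather than a claim about chDB, but it shows how a serialization boundary adds both work and potential loss that direct in-memory access avoids.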

The initial version of chDB was designed with simplicity in mind. To handle in-memory data, taking a Pandas DataFrame as an example, the initial version of chDB’s DataFrame querying was implemented as follows:
