In-Memory Analytics for Kafka using DuckDB

2024-07-11

Today’s data scientist has a plethora of options for processing data in Apache Kafka. For stream processing, popular options are Kafka Streams, ksqlDB, and Apache Flink. For real-time analytics, one can choose from Apache Druid, ClickHouse, Apache Pinot, and Apache Doris. All of these systems support distributed processing, and so can scale to very large volumes of data. However, their distributed nature can also lead to operational complexity.

Often a data scientist wants to explore data quickly, to get a better understanding of it, without setting up a lot of infrastructure. This has led to the emergence of embedded OLAP databases, which support in-process exploratory data analysis. Databases in this category include Apache DataFusion, chdb, and DuckDB. DuckDB in particular has become especially popular, due to its ease of use and rich feature set, which includes zero-dependency in-process operation, vectorized query execution, direct querying of Parquet, CSV, and JSON files, and an extensive type system.

The most popular formats for storing data in Kafka are Apache Avro, Protobuf, and JSON, all of which are supported by the Confluent Schema Registry. All of these formats support composing types using records, arrays, and unions. Unions are particularly interesting in the context of Kafka, because they can be used to support multiple event types in the same topic. However, while relational databases often support the record type (usually called a row type) and the array type, they usually do not support the union type. DuckDB is the only database I know of that supports the union type, which makes it a natural target for storing and analyzing data that originates in Avro, Protobuf, or JSON format.
