
Lightning-fast aggregations by distributing DuckDB across AWS Lambda functions


DuckDB is rapidly changing the way data scientists and engineers work. Its efficient, internally parallelised architecture means that a single querying node often out-competes entire clusters of more traditional query engines.
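To make that concrete, here is a minimal sketch of the kind of aggregation DuckDB parallelises across all local CPU cores on a single node. The Parquet path and column names are illustrative placeholders, not taken from the post.

```python
# Minimal sketch: a single-node DuckDB aggregation over local Parquet files.
# The dataset path and columns below are hypothetical placeholders.
import duckdb

con = duckdb.connect()  # in-memory database

# DuckDB parallelises the scan and the GROUP BY across all available cores.
rows = con.execute(
    """
    SELECT passenger_count, AVG(total_amount) AS avg_amount
    FROM read_parquet('taxi/*.parquet')
    GROUP BY passenger_count
    ORDER BY passenger_count
    """
).fetchall()

for passenger_count, avg_amount in rows:
    print(passenger_count, avg_amount)
```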

But out of the box, DuckDB runs on a single node, so performance is ultimately limited by that node's hardware. The typical way of solving this in a production environment is to scale your DuckDB infrastructure up or out: increasing the size of your querying instance, or distributing concurrent query jobs across separate hardware. Traditionally, scaling up or out is a slow and expensive task that involves waiting for an upstream provider to allocate resources and start new servers.

The time and cost of scaling DuckDB infrastructure is an acceptable trade-off for many analytical workloads, particularly if you are generating and pre-caching analytics for later consumption. But that approach only suits some businesses in 2023: users now expect more than rigid, pre-computed analytics. They want to dynamically filter, generate trends over long periods, and analyse data in real time.

Serverless computing presents an opportunity to solve both the cost and the cold-start problem. Individual AWS Lambda instances are relatively small and underpowered (up to 6 CPU cores, 10GB RAM, and roughly 0.5Gbps of network throughput), but they are billed per millisecond and carry almost no time or cost overhead when starting, pausing, and resuming.
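As a rough sketch of how the fan-out could look, the handler below runs DuckDB inside a single Lambda invocation, aggregates one Parquet partition read directly from S3, and returns partial results for a coordinator to merge. The event shape, bucket, and column names are assumptions for illustration, not the post's actual implementation.

```python
# Hedged sketch: one Lambda worker aggregates one Parquet partition on S3 and
# returns mergeable partial aggregates. Bucket, key, and columns are assumed.
import json
import duckdb

def handler(event, context):
    # The coordinator tells this worker which partition (S3 key) to scan.
    partition_key = event["partition_key"]            # e.g. "data/part-0001.parquet"
    bucket = event.get("bucket", "my-analytics-bucket")

    con = duckdb.connect()
    # In a real deployment the httpfs extension would be pre-bundled with the
    # function package; S3 credentials from the Lambda execution role are
    # assumed to be configured for DuckDB's S3 settings (omitted here).
    con.execute("INSTALL httpfs; LOAD httpfs;")

    # Partial aggregation: SUM and COUNT can be merged across workers to
    # produce a global AVG without shipping raw rows back to the coordinator.
    rows = con.execute(
        f"""
        SELECT category,
               SUM(amount) AS sum_amount,
               COUNT(*)    AS row_count
        FROM read_parquet('s3://{bucket}/{partition_key}')
        GROUP BY category
        """
    ).fetchall()

    return {
        "statusCode": 200,
        "body": json.dumps(
            [{"category": c, "sum": s, "count": n} for c, s, n in rows]
        ),
    }
```

Because each worker returns mergeable partial aggregates (sums and counts rather than raw rows), a coordinator can fan out hundreds of these invocations over the partitions of a dataset and combine their responses into one global result, which is exactly where per-millisecond billing and near-instant startup pay off.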
