How we solved RevenueCat’s biggest challenges on data ingestion into Snowflake

At RevenueCat, we replicate our production data to Snowflake (a cloud-based data warehouse) for data exports, reports, charts and other needs.

In this blog, I’ll delve into the intricacies of our data management practices, specifically focusing on the journey of our data from its origins to its final destination in Snowflake. We’ll explore the challenges we faced, the solutions we devised, and the insights we gained through the process of optimizing our data ingestion pipeline.

For a while we used commercial solutions to replicate the Aurora PostgreSQL WAL to Snowflake, but we ran into many problems: TOAST fields were not handled properly, schema evolution was flaky, our volume was too large, inconsistencies were hard to diagnose, and the general opaqueness of the system made issues difficult to debug.

We decided to experiment with replacing it with a more standard Debezium + Kafka + S3 pipeline to capture the changes, and to find our own way to load the data from S3 into Snowflake. Why didn’t we choose the Snowflake Connector for Kafka? We wanted to build a data lake of tool-agnostic Parquet files that could also serve other purposes and other tools, such as Spark.
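To make the “agnostic Parquet” point concrete, here is a minimal sketch of what consuming those files from Spark could look like. The bucket name and path layout are hypothetical placeholders, not our actual setup, and it assumes a Spark installation with the S3A connector available:

```python
# Minimal sketch: reading CDC Parquet files from the S3 data lake with PySpark.
# The bucket name and prefix below are hypothetical, not our real layout.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cdc-parquet-example")
    .getOrCreate()
)

# The change events land as plain Parquet files under a per-table prefix,
# so any engine that speaks Parquet can read them directly.
changes = spark.read.parquet("s3a://example-data-lake/cdc/public.subscriptions/")

changes.printSchema()
print(changes.count())
```

Because nothing in this layout is specific to Snowflake, the same files can feed Snowflake loads, ad-hoc Spark jobs, or any future tool that understands Parquet.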
