
Amazon Redshift Re-invented

submitted by

Style Pass

2022-09-26 19:00:06

This paper (SIGMOD'22) discusses the evolution of Amazon Redshift since its launch in 2013. Redshift is a cloud data warehouse. A data warehouse is basically a place where analysis/querying/reporting is done over a shitload of data coming from multiple sources. Tens of thousands of customers use Redshift to process exabytes of data daily. Redshift is fully managed, which makes it simple and cost-effective to analyze BIG data efficiently.

Being in the data warehouse business, Redshift is a column-oriented massively parallel processing system. It has a very simple architecture composed of a compute layer and a storage layer. A Redshift cluster consists of a single coordinator (leader) node, which acts as the entrance to the system, and multiple worker (compute) nodes. Data is stored on Redshift Managed Storage (RMS, hmm, does this abbreviation make it GNU-RMS-Redshift?). RMS is backed by Amazon S3, and data is cached on the compute nodes' locally attached SSDs in a compressed column-oriented format. Tables are either replicated on every compute node or partitioned into multiple buckets that are distributed among all compute nodes. The partitioning can be derived automatically by Redshift from workload patterns and data characteristics, or users can explicitly specify the partitioning style: round-robin, or hash based on the table's distribution key. AQUA is a query acceleration layer that leverages FPGAs to improve performance. Compilation-As-A-Service (CaaS) is a caching microservice that stores optimized generated code for the various query fragments executed across the Redshift fleet.
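To make the three table placement options concrete, here is a minimal Python sketch of replication, round-robin partitioning, and hash partitioning on a distribution key. Everything in it is an assumption for illustration: the function names, the four-node cluster size, and the use of CRC32 as the hash are made up here and say nothing about how Redshift implements this internally.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size, chosen only for the example


def replicate(rows, num_nodes=NUM_NODES):
    """Give every compute node its own full copy of the table."""
    return [list(rows) for _ in range(num_nodes)]


def distribute_round_robin(rows, num_nodes=NUM_NODES):
    """Deal rows out to the nodes in turn, ignoring row contents."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def distribute_by_hash(rows, dist_key, num_nodes=NUM_NODES):
    """Send each row to the node picked by hashing its distribution key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        # CRC32 is a deterministic stand-in hash, not Redshift's actual function.
        h = zlib.crc32(str(row[dist_key]).encode("utf-8"))
        nodes[h % num_nodes].append(row)
    return nodes
```

Round-robin spreads rows evenly regardless of their values, while hash partitioning guarantees that all rows sharing a distribution key value land on the same node, which is what lets a join on that key run without moving data between nodes.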

Code generation is at the core of the system. SQL statements are translated into efficient C++ code, which is compiled and then sent to the worker (compute) nodes for execution.