We identify the most important queries for a table based on frequency of execution and business value, then identify the common predicates used by tho

Case Study on How to Choose a Clustering Key Part 2

submited by
Style Pass
2021-07-05 15:30:05

We identify the most important queries for a table based on frequency of execution and business value, then identify the common predicates used by those queries. In some of the examples below, instances of “?” are bind variables which could take different values in different queries. For this case study, we looked at the most frequently-executed queries for the table, including the following:

Query 3: This query is used in the experiment of execution against different warehouse sizes and different table sizes. The result is shown in Table 4-2.

Any columns in the predicates can be included in a clustering key. In the above queries, the most frequently included predicates are as follows:

The EVENTS_BIG table contains a total number of 155B rows with 24TB of data and 1.5 million micro-partitions. Query 3 above is used to demonstrate the process of choosing clustering key columns since query 3 contains the most common predicates. In Query 3, the following predicates are used:

The chart below shows the data distribution of the candidate columns, including the number of distinct values in each column and the number of rows remaining after each predicate is applied as specified in Query 3:

Leave a Comment