
Databloom Engineering: Towards a Learning-based Query Optimizer

submitted by
Style Pass
2022-05-14 09:00:08

Query optimization is at the core of any data management/analytics system. It is the process of determining the best way to execute an input query or task (i.e., the execution plan). Query optimization is composed of three sub-processes: (i) the enumeration of the different execution plans, (ii) the cost estimation of each subplan, required to determine which plan is the best, and (iii) the cardinality estimation of subplans (i.e., how many tuples a subplan will output), which is crucial because it affects the cost of the plan. Recent research in data management has begun to leverage machine learning (ML) to solve these tasks more effectively and efficiently. In this blog post, we will focus on using ML for estimating the cost of subplans.
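To see how the three sub-processes fit together, here is a minimal sketch in Python of a toy optimizer that picks a left-deep join order. It is not Blossom's implementation: the table names, row counts, selectivities, and the cost formula (rows read plus rows written) are all made-up assumptions chosen only for illustration.

```python
from itertools import permutations

# Hypothetical base-table cardinalities and join selectivities (made-up numbers).
TABLE_ROWS = {"orders": 1_000_000, "customers": 50_000, "nations": 25}
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 50_000,
    frozenset({"customers", "nations"}): 1 / 25,
}


def join_selectivity(joined, new_table):
    """Combined selectivity of all predicates linking new_table to the joined tables.

    Pairs without a predicate get selectivity 1.0, i.e. a cross product.
    """
    sel = 1.0
    for table in joined:
        sel *= SELECTIVITY.get(frozenset({table, new_table}), 1.0)
    return sel


def estimate(order):
    """Return (total cost, output cardinality) for a left-deep join order."""
    joined = [order[0]]
    card = TABLE_ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        # (iii) cardinality estimation of the subplan
        out = card * TABLE_ROWS[table] * join_selectivity(joined, table)
        # (ii) toy cost estimation: read both inputs, write the output
        cost += card + TABLE_ROWS[table] + out
        joined.append(table)
        card = out
    return cost, card


def best_plan(tables):
    # (i) plan enumeration: every left-deep join order
    return min(permutations(tables), key=lambda order: estimate(order)[0])


if __name__ == "__main__":
    for order in permutations(TABLE_ROWS):
        cost, card = estimate(order)
        print(order, f"cost={cost:,.0f}", f"rows={card:,.0f}")
    print("chosen:", best_plan(list(TABLE_ROWS)))
```

Under these assumed numbers, the cheapest orders join the two small tables first and bring in the large `orders` table last. Every order yields the same final cardinality estimate, but the intermediate cardinalities, and therefore the costs, differ widely. This is why errors in cardinality estimation propagate directly into plan choice.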

Traditional optimizers come with a cost model: mathematical formulas that encode the cost of each operator and aggregate these costs to estimate the cost of a query plan. However, coming up with a cost model in a federated setting, such as the one Blossom is built for, is not only very challenging but may also lead to suboptimal performance. There are several reasons for that: (i) cost-based optimizers assume linear functions, which do not capture the real system behaviour, (ii) they require access to statistics stored on the individual platforms, which may not be possible, and (iii) they need fine-tuning to really model the system behaviour, which can be very time-consuming yet is very important. The plot below shows up to an order of magnitude better performance with a well-tuned cost model.
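To make the notion of a hand-written cost model concrete, here is a minimal sketch in Python of the linear kind described above: each operator's cost is a per-row coefficient times its input size plus a fixed startup cost, and the plan cost is the sum over all operators. The `Operator` class, the operator names, and the coefficients are hypothetical and are not taken from Blossom or any real optimizer.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str            # key into LINEAR_COEFFS
    input_rows: int      # estimated number of input tuples
    children: tuple = () # child operators feeding this one


# Hypothetical coefficients: (cost per input row, fixed startup cost).
# In a real system these are the values that need tuning per platform.
LINEAR_COEFFS = {
    "scan": (1.0, 10.0),
    "filter": (0.5, 0.0),
    "sort": (2.0, 50.0),
}


def operator_cost(op):
    """Linear cost formula for a single operator."""
    per_row, startup = LINEAR_COEFFS[op.name]
    return per_row * op.input_rows + startup


def plan_cost(op):
    """Aggregate operator costs over the whole plan tree by summing them."""
    return operator_cost(op) + sum(plan_cost(child) for child in op.children)


if __name__ == "__main__":
    # sort(filter(scan)): scan 1,000 rows, filter keeps an assumed 10 percent.
    scan = Operator("scan", 1_000)
    filt = Operator("filter", 1_000, (scan,))
    plan = Operator("sort", 100, (filt,))
    print(plan_cost(plan))
```

The sketch also shows where such a model struggles. A comparison-based sort grows as n log n rather than linearly, so a single per-row coefficient for `sort` can only be accurate near the input size it was tuned for, and every coefficient has to be recalibrated for each platform.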
