Conference paper

Dynamically optimizing queries over large scale data platforms


Enterprises are adapting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of data, comprising both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time, suboptimal query plans can be disastrous with large datasets. In this paper, we propose new techniques that take into account UDFs and correlations between relations for optimizing queries running on large-scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach, in which plans evolve as parts of the queries get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2× (4×) better for Jaql (Hive) than, the best hand-written left-deep query plans. © 2014 ACM.
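
The idea behind a pilot run can be illustrated with a minimal sketch. This is not the paper's implementation: it is a toy, single-machine illustration, and the names `estimate_selectivity`, `is_suspicious`, and the `orders` data are all hypothetical. It assumes the optimizer can treat a UDF as a black-box predicate and measure how many sampled rows it keeps.

```python
import random


def estimate_selectivity(rows, udf, sample_size=1000, seed=0):
    """Estimate the fraction of rows that pass `udf` by running it on a random sample.

    Hypothetical helper for illustration; `udf` is treated as an opaque predicate.
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    if not sample:
        return 0.0
    passed = sum(1 for row in sample if udf(row))
    return passed / len(sample)


def is_suspicious(order):
    """Stand-in for business logic the optimizer cannot inspect (assumed UDF)."""
    return order["amount"] > 900 and order["region"] == "EU"


# Synthetic data, made up for this example only.
orders = [
    {"amount": i % 1000, "region": "EU" if i % 4 == 0 else "US"}
    for i in range(100_000)
]

# Pilot run: apply the UDF to a sample instead of the full relation.
selectivity = estimate_selectivity(orders, is_suspicious)

# The estimate can then feed a cost model, e.g. as an expected output size.
estimated_output_rows = selectivity * len(orders)
```

In this sketch the sampled estimate approximates the true fraction of qualifying rows at a small share of the cost of a full scan; a cost-based optimizer could use `estimated_output_rows` when comparing candidate join orders.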