Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs
Abstract
While Hadoop is the de facto standard big-data middleware, many frameworks have been developed on top of it. Since many SQL-on-Hadoop systems are available, we often consider which engine is best for our queries. We can choose not only query engines but also Java virtual machines (JVMs) as well. As their systems become more complex, however, it is not always true that a single system performs best at any time. Moreover, the performance of a mismatched system may degrade greatly. To exploit the best performance, it is important to know what type of queries are suitable for a system and then to schedule queries for the appropriate system. In this paper, we evaluated the TPC-DS benchmark on a combination of query engines (Spark and Tez) and JVMs (J9 and OpenJDK). We found that using different engines lead to a drawback of over 10 times and that using different JVMs leads to a drawback of 3 times. We also analyzed the characteristics of each combination and then proposed classification models for selecting the best combination of systems with a generated query plan. As a result, we achieved a performance improvement of up to two times in total with the classifier.