A Unified Computation Engine for Big Data Analytics
Abstract
Large enterprises today maintain huge amounts of data across multiple backend systems, including traditional database systems and the more recently popular big data systems. For example, a telecom provider keeps its key business data (e.g., billing information) in database systems, whereas its huge volume of log data resides on HDFS with Hive. Providing insightful analytics over such data is a challenging task. Traditional enterprise data warehouse systems, with their careful database design, cannot meet the agile requirements of data scientists who need to access arbitrary useful data (such as the log data). In this paper, we propose Octopus, a unified computation engine for big data analytics that effectively and efficiently bridges data scientists and the data warehouse. First, Octopus provides a SQL-like approach that unifies database queries and machine learning algorithms. Next, Octopus optimizes the running time of such big data analytic tasks by scheduling optimal subtasks to the backend systems. A proof-of-concept prototype verifies that Octopus achieves much faster running times than Spark. For example, Octopus is 4.58× faster than the recent Spark 1.4.0 on a complex analytic task and 5.25× faster on a simple aggregation query.