Publication
BDC 2015
Conference paper
A Unified Computation Engine for Big Data Analytics
Abstract
Large enterprises now maintain huge amounts of data across multiple backend systems, including traditional database systems and the more recently popular big data systems. Telecom providers, for example, keep key business data (e.g., billing information) in database systems, whereas their large volumes of log data reside on HDFS with Hive. Providing insightful analytics over such data is a challenging task. Traditional enterprise data warehouse systems, which rely on careful database design, cannot meet data scientists' need for agile, arbitrary access to any useful data (such as the log data). In this paper, we propose Octopus, a unified computation engine for big data analytics that effectively and efficiently bridges data scientists and the data warehouse. First, Octopus provides a SQL-like approach that unifies database queries and machine learning algorithms. Next, Octopus reduces the running time of such big data analytic tasks by scheduling optimal subtasks to the backend systems. A proof-of-concept prototype shows that Octopus can run much faster than Spark. For example, Octopus outperforms the recent Spark 1.4.0, running 4.58× faster on a complex analytic task and 5.25× faster on a simple aggregation query.