Publication
CloudCom 2015
Conference paper
Octopus: Hybrid big data integration engine
Abstract
Large enterprises now maintain huge amounts of data in multiple backend systems, including traditional database systems and the more recently popular big data systems. Telecom providers, for example, keep key business data (e.g., billing information) in database systems, whereas the huge volume of signaling log data resides on HDFS with Hive. Integrating such data and providing consolidated querying and analytics over it is a challenging task. Neither traditional data warehouses nor recent big data systems (e.g., Apache Spark and Hadoop) can fully leverage the power of each backend system. In this paper, we build a hybrid data processing engine, called Octopus, to fully integrate backend systems. Because the data lives in several backend systems, it is distributed across multiple locations, so Octopus focuses on minimizing the amount of data movement. To this end, Octopus proposes a query pushdown technique. A proof-of-concept prototype shows that Octopus can achieve much faster running times than Spark. For example, Octopus processes an aggregation query 5.25x faster than the recent Spark version 1.4.0.
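The general idea of query pushdown can be illustrated with a minimal sketch. This is not Octopus's implementation, which the abstract does not describe. It assumes a single backend simulated with an in-memory SQLite database, and the table name `signaling_log`, its columns, and all function names are hypothetical. The sketch compares how many rows leave the backend when an aggregation runs in the integration layer versus when it is pushed down to the backend.

```python
import sqlite3


def make_backend(rows):
    """Create an in-memory SQLite database standing in for one backend system (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signaling_log (user_id INTEGER, bytes INTEGER)")
    conn.executemany("INSERT INTO signaling_log VALUES (?, ?)", rows)
    return conn


def aggregate_without_pushdown(conn):
    """Pull every raw row out of the backend, then aggregate in the integration layer."""
    rows = conn.execute("SELECT user_id, bytes FROM signaling_log").fetchall()
    totals = {}
    for user_id, nbytes in rows:
        totals[user_id] = totals.get(user_id, 0) + nbytes
    return totals, len(rows)  # len(rows) = rows moved out of the backend


def aggregate_with_pushdown(conn):
    """Send the aggregation to the backend, so only one row per group is moved."""
    rows = conn.execute(
        "SELECT user_id, SUM(bytes) FROM signaling_log GROUP BY user_id"
    ).fetchall()
    return dict(rows), len(rows)


# Hypothetical workload: 3 users with 1000 log records each.
log_rows = [(user_id, 100) for user_id in range(3) for _ in range(1000)]
backend = make_backend(log_rows)

naive_totals, naive_moved = aggregate_without_pushdown(backend)
pushed_totals, pushed_moved = aggregate_with_pushdown(backend)

print("rows moved without pushdown:", naive_moved)
print("rows moved with pushdown:", pushed_moved)
```

Both strategies return the same totals, but the pushed-down version moves one row per group instead of every log record. That reduction in data movement is the kind of saving the abstract attributes to query pushdown.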