Joins for hybrid warehouses: Exploiting massive parallelism in Hadoop and enterprise data warehouses

Yuanyuan Tian; Tao Zou; Fatma Özcan; Romulo Goncalves; Hamid Pirahesh

doi:10.5441/002/edbt.2015.33

EDBT 2015

Conference paper

23 Mar 2015

Abstract

HDFS has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. Many applications require correlating data stored in HDFS with EDW data, such as an analysis that associates click logs stored in HDFS with sales data stored in the database. All existing solutions reach out to HDFS and read the data into the EDW to perform the joins, assuming that the Hadoop side lacks efficient SQL support. In this paper, we show that it is actually better to do most data processing on the HDFS side, provided that we can leverage a sophisticated execution engine for joins on the Hadoop side. We identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables. We utilize Bloom filters to minimize data movement, and exploit the massive parallelism in both systems to the fullest extent possible. We describe a new zigzag join algorithm, and show that it is a robust join algorithm for hybrid warehouses that performs well in almost all cases.
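To illustrate how a Bloom filter can reduce data movement before a join, the following is a minimal single-process sketch in Python. It is not the paper's zigzag join or its distributed implementation. The class `BloomFilter`, the function `bloom_filtered_join`, the filter size, and the hash count are all hypothetical choices made for illustration. The idea shown is generic: build a filter over the join keys of one table, use it to discard non-matching rows of the other table before they would be shipped, then run an exact join to remove any false positives.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; size and hash count are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def bloom_filtered_join(db_rows, hdfs_rows):
    """Join (key, value) rows, pruning hdfs_rows with a Bloom filter first.

    Returns the joined rows and the number of hdfs_rows that passed the
    filter (a stand-in for the rows that would cross the network).
    """
    bf = BloomFilter()
    for key, _ in db_rows:
        bf.add(key)

    # Only rows that pass the filter would need to be moved.
    shipped = [(k, v) for k, v in hdfs_rows if bf.might_contain(k)]

    # An exact hash join removes any Bloom-filter false positives.
    db_index = {}
    for k, v in db_rows:
        db_index.setdefault(k, []).append(v)
    joined = [(k, dv, hv) for k, hv in shipped for dv in db_index.get(k, [])]
    return joined, len(shipped)


if __name__ == "__main__":
    sales = [(1, "sale_a"), (2, "sale_b")]
    clicks = [(1, "click_x"), (3, "click_y"), (2, "click_z"), (1, "click_w")]
    rows, shipped_count = bloom_filtered_join(sales, clicks)
    print(rows, shipped_count)
```

Because a Bloom filter never produces false negatives, the join result is exact; false positives only mean that a few extra rows are shipped and then discarded by the final join.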
