Efficiently querying archived data using hadoop
Abstract
The need to analyze structured data for various business intelligence applications such as customer churn analysis, social network analysis, telecom network monitoring etc., is well known. However, the potential size to which such data will scale in future will make solutions that revolve around data warehouses hard to scale. As data sizes grow the movement of data from the warehouse to archives becomes more frequent. Current file based archive models make the archived data unusable for any type of insight extraction. In this paper, we present an active archival solution for data warehouses that makes use of Hadoop distributed file system (HDFS) to store the data in an always available and cost-effective manner. We investigate various structured data storage schemes within HDFS and empirical evaluations show that a combination of Universal scheme model and column store is best suited for the active archival solution. © 2010 ACM.