Efficient and flexible index access in MapReduce
Abstract
A popular programming paradigm in the cloud, MapReduce is extensively considered and used for "big data" analysis. Unfortunately, a great many "big data" applications require capabilities beyond those originally intended by MapReduce, often burdening developers to write unnatural non-obvious MapReduce programs so as to twist the underlying system to meet the requirements. In this paper, we focus on a class of "big data" applications that in addition to MapReduce's main data source, require selective access to one or many data sources, e.g., various kinds of indices, knowledge bases, external cloud services. We propose to extend MapReduce with EFind, an Efficient and Flexible index access solution, to better support this class of applications. EFind introduces a standard index access interface to MapReduce so that (i) developers can easily and flexibly express index access operations without unnatural code, and (ii) the EFind enhanced MapReduce system can automatically optimize the index access operations. We propose and analyze a number of index access strategies that utilize caching, re-partitioning, and index locality to reduce redundant index accesses. EFind collects index statistics and performs cost-based adaptive optimization to improve index access efficiency. Our experimental results, using both realworld and synthetic data sets, show that EFind chooses execution plans that are optimal or close to optimal, and achieves a factor of 2x-8x improvements compared to an approach that accesses indices without optimization.