IntegrityMR: Integrity assurance framework for big data analytics and management applications
Abstract
Big data analytics and knowledge management is becoming a hot topic with the emerging techniques of cloud computing and big data computing model such as MapReduce. However, large-scale adoption of MapReduce applications on public clouds is hindered by the lack of trust on the participating virtual machines deployed on the public cloud. In this paper, we extend the existing hybrid cloud MapReduce architecture to multiple public clouds. Based on such architecture, we propose IntegrityMR, an integrity assurance framework for big data analytics and management applications. We explore the result integrity check techniques at two alternative software layers: the MapReduce task layer and the applications layer. We design and implement the system at both layers based on Apache Hadoop MapReduce and Pig Latin, and perform a series of experiments with popular big data analytics and management applications such as Apache Mahout and Pig on commercial public clouds (Amazon EC2 and Microsoft Azure) and local cluster environment. The experimental result of the task layer approach shows high integrity (98% with a credit threshold of 5) with non-negligible performance overhead (18% to 82% extra running time compared to original MapReduce). The experimental result of the application layer approach shows better performance compared with the task layer approach (less than 35% of extra running time compared with the original MapReduce). © 2013 IEEE.