Sedic: Privacy-aware data intensive computing on hybrid clouds

Kehuan Zhang; Xiaoyong Zhou; Yangyi Chen; Xiao Feng Wang; Yaoping Ruan

doi:10.1145/2046707.2046767

CCS 2011

Conference paper

14 Nov 2011



Abstract

The emergence of cost-effective cloud services offers organizations a great opportunity to reduce their costs and increase productivity. This development, however, is hampered by privacy concerns: a significant amount of organizational computing workload at least partially involves sensitive data and therefore cannot be directly outsourced to the public cloud. The scale of these computing tasks also renders existing secure outsourcing techniques less applicable. A natural solution is to split a task, keeping the computation on the private data within an organization's private cloud while moving the rest to the public commercial cloud. However, this hybrid cloud computing is not supported by today's data-intensive computing frameworks, MapReduce in particular, which forces users to manually split their computing tasks. In this paper, we present a suite of new techniques that make such privacy-aware data-intensive computing possible. Our system, called Sedic, leverages the special features of MapReduce to automatically partition a computing job according to the security levels of the data it works on, and arrange the computation across a hybrid cloud. Specifically, we modified MapReduce's distributed file system to strategically replicate data, moving sanitized data blocks to the public cloud. Over this data placement, map tasks are carefully scheduled to outsource as much workload to the public cloud as possible, provided that sensitive data always stays on the private cloud. To minimize inter-cloud communication, our approach also automatically analyzes and transforms the reduction structure of a submitted job to aggregate the map outcomes within the public cloud before sending the result back to the private cloud for the final reduction. This also allows users to interact with our system in the same way they work with MapReduce, and to run their legacy code directly in our framework.
We implemented Sedic on Hadoop and evaluated it using both real and synthesized computing jobs on a large-scale cloud test-bed. The study shows that our techniques effectively protect sensitive user data, offload a large amount of computation to the public cloud and also fully preserve the scalability of MapReduce. © 2011 ACM.
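
The split described in the abstract can be illustrated with a small, self-contained sketch. This is not Sedic's code and does not use Hadoop; it assumes a word-count job whose reduction is associative and commutative, and the names `Block`, `map_block`, `combine`, and `run_hybrid` are hypothetical. It shows map work on non-sensitive blocks running on a simulated "public" side, the public results being aggregated there, and only that aggregate crossing over for the final reduction on the "private" side.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical block representation: (text, is_sensitive).
Block = Tuple[str, bool]


def map_block(text: str) -> Counter:
    """Map step: count the words in a single data block."""
    return Counter(text.split())


def combine(partials: Iterable[Counter]) -> Counter:
    """Reduce step. Summing counts is associative and commutative,
    so it can safely be applied in stages (public first, then private)."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_hybrid(blocks: List[Block]) -> Tuple[Counter, int]:
    """Simulate the hybrid split. Returns the final counts and the
    number of blocks whose map work was offloaded to the public side."""
    # Placement: only non-sensitive blocks are eligible for the public cloud.
    public_blocks = [text for text, sensitive in blocks if not sensitive]
    private_blocks = [text for text, sensitive in blocks if sensitive]

    # Public side: map, then aggregate locally so that a single
    # partial result (not every map output) crosses the cloud boundary.
    public_partial = combine(map_block(t) for t in public_blocks)

    # Private side: map the sensitive blocks, then do the final reduction.
    private_partial = combine(map_block(t) for t in private_blocks)
    final = combine([public_partial, private_partial])
    return final, len(public_blocks)


if __name__ == "__main__":
    demo = [("a b a", False), ("secret a", True), ("b c", False)]
    counts, offloaded = run_hybrid(demo)
    print(dict(counts), offloaded)
```

The staged reduction is only valid because the combining function is associative and commutative; a job without that property could not be pre-aggregated on the public side in this simple way.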
