Data-at-rest security for Spark
Apache Spark enables fast computation and greatly accelerates analytics applications by efficiently using main memory and caching data for later use. At its core, Apache Spark uses data structures called RDDs (Resilient Distributed Datasets) to give a unified view of distributed data. However, the data represented in RDDs remains unencrypted, which can leak confidential data produced or processed by applications. Apache Spark persists unencrypted RDDs to disk storage under various circumstances, including caching, RDD checkpointing, and data spill during shuffle operations. This lack of security makes Apache Spark unsuitable for processing sensitive information that must be secured at all times. Moreover, RDDs stored in main memory are vulnerable to main-memory attacks such as RAM scraping. In this paper, we propose and develop solutions to close these security gaps in the current Apache Spark framework. We present three different approaches to incorporating security into the Apache Spark framework. These approaches are designed to limit the exposure of unencrypted data during data processing, caching, and data spill to disk. We use a combination of cryptographic splitting and encryption to secure data stored and spilled by Apache Spark, both on disk and in main memory. Our approaches provide strong security by combining an Information Dispersal Algorithm (IDA) with Shamir's Perfect Secret Sharing (PSS). Extensive experimentation shows that, with appropriately chosen parameters, our security approaches provide high security at a performance penalty of 10% to 25%.
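To make the secret-sharing component concrete, the following is a minimal sketch of Shamir's (k, n) threshold scheme over a prime field. It is an illustration only, not the implementation evaluated in this paper: the field modulus PRIME and the function names split_secret and recover_secret are assumptions chosen for the example, and the secret is treated as a single integer (such as a key) rather than RDD data.

```python
import secrets

# Assumed field modulus for this sketch: the Mersenne prime 2^127 - 1.
PRIME = 2**127 - 1


def split_secret(secret, n, k):
    """Split an integer secret into n shares; any k of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must lie in [0, PRIME)")
    if not 1 <= k <= n:
        raise ValueError("threshold must satisfy 1 <= k <= n")
    # Random polynomial of degree k - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        # Horner evaluation of the polynomial at x, modulo PRIME.
        for c in reversed(coeffs):
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # Modular inverse of den via Fermat's little theorem.
        inv_den = pow(den, PRIME - 2, PRIME)
        secret = (secret + yi * num * inv_den) % PRIME
    return secret
```

With split_secret(key, n=5, k=3), any three of the five shares passed to recover_secret yield the original key, while fewer than three shares reveal nothing about it. In a hybrid design of the kind described above, one plausible arrangement is to share only a short encryption key this way and disperse the bulk ciphertext with an IDA, which keeps storage overhead low; whether the paper's approaches follow exactly this arrangement is not stated in the abstract.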