Disaggregated RDDs: Extending and Analyzing Apache Spark for Memory Disaggregated Infrastructures
Abstract
As the demand for scalable data analytics grows, Apache Spark has become essential to large-scale data processing. Memory accounts for a significant portion of server cost, so the under-utilization and fragmentation of resources pose a substantial challenge for data center operators who rely on economies of scale. Memory disaggregation addresses these challenges by leveraging remote memory pools to reduce resource fragmentation and under-utilization. These advantages come at a cost: disaggregated memory has higher access latency and lower bandwidth than local memory, which can significantly increase job execution time. Careful optimization and management strategies are therefore needed to balance the trade-off between resource utilization and performance. This paper introduces Cache-Remote, a custom Apache Spark configuration that balances the benefits of memory disaggregation against execution efficiency. Cache-Remote uses remote memory for RDD caching and local memory for latency-sensitive tasks. We present a comprehensive evaluation of different memory allocation policies and Spark configurations on a hardware setup designed for memory disaggregation. We expand upon prior work by exploring a range of solutions that cater to varying tolerances for job completion latency, introducing new points on the Pareto frontier of latency versus memory usage. Notably, Cache-Remote improves the efficiency of current disaggregated memory allocation strategies: it reduces local memory utilization by up to 24.8% while incurring an execution time overhead of only 7% compared to local-only policies.
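To make the placement rule concrete, the following is a minimal illustrative sketch, not the paper's implementation. It models only the policy stated above (cached RDD data goes to remote memory, everything else stays local). All names here (`Purpose`, `Pool`, `place`, `remote_fraction`) are hypothetical and do not correspond to Spark APIs, and the sizes in the usage note are made-up inputs rather than measured results.

```python
from enum import Enum


class Purpose(Enum):
    """Hypothetical labels for what a memory request is used for."""
    RDD_CACHE = "rdd_cache"   # cached RDD partitions
    EXECUTION = "execution"   # latency-sensitive task memory


class Pool(Enum):
    """The two memory pools assumed in this sketch."""
    LOCAL = "local"
    REMOTE = "remote"


def place(purpose: Purpose) -> Pool:
    """Assumed rule: RDD cache goes remote, all other requests stay local."""
    return Pool.REMOTE if purpose is Purpose.RDD_CACHE else Pool.LOCAL


def remote_fraction(requests):
    """Fraction of requested bytes that the rule places in remote memory.

    `requests` is an iterable of (Purpose, size_in_bytes) pairs.
    """
    requests = list(requests)
    total = sum(size for _, size in requests)
    remote = sum(size for purpose, size in requests
                 if place(purpose) is Pool.REMOTE)
    return remote / total if total else 0.0
```

For example, with one unit of cache and three units of execution memory, `remote_fraction([(Purpose.RDD_CACHE, 1), (Purpose.EXECUTION, 3)])` evaluates to `0.25`, meaning a quarter of the requested bytes would leave local memory under this toy rule.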