Cast: Tiering storage for data analytics in the cloud

Yue Cheng; M. Safdar Iqbal; Aayush Gupta; Ali R. Butt

doi:10.1145/2749246.2749252

HPDC 2015

Conference paper

15 Jun 2015

Abstract

Enterprises are increasingly moving their big data analytics to the cloud with the goal of reducing costs without sacrificing application performance. Cloud service providers offer their tenants a myriad of storage options, which, while flexible, make the choice of storage deployment nontrivial. Crafting deployment scenarios that leverage these choices cost-effectively, under the unique pricing models and multi-tenancy dynamics of the cloud environment, presents unique challenges in designing cloud-based data analytics frameworks. In this paper, we propose Cast, a Cloud Analytics Storage Tiering solution that cloud tenants can use to reduce monetary cost and improve the performance of analytics workloads. The approach takes the first step towards providing storage tiering support for data analytics in the cloud. Cast performs offline workload profiling to construct job performance prediction models on different cloud storage services, and combines these models with workload specifications and high-level tenant goals to generate a cost-effective data placement and storage provisioning plan. Furthermore, we build Cast++ to enhance Cast's optimization model by incorporating the data reuse patterns and cross-job interdependencies common in realistic analytics workloads. Tests with production workload traces from Facebook and a 400-core Google Cloud based Hadoop cluster demonstrate that Cast++ improves performance by 1.21× and reduces deployment costs by 51.4% compared to a local storage configuration.
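The idea of combining per-tier performance predictions with tenant goals can be illustrated with a toy example. The sketch below is not the optimization model used by Cast or Cast++; it is a simple greedy per-job selection, and every name, price, and throughput figure in it is a hypothetical assumption made for illustration. The runtime function is a stand-in for a profiled prediction model and considers only I/O time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    price_per_gb_hour: float  # hypothetical storage price, not a real quote
    throughput_mb_s: float    # hypothetical profiled throughput


@dataclass(frozen=True)
class Job:
    name: str
    input_gb: float


VM_PRICE_PER_HOUR = 0.5  # hypothetical compute price


def predict_runtime_hours(job: Job, tier: Tier) -> float:
    """Stand-in for a profiled model: assume the job is purely I/O-bound."""
    seconds = job.input_gb * 1024 / tier.throughput_mb_s
    return seconds / 3600


def deployment_cost(job: Job, tier: Tier) -> float:
    """Compute plus storage cost for the predicted duration of the job."""
    hours = predict_runtime_hours(job, tier)
    return hours * (VM_PRICE_PER_HOUR + job.input_gb * tier.price_per_gb_hour)


def place(jobs, tiers, deadline_hours: float) -> dict:
    """Greedily pick, per job, the cheapest tier that meets the deadline."""
    plan = {}
    for job in jobs:
        feasible = [t for t in tiers
                    if predict_runtime_hours(job, t) <= deadline_hours]
        if not feasible:
            raise ValueError(f"no tier meets the deadline for job {job.name}")
        best = min(feasible, key=lambda t: deployment_cost(job, t))
        plan[job.name] = best.name
    return plan
```

Because each job is treated independently, this sketch ignores the data reuse and cross-job dependencies that the abstract says Cast++ accounts for.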
