CLOUD 2021
Short paper

A Holistic Approach to Data Access for Cloud-Native Analytics and Machine Learning



Cloud providers offer a variety of storage solutions for hosting data that differ in both price and performance. For analytics and machine learning applications, object storage services are the go-to solution, since they remain relatively inexpensive for hosting datasets that exceed tens of gigabytes in size. However, this choice degrades the performance of these applications and requires extra engineering effort, in the form of code changes, to access the data on remote storage. We demonstrate that accessing data from inexpensive cloud object storage for deep learning training wastes computational resources, and that the waste is even more pronounced when powerful accelerators are employed. Prior work offers solutions that mitigate this performance degradation, but they are limited to specific infrastructures or frameworks. In this paper, we present a generic end-to-end solution that offers seamless data access for remote object storage services, transparent data caching within the compute infrastructure, and data-aware topologies that boost the performance of applications deployed in Kubernetes. For our evaluation, we introduce a custom cache implementation that meets all the requirements of this solution. We demonstrate that our holistic solution yields up to a 48% improvement for the Spark implementation of the TPC-DS benchmark and up to a 191% improvement for the training of deep learning models from the MLPerf benchmark suite.


05 Sep 2021

