A Holistic Approach to Data Access for Cloud-Native Analytics and Machine Learning
- Panos Koutsovasilis
- Srikumar Venugopal
- et al.
- 2021
- CLOUD 2021
Seamless access to data is critical for both foundation models and big science workflows. While Kubernetes has simplified application development and deployment in the cloud, specifying the location of data and its access permissions, and ensuring the required data access performance, are still pain points for users.
To meet these challenges, we developed Datashim, a Kubernetes framework that provides abstractions for frictionless and performant data access. Datashim is an LF AI & Data incubation project and was developed as part of the EVOLVE project, funded by the EU’s Horizon 2020 (H2020) research program.
Cloud deployments typically store data in object storage buckets, whose access methods differ substantially from the familiar concepts of files and directories. However, many analytics tools still rely on files and directories, which makes it difficult for them to use object storage.
Datashim solves this problem by providing a new abstraction called a Dataset. Workflows can declare how they want to consume the data in the Dataset, for example as files and directories, and Datashim will make it available that way even if the source is a bucket.
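To make the Dataset idea concrete, the following Python sketch builds two Kubernetes manifests as plain dictionaries: a Dataset that points at a bucket, and a pod that asks to consume it as a mounted directory. The field names (`apiVersion` `datashim.io/v1alpha1`, `spec.local`, `secret-name`, and the `dataset.0.id` / `dataset.0.useas` pod labels) follow the Datashim project documentation as best recalled and should be treated as assumptions to verify against the current release. The helper functions, bucket, endpoint, secret, and image names are hypothetical placeholders.

```python
import json


def dataset_manifest(name, bucket, endpoint, secret_name, read_only=True):
    """Build a Dataset custom-resource manifest as a plain dict.

    Assumption: field names mirror the Datashim Dataset CRD for an
    S3-compatible bucket; check them against the project documentation.
    """
    return {
        "apiVersion": "datashim.io/v1alpha1",
        "kind": "Dataset",
        "metadata": {"name": name},
        "spec": {
            "local": {
                "type": "COS",               # S3-compatible object storage
                "secret-name": secret_name,  # credentials stay in a Secret
                "endpoint": endpoint,
                "bucket": bucket,
                "readonly": "true" if read_only else "false",
            }
        },
    }


def pod_manifest(name, image, dataset_name):
    """Build a pod manifest that requests the Dataset as a mounted directory.

    Assumption: the pod selects the Dataset through labels rather than by
    declaring volumes itself.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "labels": {
                "dataset.0.id": dataset_name,
                "dataset.0.useas": "mount",  # consume as files and directories
            },
        },
        "spec": {"containers": [{"name": name, "image": image}]},
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    ds = dataset_manifest(
        name="example-dataset",
        bucket="genomics-input",
        endpoint="https://s3.example.org",
        secret_name="bucket-credentials",
    )
    pod = pod_manifest("analysis", "example/analysis:latest", "example-dataset")
    # JSON is valid YAML, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(ds, indent=2))
    print(json.dumps(pod, indent=2))
```

In this sketch the pod never sees the bucket credentials or declares a volume: it only names the Dataset, which is consistent with the reduced credential proliferation described in this abstract.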
This can greatly simplify data pipelines. For example, a team from the European Bioinformatics Institute had a genomic analysis pipeline that used two filesystems, object storage, and a number of PVCs, all of which had to be managed manually. By upgrading their pipeline to use Datashim, they avoided manual management of PVCs and needed only object storage. This also improved their I/O performance, and it improved security by reducing the proliferation of bucket access credentials.
Other projects such as Renku.io and Machine Learning Exchange have also adopted the Dataset abstraction to simplify data access in Jupyter notebooks. We are now applying Datashim to simplify data management in multi-tenant foundation model training deployments.
In this project, we also investigated improving data access performance by transparently provisioning and managing caches for object storage data in Kubernetes clusters. Coupled with intelligent scheduling, we found that this led to a 50% improvement in Spark SQL query performance and a 190% improvement in training deep learning networks, without needing any modification to deployments.