About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
Euro-PAR 2024
Conference paper
Automated Data Management and Learning-based Scheduling for Ray-based Hybrid HPC-Cloud Systems
Abstract
HPC-Cloud hybrid systems are gaining popularity among scientists for their ability to manage sudden demand spikes, resulting in faster turnaround times for HPC workloads. However, deploying workloads on such systems currently requires complicated configuration, particularly for data migration across HPC clusters and Cloud. Additionally, existing schedulers lack support for workload scheduling on such hybrid systems. To address these issues, we have designed and implemented an HPC-Cloud bursting system based on Ray, an open-source distributed framework. Our system integrates automated data management with learning-based scheduling at the function level, using a dynamic label-based design. It automatically prefetches data files based on demand and detects data movement and execution patterns for future scheduling decisions. The developed framework is evaluated with two workloads: machine learning model training and image processing. We compare its performance against naive data fetching under various network speeds and storage locations. Results indicate the effectiveness of our system across all scenarios. The system is open-sourced. The source code and replication packages for reproducing experimental results are provided.