Marcelo Amaral
OSSEU 2023
HPC applications are increasingly utilizing cloud resources due to their cost-effectiveness. Among these resources, spot compute instances present an opportunity to run applications at deep discounts compared to on-demand instances. However, they pose unique challenges for tightly coupled HPC applications because they can be interrupted. Traditional parallel programming models like MPI are not inherently fault-tolerant, and existing methods for handling these interruptions are inefficient and require significant programmer effort. In this paper, we present Charm++ as an alternative solution that natively supports fault tolerance, dynamic load balancing, and resource rescaling. We present a tool that runs Charm++ applications on a mix of on-demand and spot instances and that detects and efficiently handles spot interruptions. We show that using spot instances can result in up to 60% cost savings for our benchmark application.
Marcelo Amaral
OSSEU 2023
Max Bloomfield, Amogh Wasti, et al.
ITherm 2025
Nikoleta Iliakopoulou, Jovan Stojkovic, et al.
MICRO 2025
Ilias Iliadis
International Journal On Advances In Networks And Services