Publication
CASCON 2024
Poster
Modelling Performance and Energy Efficiency of AI and HPC Workloads in Heterogeneous Environments
Abstract
Rapid advancements in artificial intelligence (AI) rely on heterogeneous high-performance computing (HPC) resources. HPC systems are expensive and power-hungry, and are typically used for both AI and traditional HPC workloads. Being able to model and predict the performance and energy efficiency characteristics of these workloads in such complex environments is of paramount importance for the co-design of cost-efficient and sustainable applications and systems. We present our ongoing work on modelling the performance and energy efficiency of AI and HPC workloads in heterogeneous environments using a data-driven approach. We are developing a software toolkit that collects runtime performance and power consumption metrics of a workload, combines them with information about system and application hyperparameters, and uses a deep learning (DL) regression model to predict and optimize the workload's performance and energy efficiency. The toolkit is based on several open-source technologies and is designed for deployment on hybrid cloud. It offers several unique capabilities compared to other efforts in the area. Some preliminary results on modelling performance and energy efficiency of HPC and AI workloads are included.