Publication
e-Energy 2023
Conference paper
CUFF: A Configurable Uncertainty-driven Forecasting Framework for Green AI Clusters
Abstract
AI applications are driving the need for large dedicated GPU clusters, which are highly energy- and carbon-intensive. To operate these clusters efficiently, operators rely on workload forecasts that inform resource allocation decisions, saving energy without sacrificing performance. Traditional forecasting methods provide a single-point forecast and do not expose the uncertainty of their predictions, which can lead to unexpected performance loss. In this paper, we present an uncertainty-driven GPU demand forecasting framework that exposes the uncertainty in its predictions and provides a mechanism to configure the trade-off between energy savings and performance. We evaluate our approach using multiple GPU workload traces and demonstrate that the forecasting framework, called CUFF, outperforms state-of-the-art point predictions. Under high GPU demand, the CUFF predictor meets performance goals 83% of the time, compared to 7.6% for point predictions. Furthermore, the CUFF knob lets users configure a performance target of up to 98% while providing 26% energy savings, a value comparable to that of point forecasts, which ensure only a 68% performance target.
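The abstract does not describe how CUFF works internally. As a rough illustration of the general idea only, the sketch below shows one common way an uncertainty-aware forecast can expose a configurable knob: provision at a chosen quantile of a predictive distribution rather than at its mean. Everything here is an assumption made for illustration, including the function names `provision` and `point_forecast`, the synthetic Gaussian demand samples, and the nearest-rank quantile rule. None of it is taken from the paper.

```python
import math
import random


def provision(samples, target):
    """Return the smallest sampled capacity covering `target` of the samples.

    Hypothetical helper: `samples` stands in for draws from a predictive
    distribution of GPU demand, and `target` is the performance knob in (0, 1].
    """
    ordered = sorted(samples)
    # Nearest-rank quantile: take the ceil(target * n)-th smallest sample.
    rank = max(1, math.ceil(target * len(ordered)))
    return ordered[rank - 1]


def point_forecast(samples):
    """Single-point forecast (the mean), which ignores the spread of demand."""
    return sum(samples) / len(samples)


def coverage(samples, capacity):
    """Fraction of sampled demand values that `capacity` would satisfy."""
    return sum(1 for s in samples if s <= capacity) / len(samples)


random.seed(0)
# Synthetic predictive samples for next-interval demand, in GPUs (assumed).
samples = [max(0.0, random.gauss(100, 15)) for _ in range(1000)]

print("point forecast:", round(point_forecast(samples), 1))
for target in (0.5, 0.8, 0.98):
    capacity = provision(samples, target)
    print(f"target={target:.2f} capacity={capacity:.1f} "
          f"coverage={coverage(samples, capacity):.3f}")
```

A higher target reserves more GPUs and so leaves less room for energy savings, while a lower target does the opposite. This is the kind of trade-off the abstract says the CUFF knob makes configurable.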