OSSEU 2023

Greening the AI Cloud: Validating Power Models for Kubernetes Containers



Were you aware that Machine Learning (ML) is responsible for up to 40% of data center energy consumption? ML model training and inference are computationally intensive, and their demand may grow exponentially in the coming years. Understanding ML power consumption is therefore critical, especially given growing government and industry efforts to reduce greenhouse gas emissions and the rising popularity of foundation model systems. Observability of ML power consumption is crucial for optimizing power provisioning, capping, and tuning in data centers. In this presentation, we will introduce the Kepler framework, which estimates power consumption at the process, container, Kubernetes pod, and job levels. Kepler can be composed of a set of power models suited to different scenarios, such as different architectures and available metrics. We will demo how to create a power model for your own environment and detail how to validate its accuracy.


19 Sep 2023

