Publication
NeurIPS 2023
Workshop

PALM: Adaptive Resource Allocation for Datacenter Power Capping

Abstract

Energy efficiency is a pressing concern in today’s cloud datacenters. Various power management strategies, such as oversubscription, power capping, and dynamic voltage and frequency scaling, have been proposed and are used by datacenter operators to better control power consumption at any management unit (e.g., node level or rack level) without exceeding power budgets. In addition, by gaining more control over different management units within a datacenter (or across datacenters), operators can shift energy consumption spatially or temporally to reduce their carbon footprint based on the spatio-temporal patterns of carbon intensity. The drive for automation has led to the exploration of learning-based resource management approaches. In this work, we first systematically investigate the impact of power capping on datacenter workloads and on learning-based resource management solutions (i.e., reinforcement learning, or RL). We show that power capping leads to an 18% degradation in resource management effectiveness (as defined by an RL reward function) and thus to 50% higher application latency. We then propose PALM, an adaptive resource allocation framework that provides a graceful, performance-preserving transition under power capping for latency-critical workloads. Evaluation results show that PALM achieves a 10.2–99.3% improvement in SLO preservation under power capping while saving 3.1–5.8% utilization.