Safe limits on voltage reduction efficiency in GPUs: A direct measurement approach
Energy efficiency of GPU architectures has emerged as an important aspect of computer system design. In this paper, we explore the energy benefits of reducing the GPU chip's voltage to the safe limit, i.e. Vmin point. We perform such a study on several commercial off-the-shelf GPU cards. We find that there exists about 20% voltage guardband on those GPUs spanning two architectural generations, which, if "eliminated" completely, can result in up to 25% energy savings on one of the studied GPU cards. The exact improvement magnitude depends on the program's available guardband, because our measurement results unveil a program dependent Vmin behavior across the studied programs. We make fundamental observations about the program-dependent Vmin behavior. We experimentally determine that the voltage noise has a larger impact on Vmin compared to the process and temperature variation, and the activities during the kernel execution cause large voltage droops. From these findings, we show how to use a kernel's microarchitectural performance counters to predict its Vmin value accurately. The average and maximum prediction errors are 0.5% and 3%, respectively. The accurate Vmin prediction opens up new possibilities of a cross-layer dynamic guardbanding scheme for GPUs, in which software predicts and manages the voltage guardband, while the functional correctness is ensured by a hardware safety net mechanism.