SC 2017
Conference paper

Topology-aware GPU scheduling for learning workloads in cloud environments



Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments. This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to ≈1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.


12 Nov 2017

