Topology-aware GPU scheduling for learning workloads in cloud environments

Marcelo Amaral; Jorda Polo; David Carrera; Seetharami Seelam; Malgorzata Steinder

doi:10.1145/3126908.3126933

SC 2017

Conference paper

12 Nov 2017

Abstract

Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments. This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to ≈1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.
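
To make the idea of topology-aware placement concrete, the following is a minimal illustrative sketch and not the algorithm from the paper. It assumes a hypothetical four-GPU machine described by a made-up LINK_COST matrix (lower cost means a faster link) and picks the set of free GPUs with the lowest total pairwise cost. The names LINK_COST, placement_cost, and pick_gpus are invented for this example, and interference between co-located jobs is not modeled.

```python
from itertools import combinations

# Hypothetical link costs between GPU pairs (lower = faster link).
# These values are illustrative assumptions, not measurements from the paper.
# GPUs 0-1 and 2-3 are assumed to share a socket; cross-socket pairs cost more.
LINK_COST = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 1],
    [4, 4, 1, 0],
]


def placement_cost(gpus, link_cost):
    """Sum of pairwise link costs among the chosen GPUs."""
    return sum(link_cost[a][b] for a, b in combinations(gpus, 2))


def pick_gpus(free_gpus, num_gpus, link_cost=LINK_COST):
    """Return the cheapest tuple of num_gpus free GPUs, or None if too few are free."""
    candidates = combinations(sorted(free_gpus), num_gpus)
    return min(candidates, key=lambda g: placement_cost(g, link_cost), default=None)
```

For example, with GPUs 0, 2, and 3 free, a two-GPU job would be placed on GPUs 2 and 3 because they share the cheaper link. Exhaustive search over combinations is only practical for the handful of GPUs in a single node; a scheduler for larger systems would need a cheaper heuristic.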
