Towards Optimal Preemptive GPU Time-Sharing for Edge Model Serving

Zhengxu Xia; Yitian Hao; Jun Duan; Chen Wang; Junchen Jiang

MIDDLEWARE 2023

Workshop paper

11 Dec 2023

Abstract

With GPUs increasingly shared by DNN models at the edge, a crucial tradeoff arises between high GPU utilization and the ability to preempt quickly when a high-priority request arrives. To reduce inference delay, an inference job can “burst” DNN kernels into the GPU to maintain high GPU utilization, but this also leaves outstanding kernels queued inside the GPU, causing a substantial preemption delay because the GPU must clear the queued kernels before the high-priority request can preempt. Existing systems can alleviate the preemption delay by adding synchronization points, but they fail to keep both inference delay and preemption delay low because they cannot insert the synchronization points optimally for varying workloads. Our measurements and analysis show that the impact of inserting synchronization points on the inference and preemption delays varies greatly with a range of workload characteristics, including the DNN architecture, input size, batch size of different requests, and GPU type, yet most of these workload factors are overlooked in current shared edge systems. Inspired by these findings, we make a case for a new module in shared edge systems that dynamically inserts kernel synchronization points depending on the workload characteristics and service-level objective (SLO) deadlines. To examine its potential, we present Deft, a concrete proof-of-concept prototype that shares a GPU among multiple DNN containers. The key component of Deft is a profiling-based delay predictor that estimates the impact of synchronization points on the inference delay and preemption delay and dynamically selects a synchronization-point insertion frequency that minimizes inference delay or SLO violations. Compared to state-of-the-art GPU sharing schemes, Deft reduces inference delay and SLO violations by 28% and 14%, respectively. While we intentionally keep Deft’s design simple, it already shows the early promise of adding synchronization points dynamically and highlights key questions for future research.
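
The tradeoff described in the abstract can be illustrated with a minimal sketch. This is not Deft's actual predictor, whose details the abstract does not give. It assumes a simple hypothetical cost model: profiled per-kernel run times are known, each synchronization point adds a fixed overhead, and the worst-case preemption delay equals the run time of the longest group of kernels issued between two synchronization points. The names `predict_delays`, `choose_interval`, `sync_overhead_ms`, and `preemption_budget_ms` are illustrative assumptions.

```python
def predict_delays(kernel_ms, sync_overhead_ms, interval):
    """Estimate delays when a sync point follows every `interval` kernels.

    Assumed model (hypothetical, not taken from the paper):
      inference delay  = total kernel time + one sync overhead per group
      preemption delay = run time of the longest group of bursted kernels
    """
    groups = [kernel_ms[i:i + interval]
              for i in range(0, len(kernel_ms), interval)]
    inference = sum(kernel_ms) + sync_overhead_ms * len(groups)
    preemption = max(sum(group) for group in groups)
    return inference, preemption


def choose_interval(kernel_ms, sync_overhead_ms, preemption_budget_ms):
    """Pick the interval with the lowest predicted inference delay whose
    predicted worst-case preemption delay stays within the budget.

    Returns (interval, inference_ms, preemption_ms), or None if no
    interval meets the budget.
    """
    best = None
    for interval in range(1, len(kernel_ms) + 1):
        inference, preemption = predict_delays(
            kernel_ms, sync_overhead_ms, interval)
        if preemption > preemption_budget_ms:
            continue  # a high-priority request could wait too long
        if best is None or inference < best[1]:
            best = (interval, inference, preemption)
    return best


if __name__ == "__main__":
    # Made-up profile: four kernels with run times in milliseconds.
    profile = [1.0, 2.0, 3.0, 4.0]
    print(choose_interval(profile, sync_overhead_ms=0.5,
                          preemption_budget_ms=7.0))
```

Under this model, a tighter preemption budget forces more frequent synchronization and therefore a higher predicted inference delay. A real predictor would also have to account for the workload factors the abstract lists, such as DNN architecture, input size, batch size, and GPU type.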
