Marcelo Amaral
OSSEU 2023
With GPU servers increasingly shared by containerized DNNs that have highly diverse inference-delay SLOs, we observe an emerging need for a scheduler that, without changing container applications, can dynamically estimate the remaining time of each DNN job in order to determine which kernel calls should preempt the incumbent DNN inference on a shared GPU. This project presents such a scheduler on top of Kubernetes, called DEFT. Our preliminary results show that, compared to existing solutions, DEFT reduces SLO violations because (1) it allows preempting a DNN inference at the kernel level rather than treating the DNN inference as a whole, and (2) it makes preemption decisions based on the remaining time of each competing DNN job, rather than on a static weight per DNN job or the duration of individual kernel calls.
Max Bloomfield, Amogh Wasti, et al.
ITherm 2025
Evaline Ju, Kelly Abuelsaad
KubeCon EU 2026
Nikoleta Iliakopoulou, Jovan Stojkovic, et al.
MICRO 2025