Publication
KubeCon EU 2024
Invited talk

Unleashing the Power of DRA (Dynamic Resource Allocation) for Just-in-Time GPU Slicing

Abstract

AI/ML practitioners using Kubernetes clusters to train, fine-tune, or serve large language models (LLMs) would like to allocate GPUs and GPU slices dynamically, based on the demands of their workloads. The DRA (Dynamic Resource Allocation) approach currently being developed by the community is promising, but it will require changes to Kubernetes scheduling mechanisms, introducing latency-inducing round trips between schedulers and DRA controllers. Moreover, GPU slices have to be requested by means of novel resource classes and claims, forcing users to adapt. This talk demonstrates how we exploit DRA today to enable just-in-time GPU slicing on large production Kubernetes clusters running a mix of small fractional and large distributed workloads. InstaSlice acts on queued AI workloads to slice GPUs with the help of DRA. By augmenting DRA with InstaSlice, we make it simple for users to benefit from DRA with zero changes to queued workloads and zero changes to Kubernetes schedulers.
