Publication
CNCF-hosted Co-located Events North America 2024
Talk

Incremental GPU Slicing in Action


Abstract

Large language models are often released as families of models with varying parameter counts and bit widths. To reduce cost, inference services increasingly rely on dynamic model selection, preferring smaller models when possible. GPU vendors are moving toward dynamic GPU slicing, which lets a workload request a fraction of the compute and memory units in a GPU and lets slices be created and destroyed on demand without disrupting existing workloads. The onus is now on Kubernetes, where the Device Management Working Group is working to expose these capabilities. While vendor-agnostic slicing APIs do not exist yet, this talk demonstrates that incremental GPU slicing is possible today. We replace the Multi-Instance GPU (MIG) manager, which only permits partitioning GPUs in bulk, with an open-source incremental-slicing controller, without requiring new APIs or changes to the device plugin. Come learn how to achieve incremental slicing in your GPU clusters.
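
To make the contrast with bulk partitioning concrete, here is a minimal toy sketch of incremental slicing. It is not the controller presented in the talk: the `Gpu` class, its methods, and the first-fit placement rule are hypothetical, and the model assumes a GPU divided into seven equal compute slots, ignoring the memory units and the stricter placement rules that real hardware imposes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: seven compute slots per GPU, chosen for illustration only.
TOTAL_SLOTS = 7


@dataclass
class Gpu:
    """Toy model of one GPU; each slot holds the name of its owning workload."""
    slots: List[Optional[str]] = field(
        default_factory=lambda: [None] * TOTAL_SLOTS
    )

    def create_slice(self, workload: str, size: int) -> Optional[int]:
        """Claim `size` contiguous free slots (first fit).

        Returns the starting slot, or None if no free run is large enough.
        Slots owned by other workloads are never touched.
        """
        for start in range(TOTAL_SLOTS - size + 1):
            if all(s is None for s in self.slots[start:start + size]):
                for i in range(start, start + size):
                    self.slots[i] = workload
                return start
        return None

    def destroy_slice(self, workload: str) -> None:
        """Free only the slots owned by `workload`."""
        self.slots = [None if s == workload else s for s in self.slots]


gpu = Gpu()
start_a = gpu.create_slice("llm-small", 3)   # first slice
start_b = gpu.create_slice("llm-medium", 2)  # added later, next to the first
gpu.destroy_slice("llm-small")               # "llm-medium" keeps its slots
start_c = gpu.create_slice("llm-tiny", 2)    # reuses the freed slots
print(gpu.slots)
```

In this sketch each slice is created or destroyed on its own, so a running workload keeps its slots while its neighbors change. A bulk approach would instead recompute the whole layout of the GPU at once.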