Publication
K8SAIHPCDAY 2023
Talk

Training Foundation Model Workloads on Kubernetes at Scale With MCAD

Abstract

The Vela cloud-native AI supercomputer was built to train foundation models on Kubernetes. Different research teams within IBM Research needed the flexibility to use the framework of their choice, such as PyTorch, Ray, or Spark, to train foundation models. Users also needed a way to queue custom resources of their choice to support experimentation, with high-level fault tolerance for training jobs that span hundreds of GPUs and run for weeks or months. In this talk, we describe the role the Multi-Cluster App Dispatcher (MCAD) plays in queuing the different custom resources required for large-scale AI training, and its interplay with the underlying scheduler installed on the target Kubernetes cluster, with gang priority, gang preemption, and fault tolerance in mind.