When we started this work, the Kubernetes ecosystem still had significant gaps for large-scale, high-performance AI workloads. One early area of focus was how to expose infrastructure capabilities, such as network resources, to the workload without incurring additional overhead. To this end, we created a multi-NIC CNI operator that configures the underlying network interfaces, cutting network latency in half by eliminating encapsulation and increasing bandwidth sevenfold compared to the out-of-the-box container networking solution. These improvements are completely transparent to the end user.
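As a rough illustration of how a workload opts in to extra network interfaces, the sketch below uses the official kubernetes Python client to request a secondary network through a Multus-style pod annotation. The attachment name, container image, and GPU count are placeholders, not the operator's actual configuration; the exact NetworkAttachmentDefinition to reference is supplied by the cluster administrator.

```python
# A minimal sketch (not the operator's exact interface): requesting a secondary
# network interface for a training pod via a Multus-style annotation.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="training-worker",
        # Standard Multus annotation; the referenced attachment "multi-nic-sample"
        # is a placeholder assumed to be managed by the multi-NIC CNI operator.
        annotations={"k8s.v1.cni.cncf.io/networks": "multi-nic-sample"},
    ),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="trainer",
                image="quay.io/example/trainer:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "8"}
                ),
            )
        ],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```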
The second gap we sought to fill was the lack of a suitable cloud-native job scheduler. With so many AI developers wanting to submit jobs to run on Vela, we needed a scheduler that allocates resources and prioritizes jobs to maximize resource utilization. To solve this problem, IBM researchers created the multi-cluster app dispatcher (MCAD), which provides job queueing, job priorities and preemption, timeouts, and orchestration of resource sharing among the users of the system. In addition, we enabled workload packing and gang scheduling to eliminate resource fragmentation, all running on top of OpenShift. We further developed InstaScale, which works with MCAD to dynamically scale cloud-hosted OpenShift clusters. By automatically acquiring and releasing GPUs on demand from the cloud provider, InstaScale frees practitioners from worrying about infrastructure management and cost.
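To give a sense of what submitting work through MCAD can look like, the sketch below creates an AppWrapper custom resource with the kubernetes Python client. The API group/version (mcad.ibm.com/v1beta1) and the spec layout shown are assumptions based on public MCAD examples and should be checked against the CRD installed on your cluster; the wrapped Job, image, and GPU count are placeholders.

```python
# A minimal sketch of queueing a job through MCAD by creating an AppWrapper
# custom resource. Field names follow public MCAD examples; verify against
# the CRD version installed on your cluster.
from kubernetes import client, config

config.load_kube_config()

appwrapper = {
    "apiVersion": "mcad.ibm.com/v1beta1",
    "kind": "AppWrapper",
    "metadata": {"name": "fm-train-job", "namespace": "default"},
    "spec": {
        "priority": 5,  # used for queue ordering and preemption decisions
        "resources": {
            "GenericItems": [
                {
                    "replicas": 1,
                    # Any Kubernetes object (Job, PyTorchJob, RayCluster, ...)
                    # can be wrapped; a batch/v1 Job is used as a placeholder.
                    "generictemplate": {
                        "apiVersion": "batch/v1",
                        "kind": "Job",
                        "metadata": {"name": "fm-train-job"},
                        "spec": {
                            "template": {
                                "spec": {
                                    "containers": [{
                                        "name": "trainer",
                                        "image": "quay.io/example/trainer:latest",
                                        "resources": {"limits": {"nvidia.com/gpu": "8"}},
                                    }],
                                    "restartPolicy": "Never",
                                }
                            }
                        },
                    },
                }
            ]
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="mcad.ibm.com", version="v1beta1",
    namespace="default", plural="appwrappers", body=appwrapper,
)
```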
To make it simple and efficient to run all the steps in the AI pipeline, we have focused on leveraging, and contributing to, two key open-source technologies: PyTorch and Ray. With Ray, we enable scalable data preprocessing (such as filtering data using hate, abuse, and profanity filters) and post-processing steps (like model fine-tuning and validation) through a data scientist-friendly Python API. By running Ray with MCAD, we support efficient sharing of resource pools by heterogeneous Ray jobs running concurrently.
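To give a flavor of that Python API, here is a minimal Ray sketch that filters document shards in parallel; the looks_clean function and blocklist are stand-ins for a real hate, abuse, and profanity classifier.

```python
# A minimal sketch of Ray-based preprocessing: filtering document shards in
# parallel with a placeholder hate/abuse/profanity (HAP) check.
import ray

ray.init()  # connects to an existing Ray cluster if one is configured

BLOCKLIST = {"badword1", "badword2"}  # placeholder for a trained HAP model


def looks_clean(text: str) -> bool:
    # Stand-in for a real HAP classifier.
    return not any(term in text.lower() for term in BLOCKLIST)


@ray.remote
def filter_shard(docs: list[str]) -> list[str]:
    # Each shard is filtered on a separate Ray worker.
    return [d for d in docs if looks_clean(d)]


shards = [["a fine sentence", "badword1 here"], ["another clean doc"]]
futures = [filter_shard.remote(shard) for shard in shards]
clean = [doc for shard in ray.get(futures) for doc in shard]
print(clean)
```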
We are collaborating with PyTorch to advance support for distributed training, including evolving support for Fully Sharded Data Parallel (FSDP) training APIs through the introduction of rate_limiter. We recently demonstrated efficient scaling of distributed training jobs for models with 10B+ parameters over Ethernet-based environments like Vela in IBM Cloud. And, by integrating MCAD with TorchX, a universal job launcher for PyTorch applications, we are able to transparently support a wide range of PyTorch-based jobs using different APIs and frameworks. All of these diverse jobs benefit from the underlying job management system without requiring code modifications on the part of the AI practitioner.
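For illustration, the sketch below wraps a placeholder model with FSDP and enables limit_all_gathers, the constructor flag that corresponds to the rate-limiter behavior (availability and defaults vary by PyTorch version). It is intended to be launched with torchrun, one process per GPU; the model and training loop are stand-ins, not our actual workload.

```python
# A minimal sketch of sharded distributed training with PyTorch FSDP.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = torch.nn.Sequential(          # placeholder for a transformer
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    # Shard parameters, gradients, and optimizer state across ranks;
    # limit_all_gathers enables the rate-limiter behavior described above.
    model = FSDP(model, limit_all_gathers=True)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).square().mean()
        loss.backward()
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```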
The training portion of the workflow itself occurs in several steps: model exploration (usually a scaled-down experiment run with a few GPUs), scaling the distributed training job (consuming hundreds of GPUs), and finally, model validation. Orchestrating these steps can be complex for many AI practitioners, and time is lost configuring and managing them. We addressed this challenge through Project CodeFlare, which provides a guided, simplified user experience to efficiently train, test, and monitor the model training life cycle.
The CodeFlare CLI (a console-based UI) guides users through the complexities of running against a remote OpenShift cluster while automating job configuration, storage setup, logging, and endpoints for monitoring and profiling. The CodeFlare SDK (which is Jupyter-based) provides users with an intuitive Python interface for batch resource requesting, job submission, and observation. With these features, we significantly lower the barrier to entry to a cloud-native stack for our AI research colleagues.
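A minimal sketch of the SDK flow from a notebook is shown below. The class and parameter names follow the SDK's early public examples and may differ between releases; the cluster name and resource sizes are placeholders.

```python
# A minimal sketch of requesting a batch Ray cluster with the CodeFlare SDK,
# waiting for it to come up, and releasing it afterwards.
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

cluster = Cluster(ClusterConfiguration(
    name="fm-tune",
    namespace="default",
    num_workers=2,
    min_cpus=4, max_cpus=4,
    min_memory=16, max_memory=16,   # GiB per worker
    num_gpus=1,
))

cluster.up()          # creates the queued resource (an MCAD AppWrapper) on OpenShift
cluster.wait_ready()  # blocks until the Ray cluster is running
print(cluster.details())

# ... submit Ray or Torch jobs against the cluster here ...

cluster.down()        # releases the GPUs back to the shared pool
```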
By the end of 2022, all of IBM’s foundation model training work transitioned to running with this software stack on Vela in IBM Cloud. Today, MCAD manages the queue of these AI jobs, from single-GPU jobs to those leveraging more than 512 GPUs, and handles job prioritization and quota management. Along this journey, we discovered additional ways we could make life easier for teams managing OpenShift clusters in GPU-centric environments like Vela, e.g., by enhancing the OpenShift Installer Provisioned Infrastructure (IPI) to make it easier to deploy and manage OpenShift on high-performance infrastructure.
Training and validating state-of-the-art foundation models are the critical early stages of the AI value chain, but true value is ultimately captured when models are put to productive use in the tuning and inferencing steps of the AI workflow. Our software stack for inference and model tuning focuses on executing models efficiently on the underlying hardware, batching incoming requests optimally, simplifying the integration of AI into applications, and providing state-of-the-art techniques for model adaptation. The right side of Figure 1 above depicts our foundation model tuning and serving stack, which is described in more detail below.
Software libraries that optimize the way foundation models run on a given hardware platform can improve throughput and latency by 10-100x. Our serving stack includes a curated set of mature optimization paths (including ONNX and Hugging Face Optimum) for inferencing common model architectures and is extensible to accommodate new inference servers or optimizations as they emerge. Extensibility is a key design point for our stack, given the rapid pace of innovation in the AI and open-source communities. In addition, real AI services receive a high volume of inference requests from multiple users, which may target multiple models in parallel. Our serving stack dynamically batches incoming requests and efficiently multiplexes between models by building upon, and contributing back to, the Hugging Face, KServe, and ModelMesh communities.
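As one concrete example of such an optimization path, the sketch below exports a public Hugging Face model to ONNX via Optimum and runs it through ONNX Runtime. The model is a generic public example rather than one of the models discussed here, and the export flag may differ across Optimum releases.

```python
# A minimal sketch of the ONNX / Hugging Face Optimum optimization path:
# export a transformer checkpoint to ONNX and run it with ONNX Runtime.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # public example
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to ONNX on the fly
# (older Optimum releases used from_transformers=True instead).
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Foundation models are remarkably useful."))
```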
Inference servers available today for running AI models require significant AI-specific knowledge on the part of the user. The input to the model is a tensor, and the output of the model is a tensor. This format is not approachable for application developers looking to leverage these models to accomplish a task. To make this process more developer friendly, the model output must be converted into something more consumable. We have created an abstraction layer, called Caikit, that provides intuitive APIs and data models for application developers, as well as a stable interface that allows the model and application to evolve independently. This abstraction is used in IBM’s Watson model serving infrastructure and will soon be contributed to open source.
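The hypothetical sketch below (not Caikit's actual API) illustrates the idea: the application works with task-level inputs and outputs, while tokenization and tensor handling stay behind a stable interface, so the backing model can change without touching application code.

```python
# A hypothetical illustration of a task-oriented abstraction over a model;
# class and field names here are invented for the example.
from dataclasses import dataclass
from transformers import pipeline


@dataclass
class SentimentResult:
    label: str        # task-level output the application consumes directly
    score: float


class SentimentTask:
    """Stable task interface; the backing model can evolve independently."""

    def __init__(self, model_id: str = "distilbert-base-uncased-finetuned-sst-2-english"):
        # Tokenization and tensor I/O are hidden inside the pipeline.
        self._pipe = pipeline("text-classification", model=model_id)

    def run(self, text: str) -> SentimentResult:
        raw = self._pipe(text)[0]          # model-specific dict
        return SentimentResult(label=raw["label"], score=float(raw["score"]))


result = SentimentTask().run("This stack makes serving models much easier.")
print(result.label, round(result.score, 3))
```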
One of the key value propositions of foundation models is the ability to take a pre-trained base model and “tune” or “adapt” it using specialized data to improve its performance on a downstream task. Our goal is to package state-of-the-art techniques for compute-efficient model adaptation and make them easy to use, even with little knowledge of how they work. Our extensible stack currently supports Multi-task Prompt Tuning (MPT) and fine-tuning, integrated through an open-source project called PEFT (Parameter-Efficient Fine-Tuning). Over the next few months, we will be open-sourcing a number of our prompt tuning algorithms and implementations.
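As a small illustration of the PEFT integration point, the sketch below attaches a handful of trainable soft-prompt tokens to a frozen base model; the base model and hyperparameters are placeholders chosen for illustration.

```python
# A minimal sketch of prompt tuning with the open-source PEFT library:
# only a small set of virtual-token embeddings is trained, not the base model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

base_id = "bigscience/bloom-560m"  # small public model used for illustration
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,          # only these soft-prompt embeddings are trained
)

model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()  # reports the tiny trainable fraction
# The wrapped model can now be trained with a standard Trainer or training loop.
```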
IBM Research is working with Red Hat to enable others to benefit from this work by contributing the capabilities we’ve developed to key open-source communities and directly to Open Data Hub (ODH). ODH is a comprehensive collection of open-source tools designed to leverage the strengths of OpenShift to facilitate the entire AI development lifecycle. Many of the technologies introduced in Open Data Hub mature to become part of Red Hat OpenShift AI and serve as the middleware base for watsonx.ai. Figure 2 shows how the various contributions to open source described in this blog will come together into ODH to support foundation model use cases.
Reimagining our end-to-end software stack for the foundation models era has had considerable value for our AI community. AI researchers no longer need very detailed infrastructure knowledge to get jobs to run with high performance. They no longer need to figure out how to scale jobs from a few GPUs to hundreds, or how exactly to distribute the jobs to achieve high workload performance — these tasks are handled by the software stack. Code is reusable across teams, and experiments are easily reproducible by others. We’ve also considerably simplified how AI developers can serve and tune foundation models with high compute efficiency and in a developer-friendly manner.
Perhaps most importantly, building this stack on OpenShift provides portability to other environments so our partners can leverage these capabilities on-premises and in any public cloud. Together with Red Hat, we are excited to bring these innovations to the open-source community through the Open Data Hub, advance the state of the art in AI workflows on Kubernetes, and pave the way to adoption of these innovations into Red Hat OpenShift AI and watsonx.ai. With this approach, we are enabling an enterprise-ready platform for the end-to-end life cycle of foundation models. We look forward to collaborating with you in upstream communities.