A cloud-native, open-source stack for accelerating foundation model innovation
Foundation models and generative AI have captivated our collective imagination and opened up new ways to improve how we live and work. From more seamless interactions with technology via natural language, to automatic generation of code and other data, to use cases across various domains of science, applications of foundation models are growing by the day. At IBM, our goal is both to infuse this technology across our product portfolio and to help our customers adopt foundation models into their own offerings quickly, efficiently, and safely.
As part of this journey, we shared our perspective on why we built Vela, an AI supercomputer, in the IBM Cloud. That work was part of a larger effort to reimagine our full technology stack to accelerate how we train, fine-tune, and deploy cutting-edge AI models. Through this process, we’ve built a modern, flexible AI software stack optimized for the foundation model era.
In this blog, we’ll describe our high-performing, cloud-native AI training stack running on the Red Hat OpenShift Container Platform that serves as the foundation for the newly launched watsonx platform.
Complementing our training stack is our technology stack for tuning and serving foundation models in a cost- and performance-optimized manner. Many of the technologies described below have already been contributed to open-source communities like PyTorch, Ray, KServe, and Open Data Hub (ODH), an open-source platform for building, deploying, and managing data-intensive applications on Kubernetes. Technologies matured in ODH feed into Red Hat OpenShift AI, which in turn underpins IBM’s next-generation AI platform, watsonx.ai. With this approach, IBM and Red Hat can provide customers with a state-of-the-art open-source foundation model stack that runs in any environment of their choosing (on-premises, on IBM Cloud, or in other public clouds).
As we set out to reimagine our AI training stack, we had two goals in mind. First, we wanted to keep the utility of traditional HPC systems: maximum hardware utilization and efficient use of high-performance infrastructure. Second, we wanted to deliver the flexibility and productivity benefits that come from a hybrid cloud development experience: greater development agility, code reuse, and simplicity of managing and scaling infrastructure and software. To achieve the second goal, we built our solution with Kubernetes, where containers provide the path to reuse code and scale software. This decision, however, meant that we’d need to turn Kubernetes into a platform for high-performance workloads.
We also needed a solution that addressed each step of our AI training workflow: data pre-processing, distributed training, and model validation. We identified key open-source communities to partner with to address the whole workflow end-to-end, and key user-experience barriers we needed to overcome for users to launch, run, and scale their jobs.
The left side of Figure 1 below provides the overall picture of our training software stack, which has been running on Vela in IBM Cloud since late 2022 and is in use across IBM Research. The right side of Figure 1 depicts our stack for tuning and serving foundation models, which will be discussed later in the blog.
Advanced Kubernetes-native resource utilization and management
When we started this work, the Kubernetes ecosystem still had significant gaps for large-scale, high-performance AI workloads. One early area of focus was how to expose infrastructure capabilities, like network resources, to the workload without incurring additional overhead. To this end, we created a multi-NIC CNI operator that configures the underlying network interfaces, cutting network latency in half by eliminating encapsulation and increasing bandwidth sevenfold compared to the out-of-the-box container networking solution. These improvements are completely transparent to the end user.
The second gap we sought to fill was the lack of a suitable cloud-native job scheduler. With so many AI developers wanting to submit jobs to run on Vela, we needed a scheduler to allocate resources and prioritize jobs to maximize resource utilization. To solve this problem, IBM researchers created the multi-cluster app dispatcher (MCAD), which provides job queueing, job priorities and preemption, timeouts, and orchestration of resource sharing among the users of the system. In addition, we enabled workload packing and gang scheduling to eliminate resource fragmentation, all running on top of OpenShift. We further developed InstaScale, which works with MCAD to dynamically scale cloud-hosted OpenShift clusters. By automatically acquiring and releasing GPUs on demand from the cloud provider, InstaScale frees practitioners from worrying about infrastructure management and cost.
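To make the queueing model concrete, here is a minimal sketch of submitting a GPU job to MCAD by creating an AppWrapper custom resource with the Kubernetes Python client. The API group, version, and spec fields shown are assumptions that vary across MCAD releases (the container image and resource requests are placeholders), so check the CRD installed on your cluster before using anything like this.

```python
# Illustrative sketch only: queue a batch Job through MCAD via an AppWrapper.
# Group/version and spec layout are assumptions; they differ by MCAD release.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

appwrapper = {
    "apiVersion": "workload.codeflare.dev/v1beta1",  # assumed group/version
    "kind": "AppWrapper",
    "metadata": {"name": "train-job", "namespace": "demo"},
    "spec": {
        # MCAD queues the wrapped resources and only creates them once the
        # requested GPUs can be allocated as a gang.
        "resources": {
            "GenericItems": [{
                "replicas": 1,
                "generictemplate": {
                    "apiVersion": "batch/v1",
                    "kind": "Job",
                    "metadata": {"name": "train-job"},
                    "spec": {
                        "template": {
                            "spec": {
                                "restartPolicy": "Never",
                                "containers": [{
                                    "name": "trainer",
                                    "image": "quay.io/example/trainer:latest",  # hypothetical image
                                    "resources": {"limits": {"nvidia.com/gpu": 8}},
                                }],
                            }
                        }
                    },
                },
            }]
        }
    },
}

api.create_namespaced_custom_object(
    group="workload.codeflare.dev", version="v1beta1",
    namespace="demo", plural="appwrappers", body=appwrapper,
)
```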
Scalable, efficient data pre-processing, model training and validation
To make it simple and efficient to run all the steps in the AI pipeline, we have focused on leveraging, and contributing to, two key open-source technologies: PyTorch and Ray. With Ray, we enable scalable data pre-processing steps (such as filtering data with hate, abuse, and profanity filters) and post-processing steps (like model fine-tuning and validation) through a data scientist-friendly Python API. By running Ray with MCAD, we support efficient sharing of resource pools by heterogeneous Ray jobs running concurrently.
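The sketch below illustrates the kind of Ray-based pre-processing described above: filtering raw text shards with a simple keyword block list. The real hate, abuse, and profanity filters are far more sophisticated; the file paths, block list, and record schema here are placeholders.

```python
import ray

ray.init()  # connects to an existing Ray cluster if one is configured

BLOCKLIST = {"badword1", "badword2"}  # placeholder terms

def is_clean(row: dict) -> bool:
    """Keep only documents that contain no blocked terms."""
    tokens = set(row["text"].lower().split())
    return tokens.isdisjoint(BLOCKLIST)

# Hypothetical object-store locations; each JSON record is assumed to have a "text" field.
docs = ray.data.read_json("s3://my-bucket/raw-corpus/")
clean_docs = docs.filter(is_clean)
clean_docs.write_parquet("s3://my-bucket/clean-corpus/")
```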
We are collaborating with PyTorch to advance support for distributed training, including evolving support for Fully Sharded Data Parallel (FSDP) training APIs through the introduction of rate_limiter. We recently demonstrated efficient scaling of distributed training jobs for models with 10B+ parameters over Ethernet-based environments like Vela in IBM Cloud. And, by integrating MCAD with TorchX, a universal job launcher for PyTorch applications, we are able to transparently support a wide range of PyTorch-based jobs using different APIs and frameworks. All of these diverse jobs benefit from the underlying job management system without requiring code modifications on the part of the AI practitioner.
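As a rough illustration of the FSDP path mentioned above, here is a simplified sketch of wrapping a model with PyTorch FSDP in a torchrun-launched script (one process per GPU). The `limit_all_gathers` flag enables the rate-limiting behavior in recent PyTorch releases; the model, data, and hyperparameters are placeholders, not the configuration used for the 10B+ parameter runs.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Transformer(d_model=512, nhead=8).cuda()  # placeholder model
model = FSDP(model, limit_all_gathers=True)  # shards params/grads/optimizer state

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
src = torch.rand(10, 32, 512, device="cuda")  # toy batch for illustration
tgt = torch.rand(20, 32, 512, device="cuda")

loss = model(src, tgt).sum()  # toy objective
loss.backward()
optimizer.step()

dist.destroy_process_group()
```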
Simplified user experience
The training portion of the workflow itself occurs in several steps: model exploration (usually a scaled-down experiment run with a few GPUs), scaling the distributed training job (consuming hundreds of GPUs), and finally, model validation. Orchestrating these steps can be complex for many AI practitioners, and time is lost configuring and managing them. We addressed this challenge through Project CodeFlare, which provides a guided, simplified user experience to efficiently train, test, and monitor the model training life cycle.
The CodeFlare CLI (a console-based UI) guides users through the complexities of running against a remote OpenShift cluster while automating job configuration, storage setup, logging, and endpoints for monitoring and profiling. The CodeFlare SDK (which is Jupyter-based) provides users with an intuitive Python interface for batch resource requesting, job submission, and observation. With these features, we significantly lower the barrier to entry to a cloud-native stack for our AI research colleagues.
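Here is a sketch of the Jupyter-side flow with the CodeFlare SDK: request a Ray cluster through MCAD (with InstaScale autoscaling) and then run workloads against it. Class and parameter names follow the SDK at the time of writing but should be treated as illustrative; consult the codeflare-sdk documentation for the exact API.

```python
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

cluster = Cluster(ClusterConfiguration(
    name="ft-demo",
    namespace="demo",
    num_workers=2,
    min_cpus=8, max_cpus=8,
    min_memory=64, max_memory=64,   # GiB
    num_gpus=4,
    instascale=True,                # let InstaScale acquire cloud nodes on demand
))

cluster.up()          # creates the AppWrapper; MCAD queues and admits it
cluster.wait_ready()  # blocks until the Ray cluster is running
print(cluster.details())

# ... submit Ray or training jobs against cluster.cluster_uri() ...

cluster.down()        # releases the resources (and the cloud nodes)
```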
Operationalizing our stack on Vela
By the end of 2022, all of IBM’s foundation model training work transitioned to running with this software stack on Vela in IBM Cloud. Today, MCAD manages the queue of these AI jobs, from single-GPU jobs to those leveraging more than 512 GPUs, and handles job prioritization and quota management. Along this journey, we discovered additional ways we could make life easier for teams managing OpenShift clusters in GPU-centric environments like Vela, e.g., by enhancing the OpenShift Installer Provisioned Infrastructure (IPI) to make it easier to deploy and manage OpenShift on high-performance infrastructure.
Training and validating state-of-the-art foundation models are the critical early stages of the AI value chain, but true value is ultimately captured when models get put to productive use in the tuning and inferencing steps of the AI workflow. Our software stack for inference and model tuning is focused on executing models efficiently on the underlying hardware, batching incoming requests in an optimal way, simplifying the integration of AI into applications, and providing state-of-the-art techniques for model adaptation. The right side of Figure 1 above depicts our foundation model tuning and serving stack, which is described in more detail below.
Inference performance
Software libraries that optimize how foundation models run on a given hardware platform can improve throughput and latency by 10-100x. Our serving stack includes a curated set of mature optimization paths (including ONNX and Hugging Face Optimum) for inferencing common model architectures and is extensible to accommodate new inference servers or optimizations as they emerge. Extensibility is a key design point for our stack, given the rapid pace of innovation in the AI and open-source communities. In addition, real AI services receive a high volume of inference requests from multiple users, which may target multiple models in parallel. Our serving stack dynamically batches incoming requests and efficiently multiplexes between models by building upon, and contributing back to, the Hugging Face, KServe, and ModelMesh communities.
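As a small illustration of one such optimization path, the sketch below exports a Hugging Face checkpoint to ONNX Runtime via Optimum and serves it behind a standard pipeline. The model name is just a public example, and the argument names assume a recent optimum.onnxruntime release.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example public model

# Export the PyTorch checkpoint to ONNX and load it with ONNX Runtime.
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)
print(classifier("Foundation models are remarkably useful."))
```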
Simplifying application integration
Inference servers available today for running AI models require significant AI-specific knowledge on the part of the user: the input to the model is a tensor, and the output of the model is a tensor. This format is not approachable for application developers looking to leverage these models to accomplish a task, so the model output must be converted into something more consumable. We have created an abstraction layer, called Caikit, that provides intuitive APIs and data models for application developers, and offers a stable interface that allows the model and application to evolve independently. This abstraction is used in IBM’s Watson model-serving infrastructure and will soon be contributed to open source.
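The deliberately generic sketch below shows the abstraction-layer idea (this is not the actual Caikit API): the application calls a task-level method with plain strings and gets back a typed result, while tensor handling stays hidden behind a stable interface. The module, data class, and model choice are all hypothetical.

```python
from dataclasses import dataclass
from transformers import pipeline


@dataclass
class SentimentResult:
    """Stable, application-facing data model."""
    label: str
    score: float


class SentimentModule:
    """Wraps a concrete model so it can be swapped without touching callers."""

    def __init__(self, model_id: str = "distilbert-base-uncased-finetuned-sst-2-english"):
        self._pipe = pipeline("text-classification", model=model_id)

    def run(self, text: str) -> SentimentResult:
        raw = self._pipe(text)[0]  # tensor handling and post-processing hidden here
        return SentimentResult(label=raw["label"], score=float(raw["score"]))


result = SentimentModule().run("This stack makes serving models much easier.")
print(result.label, round(result.score, 3))
```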
Foundation model tuning
One of the key value propositions of foundation models is the ability to take a pre-trained base model and “tune” or “adapt” it using specialized data to improve its performance on a downstream task. Our goal is to package state-of-the-art techniques for compute-efficient model adaptation and make them easy to use, requiring little knowledge of how they work. Our extensible stack currently supports Multi-task Prompt Tuning (MPT) and fine-tuning, integrated through an open-source project called PEFT (Parameter-Efficient Fine-Tuning). Over the next few months, we will be open-sourcing a number of our prompt-tuning algorithms and implementations.
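For orientation, here is a minimal prompt-tuning sketch using the open-source PEFT library referenced above. It shows the single-task PromptTuningConfig path; the multi-task prompt tuning (MPT) variant follows the same pattern. The base model and hyperparameters are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")  # small example model
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=16,          # soft-prompt length; tunable
)

model = get_peft_model(base, peft_config)
model.print_trainable_parameters()  # only the soft prompt is trainable

# ...train with the usual Trainer or PyTorch loop, then model.save_pretrained("./prompt")
```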
IBM Research is working with Red Hat to enable others to benefit from this work by contributing the capabilities we’ve developed to key open-source communities and directly to Open Data Hub (ODH). ODH is a comprehensive collection of open-source tools designed to leverage the strengths of OpenShift to facilitate the entire AI development lifecycle. Many of the technologies introduced in Open Data Hub mature to become part of Red Hat OpenShift AI and serve as the middleware base for watsonx.ai. Figure 2 shows how the various contributions to open source described in this blog will come together into ODH to support foundation model use cases.
Reimagining our end-to-end software stack for the foundation models era has had considerable value for our AI community. AI researchers no longer need very detailed infrastructure knowledge to get jobs to run with high performance. They no longer need to figure out how to scale jobs from a few GPUs to hundreds, or how exactly to distribute the jobs to achieve high workload performance — these tasks are handled by the software stack. Code is reusable across teams, and experiments are easily reproducible by others. We’ve also considerably simplified how AI developers can serve and tune foundation models with high compute efficiency and in a developer-friendly manner.
Perhaps most importantly, building this stack on OpenShift provides portability to other environments so our partners can leverage these capabilities on-premises and in any public cloud. Together with Red Hat, we are excited to bring these innovations to the open-source community through the Open Data Hub, advance the state of the art in AI workflows on Kubernetes, and pave the way to adoption of these innovations into Red Hat OpenShift AI and watsonx.ai. With this approach, we are enabling an enterprise-ready platform for the end-to-end life cycle of foundation models. We look forward to collaborating with you in upstream communities.