IBM Research and PyTorch have come together to enable foundation models with billions of parameters to easily run on standard cloud networking infrastructure, such as Ethernet networking.
The field of AI is in the middle of a revolution. In recent years, AI models have made images, songs, or even websites out of simple text prompts. These types of models with billions of parameters, called foundation models, can with little fine-tuning be repurposed from one task to another, saving the countless hours otherwise spent labelling data, training, and refitting a model to take on a new task.
Foundation models have been primarily trained on high-end high-performance computing (HPC) infrastructure, which, while reliable, is a costly barrier to entry for many looking to train foundation models for their own uses. These systems for training AI models have to be custom designed and rarely rely on commodity hardware options. Top-of-the-line GPUs are paired with low-latency InfiniBand network systems, which are costly to set up and run and also require bespoke operating processes, raising the cost even further.
Researchers at IBM have been working with the distributed team within PyTorch, the open-source machine learning platform run by the Linux Foundation, to find a way to train large AI models on affordable networking hardware. The group’s research has shown it’s possible to scale and train large models using regular Ethernet-based networking on Red Hat’s OpenShift platform.
With PyTorch’s FSDP, the team was able to successfully train models with 11 billion parameters using standard Ethernet networking on IBM Cloud. For models of this size, our approach achieves training efficiency on par with HPC networking systems, approaching performance once considered achievable only in traditional HPC environments.
Previous attempts to train models with billions of parameters on PyTorch with Ethernet resulted in poor performance, far below what you would need to train a foundation model. With cloud computing, systems expect to be fully allocated at all times. As AI models get larger, the standard methods for data parallel training work only if the GPU can hold a full replica of the model along with its training state. While new training techniques, like PyTorch’s Fully Sharded Data Parallel (FSDP) or DeepSpeed, can efficiently distribute a model and data over multiple GPUs during training, they ran efficiently only on HPC systems, not on Ethernet-connected systems. The joint team explored FSDP’s API and built a new control called rate_limiter, which controls how much memory is allocated for sending and receiving tensors, alleviating the memory pressure on the system and improving scaling efficiency 4.5 times over previous approaches.
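To make the memory argument concrete, here is a rough back-of-the-envelope sketch of why a full replica of an 11-billion-parameter model does not fit on a single 80GB GPU, and what sharding buys. The figure of 16 bytes of model state per parameter is an assumption, not something stated by the team: it is the commonly cited estimate for mixed-precision training with the Adam optimizer. The function name is illustrative, and activations and communication buffers are ignored.

```python
# Rough per-GPU estimate of model-state memory under mixed-precision Adam.
# Assumption (not from the article): 16 bytes of state per parameter =
#   2 (fp16 weights) + 2 (fp16 gradients)
#   + 12 (fp32 master weights, Adam momentum, Adam variance).
# Activations and send/receive buffers are deliberately left out.

BYTES_PER_PARAM = 2 + 2 + 12
GB = 10**9


def model_state_gb(num_params: int, num_shards: int = 1) -> float:
    """Model-state memory per GPU, in GB, when split evenly over num_shards GPUs."""
    return num_params * BYTES_PER_PARAM / num_shards / GB


params = 11 * 10**9      # 11B-parameter model
gpu_memory_gb = 80       # one 80GB GPU

full_replica = model_state_gb(params)               # classic data parallel
sharded_8 = model_state_gb(params, num_shards=8)    # state sharded over 8 GPUs

print(f"full replica:        {full_replica:.0f} GB per GPU")
print(f"sharded over 8 GPUs: {sharded_8:.0f} GB per GPU")
print(f"full replica fits in {gpu_memory_gb} GB: {full_replica <= gpu_memory_gb}")
```

Under these assumptions the full replica needs 176 GB per GPU, more than twice what an 80GB card offers, while sharding over eight GPUs brings the model state down to 22 GB per GPU. The price of sharding is extra communication to gather parameters when they are needed, which is where the network and the memory set aside for sending and receiving tensors come into play.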
The infrastructure the team used for this work was essentially off-the-shelf hardware. Running on the IBM Cloud, the system consists of 200 nodes, each with eight Nvidia A100 80GB cards, 96 vCPUs, and 1.2TB CPU RAM. The GPU cards within a node are connected via NVLink with a card-to-card bandwidth of 600GBps, and the nodes are connected together by two 100Gbps Ethernet links with an SR-IOV based TCP/IP stack, providing a usable bandwidth of 120Gbps (for the 11B model, we observed peak network bandwidth utilization of 32Gbps). The single root I/O virtualization (SR-IOV) interface is a PCIe specification that allows hardware like a network adapter to separate access to resources among PCIe hardware functions.
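The figures above can be put side by side with a few lines of arithmetic. This is only a sketch using the numbers quoted in this article; it assumes that "GBps" means gigabytes per second and "Gbps" means gigabits per second, and the variable names are illustrative.

```python
# Cluster figures as quoted in the article.
nodes = 200
gpus_per_node = 8
nvlink_gbytes_per_s = 600        # card-to-card NVLink bandwidth, GB/s (assumed bytes)
ethernet_links_per_node = 2
link_gbits_per_s = 100           # per Ethernet link, Gb/s
usable_gbits_per_s = 120         # usable node-to-node bandwidth, Gb/s
observed_peak_gbits_per_s = 32   # peak observed for the 11B model, Gb/s

total_gpus = nodes * gpus_per_node
raw_ethernet_gbits = ethernet_links_per_node * link_gbits_per_s

# Convert NVLink to gigabits per second so both links use the same unit.
nvlink_gbits_per_s = nvlink_gbytes_per_s * 8
nvlink_vs_ethernet = nvlink_gbits_per_s / usable_gbits_per_s

# Fraction of the usable Ethernet bandwidth taken up at peak.
peak_utilization = observed_peak_gbits_per_s / usable_gbits_per_s

print(f"total GPUs: {total_gpus}")
print(f"raw Ethernet per node: {raw_ethernet_gbits} Gb/s, usable: {usable_gbits_per_s} Gb/s")
print(f"NVLink is {nvlink_vs_ethernet:.0f}x the usable Ethernet bandwidth")
print(f"peak Ethernet utilization for the 11B model: {peak_utilization:.0%}")
```

With these numbers the cluster has 1,600 GPUs, the intra-node NVLink path is about 40 times faster than the usable node-to-node Ethernet bandwidth, and the 11B run peaked at roughly 27% of that usable Ethernet bandwidth.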
“We wanted to invest more in GPUs — not the networking hardware,” said Raghu Ganti, a Master Inventor at IBM Research working on scaling foundation models.
This GPU system has been running since May and is configured with the Red Hat OpenShift container platform to run AI workloads. The team is building a production-ready software stack for end-to-end training, fine-tuning, and inference of large AI models.
We believe this approach is the first in the industry to achieve scaling efficiencies for models with up to 11 billion parameters that use Kubernetes and PyTorch’s FSDP APIs with standard Ethernet. This will allow researchers and organizations to train massive models in any cloud in a far more cost-efficient and sustainable way. In 2023, the goal of the joint team is to continue scaling this technology to handle even larger models.