A second important choice concerned the design of the AI node itself. Given the desire to use Vela to train large models, we opted for GPUs with large memory (80 GB) and for a substantial amount of system memory and local storage on each node (1.5TB of DRAM and four 3.2TB NVMe drives). We anticipated that large memory and storage configurations would be important for caching AI training data, models, and other related artifacts, and for feeding the GPUs with data fast enough to keep them busy.
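To make the role of the node-local cache concrete, here is a minimal sketch of a data pipeline that stages training shards onto local NVMe once and serves every subsequent epoch from that cache. The paths, shard files, and loader settings are illustrative placeholders rather than part of Vela's actual software stack.

```python
import shutil
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical node-local NVMe mount used as a shard cache.
LOCAL_CACHE = Path("/nvme/cache")


def stage_shard(remote_path: str) -> Path:
    """Copy a training shard to local NVMe once; later epochs read the local copy."""
    LOCAL_CACHE.mkdir(parents=True, exist_ok=True)
    local = LOCAL_CACHE / Path(remote_path).name
    if not local.exists():
        # In a real pipeline this would fetch from object storage;
        # a plain file copy stands in for it here.
        shutil.copyfile(remote_path, local)
    return local


class CachedShardDataset(Dataset):
    """Serves tensors saved one per file, always from the local cache."""

    def __init__(self, remote_paths):
        self.local_paths = [stage_shard(p) for p in remote_paths]

    def __len__(self):
        return len(self.local_paths)

    def __getitem__(self, idx):
        return torch.load(self.local_paths[idx])


# Multiple workers and pinned memory help keep the GPUs busy.
loader = DataLoader(CachedShardDataset(["/data/shard-000.pt", "/data/shard-001.pt"]),
                    batch_size=2, num_workers=4, pin_memory=True)
```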
A third important dimension affecting the system’s performance is its network design. Given our desire to operate Vela as part of a cloud, building a separate InfiniBand-like network just for this system would defeat the purpose of the exercise. We needed to stick to the standard Ethernet-based networking that typically gets deployed in a cloud. But traditional supercomputing wisdom holds that you need a highly specialized network. The question therefore became: what do we need to do to prevent our standard, Ethernet-based network from becoming a significant bottleneck?
We started by enabling SR-IOV on the network interface cards of each node, exposing each 100G link directly to the VMs via virtual functions. In doing so, we were also able to use all of IBM Cloud’s VPC network capabilities, such as security groups, network access control lists, custom routes, private access to IBM Cloud PaaS services, and access to the Direct Link and Transit Gateway services.
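For readers unfamiliar with the mechanics, the sketch below shows roughly how virtual functions are carved out of an SR-IOV-capable NIC on a host via sysfs. The interface name and VF count are placeholders, and Vela's actual provisioning is automated as part of IBM Cloud's VPC rather than done by hand like this.

```python
from pathlib import Path

# Hypothetical 100G interface name and VF count; real values depend on the host.
IFACE = "ens1f0"
NUM_VFS = 4

sriov = Path(f"/sys/class/net/{IFACE}/device/sriov_numvfs")

# The kernel requires the count to be reset to 0 before it can be changed.
sriov.write_text("0")
sriov.write_text(str(NUM_VFS))

# Each virtual function now appears as a PCI device that can be passed
# directly into a VM, bypassing the host's software network stack.
for vf in sorted(Path(f"/sys/class/net/{IFACE}/device").glob("virtfn*")):
    print(vf.name, "->", vf.resolve().name)
```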
The results we recently published with PyTorch showed that by optimizing the workload’s communication patterns, which are controllable at the PyTorch level, we can hide the communication time over the network behind compute time on the GPUs. This approach is aided by our choice of GPUs with 80GB of memory (discussed above), which allows us to use bigger batch sizes than the 40GB model would and to leverage the Fully Sharded Data Parallel (FSDP) training strategy more efficiently. In this way, we can run distributed training with efficiencies of 90% and beyond for models with 10+ billion parameters. Next, we’ll be rolling out an implementation of remote direct memory access (RDMA) over Converged Ethernet (RoCE) and GPUDirect RDMA (GDR) at scale, to deliver their performance benefits while minimizing adverse impact on other traffic. Our lab measurements indicate that this will cut latency in half.
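As a rough illustration of the training-side setup, the snippet below wraps a model with PyTorch's FSDP so that parameters, gradients, and optimizer state are sharded across GPUs; the model and batch size are toy placeholders, not the actual configuration used on Vela.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy stand-in for a multi-billion-parameter transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# FSDP shards parameters, gradients, and optimizer state across all ranks,
# freeing GPU memory that can instead go toward larger batch sizes.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for _ in range(10):
    batch = torch.randn(32, 4096, device="cuda")  # bigger batches help hide communication
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```

Larger per-GPU batches mean more compute per optimizer step, giving the collective communication more time to overlap with GPU work rather than stalling it.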
Each of Vela’s nodes has eight 80GB A100 GPUs, which are connected to each other by NVLink and NVSwitch. In addition, each node has two 2nd Generation Intel Xeon Scalable processors (Cascade Lake), 1.5TB of DRAM, and four 3.2TB NVMe drives. To support distributed training, the compute nodes are connected by multiple 100G network interfaces arranged in a two-level Clos structure with no oversubscription. To support high availability, we built redundancy into the system: each port of a network interface card (NIC) is connected to a different top-of-rack (TOR) switch, and each TOR switch is connected via two 100G links to four spine switches, providing 1.6TB/s of cross-rack bandwidth and ensuring that the system can continue to operate despite the failure of any given NIC, TOR, or spine switch. Multiple microbenchmarks, including iperf and the NVIDIA Collective Communication Library (NCCL), show that applications can drive close to line rate for node-to-node TCP communication.
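A simple way to sanity-check that such a fabric delivers in practice is a collective microbenchmark along the lines of the sketch below, which times a large all-reduce over NCCL through torch.distributed. It is a simplified stand-in for the iperf and NCCL tests mentioned above, with arbitrary message sizes and iteration counts.

```python
import os
import time

import torch
import torch.distributed as dist

# Launched with torchrun across the nodes under test; NCCL handles GPU communication.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

tensor = torch.ones(256 * 1024 * 1024, device="cuda")  # 1 GiB of float32

# Warm up so connection setup is excluded from the measurement.
for _ in range(5):
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = (time.time() - start) / iters

gib = tensor.numel() * tensor.element_size() / 2**30
if dist.get_rank() == 0:
    # Algorithmic bandwidth only; NCCL's bus-bandwidth figure applies a topology factor.
    print(f"all_reduce of {gib:.1f} GiB took {elapsed * 1000:.1f} ms "
          f"({gib / elapsed:.1f} GiB/s per rank)")

dist.destroy_process_group()
```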
While this work was done with an eye towards delivering performance and flexibility for large-scale AI workloads, the infrastructure was designed to be deployable in any of our worldwide data centers at any scale. It is also natively integrated into IBM Cloud’s VPC environment, meaning that the AI workloads can use any of the more than 200 IBM Cloud services currently available. While the work was done in the context of a public cloud, the architecture could also be adopted for on-premises AI system design.
Having the right tools and infrastructure is a critical ingredient for R&D productivity. Many teams choose to follow the “tried and true” path of building traditional supercomputers for AI. While there is clearly nothing wrong with this approach, we’ve been working on a better solution that provides the dual benefits of high-performance computing and high end-user productivity, enabled by a hybrid cloud development experience.[2] Vela has been online since May 2022 and is in productive use by dozens of AI researchers at IBM Research, who are training models with tens of billions of parameters. We’re looking forward to sharing more about upcoming improvements to both end-user productivity and performance, enabled by emerging systems and software innovations.[2] We are also excited about the opportunities that will be enabled by our AI-optimized processor, the IBM AIU, and will be sharing more about this in future communications. The era of cloud-native AI supercomputing has only just begun. If you are considering building an AI system or want to know more, please contact us.
1. How to Deploy a High-Performance Distributed AI Training Cluster with NVIDIA A100 GPUs and KVM, GTC 2022.
2. Seetharami Seelam, keynote talk: Hardware-Middleware System Co-design for Foundation Models, 23rd ACM/IFIP International Middleware Conference, Quebec City, Quebec, Canada.