13 Dec 2023
News
4 minute read

Supercharging IBM’s cloud-native AI supercomputer

It’s been a year of massive strides in AI, with new technologies becoming household names and models with tens of billions of parameters becoming commonplace for real-world use cases. At IBM, we launched watsonx, the data and AI platform for enterprise, to bring these advanced AI capabilities to IBM customers across a wide variety of industries, leveraging many innovations that emerged from our IBM Research community.

There’s a growing need to design systems with the right compute capabilities to efficiently carry out the various stages of the AI lifecycle. This is partly why IBM decided to build Vela, an AI supercomputer in the IBM Cloud, last year. Vela allows us to efficiently deploy our AI workflows — from data pre-processing, model training and tuning, to deployment and even new product incubation — all within the IBM Cloud.

Vela was designed to be flexible and scalable, capable of training today’s large-scale generative AI models, and adaptable to new needs that may arise in the future. It was also designed such that its infrastructure could be efficiently deployed and managed anywhere in the world. Over the last year, AI practitioners from across IBM have trained and prototyped AI technologies on Vela, including IBM’s next-generation AI studio, watsonx.ai, which became generally available in July. Bringing a platform like watsonx.ai online so quickly around the world would not have been possible without Vela’s cloud-first design.

One year in, IBM is scaling Vela for what’s ahead. Today, we’re sharing several major upgrades we’ve made to Vela over the last year — including nearly doubling the capacity of the system and dramatically improving the speed of Vela’s network. Let’s break down what’s new, and how we made it happen.

Speeding up Vela

This wave of AI has unique interdependencies with the underlying infrastructure needed to train and deploy it. In the push toward bigger models trained over ever-larger data sets, moving faster means using more GPUs per job. As more GPUs compute in parallel, we need a commensurate increase in network performance to ensure that GPU-to-GPU communication doesn’t become a bottleneck to workload progress. This year, we deployed a major upgrade to the Vela network that allows us to efficiently scale individual training workloads to thousands of GPUs per job. The core enabling technologies we deployed on Vela were RoCE (RDMA over Converged Ethernet) and GDR (GPU-direct RDMA).

Remote direct memory access (RDMA) allows one processor to access another processor’s memory without involving either computer’s operating system, which speeds up communication by cutting out as many intervening steps as possible. GPU-direct RDMA extends this idea to GPUs: a GPU in one system can read from or write to the memory of a GPU in another system directly through the network cards (as shown in the figure below), over the Ethernet network. By enabling GPU-direct RDMA over our Ethernet network in Vela, we improved our network throughput by two to four times and reduced our network latency by six to 10 times.
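To make this concrete, here is a minimal, hypothetical sketch of how a distributed PyTorch training job might be pointed at a RoCE fabric with GDR enabled through NCCL environment variables. The interface and device names, and the use of a PyTorch/NCCL stack at all, are illustrative assumptions rather than a description of Vela’s actual software stack.

```python
# Illustrative sketch: enabling GPU-direct RDMA over a RoCE fabric for a
# PyTorch/NCCL training job. Interface and device names are hypothetical.
import os
import torch
import torch.distributed as dist

# NCCL uses the IB verbs API for RoCE; keeping the RDMA transport enabled lets
# GPU buffers move NIC-to-NIC without staging through host memory (GDR).
os.environ.setdefault("NCCL_IB_DISABLE", "0")        # keep RDMA transport on
os.environ.setdefault("NCCL_IB_HCA", "mlx5")         # hypothetical RoCE-capable NICs
os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")   # allow GDR across the system
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # hypothetical bootstrap interface

def main() -> None:
    # Rank, world size, and master address come from the launcher (e.g.
    # torchrun); NCCL then handles the GPU-to-GPU communication.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # A single all-reduce: with GDR over RoCE this collective takes the
    # GPU -> NIC -> NIC -> GPU path instead of bouncing through the host CPU.
    grad = torch.ones(1024, device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The key point is that once the fabric supports RoCE and GDR, the collective operations in a training job transparently take the direct GPU-to-NIC path, which is where the throughput and latency gains come from.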

We are also able to scale workloads out nearly linearly, training much larger models than was previously possible. This includes the 20 billion parameter Granite model we recently announced, which is a key enabler of our watsonx Code Assistant for Z service. The RoCE and GDR upgrade was several years of research in the making. It required simultaneous changes and enhancements to nearly every part of our cloud stack, from the system firmware to the host operating system, to virtualization, to the network underlay and overlay.

Diagram showing the difference in communication path before and after deployment of RoCE + GDR.

Increasing capacity

While Vela was designed to be expandable, the team wanted to do more than just add more GPUs to Vela; we wanted to do it in a space- and resource-efficient manner. In particular, we looked for a way to double the density of the server racks, which would roughly double capacity without increasing the space or networking equipment required.

After analyzing AI workload patterns, we determined we could carry out the capacity expansion within the power and cooling resources already available, without impacting workload performance. We then worked with our partners to develop a highly optimized power capping solution, which allows Vela to safely “overcommit” the amount of power available to a rack. We also developed a testing framework for all pertinent components to ensure everything worked safely after the expansion, without any detrimental impact on the system or the workloads that need to run efficiently on Vela. As a result, Vela now comprises roughly twice as many GPUs as it had prior to the upgrade.
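As a rough illustration of the idea, the sketch below uses NVIDIA’s NVML Python bindings (pynvml) to compare measured GPU power draw against a budget and lower the per-GPU power limit when the budget is exceeded. The budget values and the single-node policy are hypothetical simplifications, not IBM’s actual power capping solution.

```python
# Simplified sketch of GPU power capping via NVML (pynvml). The rack budget
# and cap values are hypothetical; this is not IBM's actual capping solution.
import pynvml

RACK_POWER_BUDGET_W = 30_000   # hypothetical per-rack budget
PER_GPU_CAP_W = 350            # hypothetical cap applied when overcommitted

def total_gpu_power_w() -> float:
    """Sum the current power draw (watts) across all GPUs on this host."""
    total_mw = 0
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        total_mw += pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in milliwatts
    return total_mw / 1000.0

def apply_power_cap(cap_w: int) -> None:
    """Lower each GPU's power management limit (requires admin privileges)."""
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        cap_mw = max(min_mw, min(cap_w * 1000, max_mw))  # clamp to supported range
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, cap_mw)

if __name__ == "__main__":
    pynvml.nvmlInit()
    try:
        # A real system would aggregate power across every host in the rack
        # and coordinate caps fleet-wide; this only looks at the local node.
        if total_gpu_power_w() > RACK_POWER_BUDGET_W:
            apply_power_cap(PER_GPU_CAP_W)
    finally:
        pynvml.nvmlShutdown()
```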

Architecture of Vela after capacity increase.

Improving operations and diagnostics

The team behind Vela also looked at ways to run the system more efficiently. Because of their complexity, AI servers have a higher failure rate than many traditional cloud systems, and they fail in unexpected (and sometimes hard-to-detect) ways. When nodes, or even individual GPUs, fail or degrade, it can impact the performance of an entire training job running over hundreds or thousands of them. Automation that detects these kinds of issues and raises alerts as quickly as possible is therefore important to keeping the environment productive.
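A minimal sketch of the kind of per-node health probe such automation builds on is shown below, using NVML’s Python bindings (pynvml) to flag uncorrected ECC errors and overheating. The thresholds and the alert handling are hypothetical; a production system would combine many more signals and aggregate them across the fleet.

```python
# Minimal sketch of a GPU health probe using NVML (pynvml). Thresholds and
# the alerting hook are hypothetical placeholders.
import pynvml

TEMP_LIMIT_C = 85  # hypothetical threshold for flagging a hot GPU

def check_gpus() -> list[str]:
    """Return human-readable alerts for GPUs on this host that look unhealthy."""
    alerts = []
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)

            # Uncorrected ECC errors since the last reset often indicate a GPU
            # that will silently slow down or crash a distributed training job.
            ecc = pynvml.nvmlDeviceGetTotalEccErrors(
                handle,
                pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                pynvml.NVML_VOLATILE_ECC,
            )
            if ecc > 0:
                alerts.append(f"GPU {i}: {ecc} uncorrected ECC errors")

            # Sustained high temperature is a common sign of degraded cooling.
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            if temp > TEMP_LIMIT_C:
                alerts.append(f"GPU {i}: temperature {temp} C exceeds {TEMP_LIMIT_C} C")
    finally:
        pynvml.nvmlShutdown()
    return alerts

if __name__ == "__main__":
    for alert in check_gpus():
        print(alert)  # in production this would feed an alerting pipeline
```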

This year, the IBM teams enhanced the automation in IBM Cloud, cutting the time it takes to find and understand these kinds of hardware failures and degradations on Vela in half. Now, servers can be brought back into the production fleet far faster than before. Lessons learned from managing an environment this complex have been rolled out more broadly to improve operations across the rest of IBM Cloud’s virtual private cloud (VPC) environment.

What’s next

Even before these upgrades, Vela was already a powerful platform that accelerated the launch and deployment of watsonx.ai all over the world, as well as the development of our core underlying platform, OpenShift AI. And with the latest infrastructure advancements in Vela, we’re training increasingly powerful models that will help solve some of the most pressing business problems our customers face.

In much the same way that this is still the start for the AI boom, this is just the beginning for IBM’s AI infrastructure innovation journey. Earlier this year, IBM announced the availability of additional GPU offerings on IBM Cloud, bringing to market innovative GPU infrastructure designed to train, tune and inference foundation models for enterprise workloads. And, with new IBM AI infrastructure technologies maturing, like the IBM AIU chip, so much more is going to be possible in the years ahead.