While Vela was designed to be expandable, we wanted to do more than simply add GPUs to it; we wanted to expand in a space- and resource-efficient manner. In particular, we looked for a way to double the density of the server racks, roughly doubling capacity without increasing the space or networking equipment required.
After analyzing AI workload patterns, we determined that we could expand capacity within the power and cooling resources already available, without impacting workload performance. We then worked with our partners to develop a highly optimized power-capping solution, which allows Vela to safely “overcommit” the amount of power available to a rack. Finally, we developed a testing framework covering all pertinent components, to ensure the expansion caused no detrimental impact to the system or to the workloads that need to run efficiently on Vela. As a result, Vela now comprises roughly twice as many GPUs as it had prior to the upgrade.
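The idea behind power overcommit is that the sum of the servers' nameplate ratings can exceed the rack's budget, as long as caps keep the actual draw below it. The sketch below illustrates that logic; it is not IBM's implementation, and the function name, data shapes, and headroom value are all invented for illustration.

```python
# Illustrative sketch of rack-level power overcommit via per-server
# capping. All names and thresholds here are hypothetical.

def plan_caps(rack_budget_w, servers, headroom=0.05):
    """Assign per-server power caps so total draw stays under the rack
    budget, even though nameplate power is overcommitted.

    servers: list of {"name": str, "draw_w": float} with measured draw.
    Returns: {name: cap_in_watts}.
    """
    total_draw = sum(s["draw_w"] for s in servers)
    usable = rack_budget_w * (1 - headroom)  # keep a safety margin
    caps = {}
    if total_draw <= usable:
        # Under budget: distribute the remaining slack proportionally,
        # so bursty workloads have room before capping kicks in.
        slack = usable - total_draw
        for s in servers:
            share = s["draw_w"] / total_draw
            caps[s["name"]] = s["draw_w"] + slack * share
    else:
        # Over budget: scale every server's cap down proportionally.
        scale = usable / total_draw
        for s in servers:
            caps[s["name"]] = s["draw_w"] * scale
    return caps
```

In a real system the caps would be enforced in hardware (for example, via the GPU or BMC power-limit interfaces) and recomputed continuously from telemetry; this sketch only shows the budgeting arithmetic.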
The team behind Vela also looked for ways to run the system more efficiently. Because of their complexity, AI servers have a higher failure rate than many traditional cloud systems, and they fail in unexpected (and sometimes hard-to-detect) ways. When nodes, or even individual GPUs, fail or degrade, the performance of an entire training job running across hundreds or thousands of them can suffer. Automation that detects these kinds of issues and raises alerts as quickly as possible is therefore essential to keeping the environment productive.
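A minimal version of such automation is a periodic sweep that compares per-GPU health metrics against thresholds and flags nodes for removal from the production pool. The sketch below assumes invented metric names and thresholds; production fleets would typically pull these signals from tooling such as NVIDIA's DCGM rather than from hand-built dictionaries.

```python
# Hypothetical GPU health-check sweep. Metric names and thresholds are
# illustrative, not IBM's actual checks.

THRESHOLDS = {
    "ecc_uncorrectable": 0,  # any uncorrectable ECC error is disqualifying
    "xid_errors": 0,         # driver-reported Xid error count
    "throttle_pct": 10.0,    # percent of time spent thermally throttled
}

def check_node(node_metrics):
    """Return a list of alert strings for one node's GPUs.

    node_metrics: {gpu_id: {metric_name: value}}.
    """
    alerts = []
    for gpu_id, metrics in node_metrics.items():
        for key, limit in THRESHOLDS.items():
            value = metrics.get(key, 0)
            if value > limit:
                alerts.append(f"{gpu_id}: {key}={value} exceeds limit {limit}")
    return alerts

def sweep(fleet):
    """Flag nodes that should be drained from the production pool.

    fleet: {node_name: node_metrics}. Returns only nodes with alerts.
    """
    flagged = {}
    for node, metrics in fleet.items():
        alerts = check_node(metrics)
        if alerts:
            flagged[node] = alerts
    return flagged
```

The value of automating this is exactly what the paragraph above describes: the sooner a degraded GPU is flagged and drained, the less time a large training job spends being dragged down by it.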
This year, IBM teams enhanced the automation in IBM Cloud, cutting in half the time it takes to find and diagnose these kinds of hardware failures and degradations on Vela. Servers can now be returned to the production fleet far faster than before. Lessons learned from managing an environment of this complexity have also been rolled out more broadly to improve operations across the rest of IBM Cloud’s virtual private cloud (VPC) environment.
Even before these upgrades, Vela was already a powerful platform that accelerated the launch and deployment of watsonx.ai all over the world, as well as the development of our core underlying platform, OpenShift AI. And with the latest infrastructure advancements in Vela, we’re training increasingly powerful models that will help solve some of the most pressing business problems our customers face.
Just as the AI boom is still in its early days, this is only the beginning of IBM’s AI infrastructure innovation journey. Earlier this year, IBM announced the availability of additional GPU offerings on IBM Cloud, bringing to market innovative GPU infrastructure designed to train, tune, and run inference on foundation models for enterprise workloads. And with new IBM AI infrastructure technologies maturing, like the IBM AIU chip, much more will be possible in the years ahead.