Most organizations have gaps in important areas, from skills to automation and tooling to the availability of the right infrastructure. We've experienced some of these challenges ourselves, which is why we've been investing in technologies across the hybrid cloud stack that enable our own AI researchers to move faster, easily share and port workloads and experiments into new environments, automate key parts of their AI workflows, and maximize their infrastructure utilization and efficiency. We're very excited to share the fruits of that work with the AI community, and with IBM's customers, over the coming months.
There are three main categories of innovations we've been developing in service of making AI researchers and developers more agile. The first is workflow automation and simplification. We've been working closely with key open-source communities, including Ray and PyTorch, to adopt and contribute new capabilities. With Ray, we've been working to simplify the data pre- and post-processing steps of the AI workflow, including data de-duplication, removal of hate, abuse, and profanity, and removal of placeholder values, as well as to simplify model adaptation and validation after a model is trained. With PyTorch, we've been working on efficiently scaling distributed training to larger models over more standard infrastructure, like Ethernet networking. By leaning into these key projects, the broader AI research community can benefit from the many enhancements rapidly accruing in these communities and leverage a single, optimized stack that makes everyone more productive.
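To make the data-processing side of this concrete, here is a minimal sketch of the kind of pipeline Ray enables, written with Ray Data. The file paths, the blocklist-based hate/abuse/profanity check, and the hash-based de-duplication are hypothetical stand-ins for illustration, not our actual pipeline or classifiers.

```python
# Sketch: distributed corpus cleanup with Ray Data (paths and filters are placeholders).
import hashlib
import ray

ray.init()

# Load raw text documents as a distributed dataset.
ds = ray.data.read_json("s3://my-bucket/raw-corpus/")  # hypothetical location

# Drop documents flagged by a (placeholder) hate/abuse/profanity check.
BLOCKLIST = {"badword1", "badword2"}  # stand-in for a real HAP classifier

def passes_hap_filter(row):
    tokens = set(row["text"].lower().split())
    return not (tokens & BLOCKLIST)

ds = ds.filter(passes_hap_filter)

# Exact de-duplication: fingerprint each document and keep one copy per fingerprint.
def add_fingerprint(row):
    row["fingerprint"] = hashlib.sha256(row["text"].encode()).hexdigest()
    return row

ds = ds.map(add_fingerprint)
ds = ds.groupby("fingerprint").map_groups(lambda group: group.head(1), batch_format="pandas")

# Write the cleaned corpus back out for training.
ds.write_parquet("s3://my-bucket/clean-corpus/")
```

Because each step is an operation on a distributed dataset, the same script scales from a laptop-sized sample to a full pretraining corpus without changing the code, which is the productivity point we care about.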
The second key innovation area we're driving is evolving Kubernetes and our OpenShift hybrid cloud platform to meet the unique demands of these workloads. Many people assume you need a traditional high-performance computing environment to run these jobs efficiently, with bare-metal nodes, InfiniBand networking, HPC schedulers, and dedicated file systems. But those environments each operate with their own software stacks, managed by HPC administrators, and porting workloads across them can be a challenge. Containerizing workloads, by contrast, lets us package all the software we need together. This makes it easier for teams to share code and results, and it eliminates dependencies on someone else choosing to support the libraries you need.
We've jumped into cloud-native AI with both feet, standing up the largest and highest-performance installation of OpenShift that we are aware of, and moving all of our foundation model research to this platform. To get our AI workflows running efficiently in containers, and with high performance on OpenShift, we are building many of the key value propositions of HPC environments into the platform itself, including sophisticated job management, job auto-scaling, and automation for optimal network configuration, and delivering them as a service. Now our cloud-native workloads can run with high performance and hybrid cloud flexibility anywhere that supports Kubernetes, which is an increasingly large number of places.
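The managed job queuing and auto-scaling described above sit on top of ordinary Kubernetes primitives. As a rough illustration, here is a minimal sketch of submitting a containerized, GPU-backed training job with the official Kubernetes Python client; the image name, namespace, command, and resource counts are hypothetical, and our platform layers its own scheduling and scaling services on top of this kind of call.

```python
# Sketch: submitting a containerized training job to a Kubernetes/OpenShift cluster.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="trainer",
    image="registry.example.com/fm-training:latest",     # hypothetical image
    command=["torchrun", "--nproc_per_node=8", "train.py"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "8"}                    # request 8 GPUs for this pod
    ),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="fm-train-demo"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=0,
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="research", body=job)
```

Everything the job needs ships inside the container image, so the same submission works on any cluster that exposes the Kubernetes API and GPUs, which is exactly the portability argument above.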
The third innovation area is developing high-performance, flexible, AI-optimized infrastructure, delivered as a service, for both training and serving foundation models. Training today's large models requires a lot of GPUs. Historically, researchers have relied on HPC clusters because that's where a sufficiently large number of GPUs, connected by high-performance networking, could be found. But moving data between cloud environments and on-premises HPC systems is time-consuming, and sometimes forbidden outright. We knew that putting the capability of a supercomputer in the same location as our data would help us move faster and avoid time spent complying with complex data policies. We also wanted to retain the flexibility and services that come with cloud computing. So we built a supercomputer, accessible as a service, natively into the fabric of IBM Cloud's virtual private cloud offering.
Our system, built from Nvidia A100 GPUs and flexible Ethernet-based networking, is now the primary environment where we conduct our foundation model research and development, which has shifted dramatically toward cloud-native work on OpenShift. Our researchers are running containerized distributed training jobs that orchestrate hundreds of GPUs to build models with over 10 billion parameters on this system. These jobs, running on the largest, highest-performance installation of OpenShift that we are aware of, achieve between 80% and 90% GPU utilization, a level of infrastructure performance and efficiency usually reserved for traditional supercomputing environments. Our infrastructure vision doesn't stop at training, or even at the GPU. We recently shared our work to develop a next-generation AI chip, the IBM AIU, which takes innovations in reduced-precision AI computing to the next level. We expect this chip to deliver significant energy-efficiency benefits over traditional chips.
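For readers curious what one of those containerized training jobs looks like in code, here is a minimal sketch of a distributed-training entry point using PyTorch FSDP with bf16 mixed precision, the general approach for sharding large models across many GPUs while keeping communication efficient over NCCL and standard Ethernet. The toy model, hyperparameters, and training loop are placeholders rather than our actual training code; a launcher such as torchrun inside the container sets the rank environment variables.

```python
# Sketch: FSDP training entry point with bf16 mixed precision (placeholder model and loop).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    # The launcher (e.g. torchrun) sets RANK, WORLD_SIZE, and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real run would build a multi-billion-parameter transformer.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
    ).cuda()

    # Shard parameters, gradients, and optimizer state across ranks,
    # communicating in bf16 to reduce pressure on the network.
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        device_id=local_rank,
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(10):  # placeholder training loop with synthetic data
        batch = torch.randn(8, 4096, device="cuda")
        loss = model(batch).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Packaged into a container image and submitted as a job like the one shown earlier, a script of this shape is what keeps hundreds of GPUs busy at the utilization levels described above.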
We’re passionate about inventing the next generation of AI and about inventing technologies to help us all move faster and work at the forefront of innovation. The capabilities we’re creating across the stack are making it easier to build the most advanced AI models and get them productively and ubiquitously deployed. We can’t wait to share our progress in all of these areas over the coming weeks and months, and to enable our partners to benefit from the tools and technologies we’re building.