16 Nov 2022

The future of foundation models requires innovations across the entire stack

To truly harness the potential of foundation models, there needs to be significant investment in the entire stack — not just the models themselves.


At IBM Research, we’re excited about the potential of foundation models. To us, they represent an important paradigm shift in AI. A single (very large) natural-language foundation model can be made to perform well on many different language-related tasks with far less data than was historically needed to train individual task-specific models one by one. This adaptation to new tasks can be done by simply “prompting” the model to do something new; when provided with a task description and a few examples, these models appear to take on new tasks without being explicitly re-trained and without any modification to the model itself. This ability to re-purpose a foundation model for many use cases without additional training puts these models in a league of their own.
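To make the idea of prompting concrete, here is a minimal sketch of how a task description and a few examples might be assembled into a single prompt. The function name, the "Input:"/"Output:" layout, and the sentiment task are all illustrative assumptions, not a format prescribed by any particular model.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: task description, worked examples, new input.

    The "Input:"/"Output:" labels are a hypothetical convention chosen for
    this sketch; real models may respond better to other layouts.
    """
    lines = [task_description.strip(), ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final "Output:" is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [
        ("Great battery life.", "positive"),
        ("Stopped working after a week.", "negative"),
    ],
    "The screen is sharp and bright.",
)
print(prompt)
```

The resulting string would be sent to the model as-is; switching to a different task means changing only the description and the examples, not the model.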

While much of the early innovation in foundation models was geared towards language, we are seeing mounting evidence that the approach applies to many domains, from manufacturing, to automation for developers writing code, to the discovery of new materials and drugs. We shared a recent example of the transformational potential of these models when we announced Project Wisdom: developers can now use natural-language commands, like “Deploy web application stack” or “Install Nodejs dependencies,” and Project Wisdom will parse them and build an automation workflow to accomplish the task, delivered as an Ansible Playbook. This has the potential to dramatically boost developer productivity, extending the power of AI assistance to new domains.

While there is mounting excitement around the potential that surrounds this next wave of AI, the reality is that working with foundation models is remarkably complex. The end-to-end process of going from data to a functional model that’s ready to deploy can take weeks of manual work and often considerable compute power. To truly harness the potential of foundation models, there needs to be significant investment in the entire stack — not just the models themselves. We can co-design systems and software to bring the end-user the optimal environment for maximizing their AI productivity.

Foundation model full stack

Most organizations have gaps in important areas, from skills, to automation and tooling, to the availability of the right infrastructure. We experienced some of these challenges ourselves, which is why we’ve been investing in technologies across the hybrid cloud stack that enable our own AI researchers to move faster, easily share and port workloads and experiments into new environments, automate key parts of their AI workflows, and maximize their infrastructure utilization and efficiency. We’re very excited to share the fruits of that work with the AI community, and IBM’s customers, over the coming months.

There are three main categories of innovations we’ve been developing in service of making AI researchers and developers more agile. The first is around workflow automation and simplification. We’ve been working closely with key open-source communities, including Ray and PyTorch, to adopt and contribute new capabilities. With Ray, we’ve been working to simplify all of the data pre- and post-processing steps of the AI workflow, including data de-duplication, removal of hate, abuse, and profanity, and removal of placeholder values, as well as simplifying model adaptation and validation after the model is trained. With PyTorch, we’ve been working on efficiently scaling distributed training to larger models over more standard infrastructure like Ethernet networking. By leaning into these key projects, the broader AI research community can benefit from the many enhancements that are rapidly accruing in these communities and leverage a single optimized stack that can make everyone more productive.
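As a rough illustration of the kind of data pre-processing described here (de-duplication, filtering of hate, abuse, and profanity, and removal of placeholder values), the sketch below runs those three filters over a handful of strings in plain Python. It is not the Ray-based pipeline referred to in the text; the block-list, the placeholder set, and the function names are hypothetical stand-ins, and a production filter would be far more sophisticated.

```python
import re

# Hypothetical stand-ins: a real pipeline would use curated lists or classifiers.
BLOCKLIST = {"darn", "heck"}
PLACEHOLDERS = {"", "n/a", "null", "none", "tbd", "lorem ipsum"}


def normalize(text):
    """Collapse whitespace and lowercase so near-identical records compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(records):
    """Drop placeholder values, exact (normalized) duplicates, and block-listed text."""
    seen = set()
    kept = []
    for text in records:
        key = normalize(text)
        if key in PLACEHOLDERS:
            continue  # placeholder value
        if key in seen:
            continue  # duplicate of an earlier record
        if set(re.findall(r"[a-z']+", key)) & BLOCKLIST:
            continue  # contains a block-listed word
        seen.add(key)
        kept.append(text.strip())
    return kept


sample = [
    "Foundation models adapt to new tasks.",
    "foundation  models adapt to new tasks.",
    "N/A",
    "This darn cluster is slow.",
    "Containers package software together.",
]
cleaned = clean_corpus(sample)
print(cleaned)
```

Each filter is independent of the others, which is what makes steps like these natural candidates for being distributed across many workers on large corpora.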

The second key innovation area we’re driving is in Kubernetes and our OpenShift hybrid cloud platform, to meet the unique demands of these workflows. Many people assume you need a traditional high-performance computing (HPC) environment to run these jobs efficiently. They think about things like bare-metal nodes, InfiniBand networking, HPC schedulers, and file systems. But these environments each operate with their own software stacks, managed by HPC administrators, and porting workloads across them can be a challenge. Containerizing workloads, by contrast, allows us to package all the software we need together. This makes it easier for teams to share code and results and eliminates dependencies on someone else choosing to support the libraries you need.

We’ve jumped into cloud-native AI with both feet, standing up the largest and highest-performance installation of OpenShift that we are aware of, and moving all of our foundation models research to this platform. To get our AI workflows to run efficiently in containers and with high performance on OpenShift, we are building many of the key value propositions of HPC environments into the platform itself, including sophisticated job management, job auto-scaling, and automation for optimal network configuration, among many others, and delivering them as a service. Now, our cloud-native workloads can run with high performance and hybrid cloud flexibility anywhere that supports Kubernetes, which is an increasingly large number of places.

The third innovation area is in working to develop high-performance, flexible, AI-optimized infrastructure, delivered as a service, for both training and serving foundation models. Training today’s large models requires a lot of GPUs. Historically, researchers have relied on HPC clusters because that’s where a sufficiently large number of GPUs could be found alongside high-performance networking. But moving data between cloud environments and on-premises HPC systems is time-consuming, and sometimes forbidden. We knew that putting the capability of a supercomputer in the same location as our data would help us move faster and avoid time spent complying with complex data policies. We also wanted to retain the flexibility and services that come along with cloud computing. So we built a supercomputer (accessible as a service) natively into the fabric of IBM Cloud’s virtual private cloud offering.

Our system, built from NVIDIA A100 GPUs and flexible Ethernet-based networking, is now the primary environment where we conduct our foundation model research and development, which has shifted dramatically towards cloud-native work on OpenShift. This means that our researchers are running containerized distributed training jobs, orchestrating hundreds of GPUs, to build models with over 10 billion parameters on this system. These jobs, running on our OpenShift installation, are achieving between 80% and 90% GPU utilization, a level of infrastructure performance and efficiency often reserved for traditional supercomputing environments. Our infrastructure vision doesn’t stop at training, or even at the GPU. We recently shared our work to develop a next-generation AI chip, the IBM AIU, which brings innovations in reduced-precision AI computing to the next level. We expect this chip to deliver significant energy efficiency benefits over traditional chips.

We’re passionate about inventing the next generation of AI and about inventing technologies to help us all move faster and work at the forefront of innovation. The capabilities we’re creating across the stack are making it easier to build the most advanced AI models and get them productively and ubiquitously deployed. We can’t wait to share our progress in all of these areas over the coming weeks and months, and to enable our partners to benefit from the tools and technologies we’re building.

