IBM Research unveils two key advances for foundation models

At this year’s Ray Summit, researchers at IBM showed off two features, running on top of Ray, that make it easier to set up and run foundation models for AI workloads.



We are deep into the era of Big Data, where systems that can analyze and extract value out of massive datasets have come to underpin many aspects of the way we live and work. But as the datasets have gotten bigger and the models data scientists use to explore them have gotten more complex, the amount of time that must be spent configuring experiments has also increased dramatically.

That’s a large part of the reason why we unveiled CodeFlare at last year’s Ray Summit. It’s an open-source framework, built on top of Ray, that makes it easier to set up, run, and scale machine learning models. The tool has already helped developers set up testing pipelines in a matter of minutes, rather than hours. In some cases, we’ve heard it has helped shave months off developer time previously spent setting up machine learning pipelines.

More recently, we refined CodeFlare to automate foundation model transfer learning on the hybrid cloud. That lets businesses automate their AI and machine-learning workflows there, while reducing the time it takes to train and deploy a foundation model.

At this year’s Ray Summit, we’re taking things even further. We’ll be discussing two new efforts at the conference that aim to make both the creation and utilization of foundation models at scale easier than ever before.

Less work up front for teams

In the first talk, Linsong Chu presents work he did with Rong Zhang that aims to cut down on the amount of pre- and post-processing that datasets need.

For organizations that have multiple teams of data scientists interested in natural-language processing, it’s possible to save more time by evaluating large NLP models using CodeFlare.

Validating large-scale language foundation models can be challenging. The models need to be fine-tuned, and their efficacy has to be gauged on the tasks they may be applied to. These tasks could tackle completely different problems, run in different environments, or receive different types of inputs. As such, different teams might have different pipelines for validating a model for their use. One team might run something like a GLUE benchmarking pipeline with nine sub-tasks, while another might run a sentiment analysis pipeline with 17 sub-tasks. It can take days to coordinate these pipelines and to restart or rerun sub-tasks after failures or when resource allocations change, such as when GPU allocations are increased or decreased.

Chu’s team looked into creating a tool that can support validating models for various tasks at any scale. They added Ray workflows into their pipeline, which resulted in better auto-scaling, better resource management, and unified workflows for different tasks. What could’ve taken 17 servers for five teams and weeks of work can now be achieved by one person coordinating for all teams in about 15 minutes in the cloud.

The work leverages Ray workflows, a component of Ray to which IBM Research is a key contributor, to achieve auto-scaling, dependency management, and better overall validation performance, while drastically increasing productivity on all teams involved.
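The resume-and-retry behavior described above can be illustrated with a small, Ray-free sketch. It is not the Ray Workflows API and not the team’s pipeline: the function `run_pipeline`, the `done` checkpoint dictionary, and the sub-task names are all hypothetical, chosen only to show how a unified runner might skip sub-tasks that already finished and retry the ones that failed.

```python
from typing import Callable, Dict, List


def run_pipeline(
    tasks: Dict[str, Callable[[], float]],
    done: Dict[str, float],
    max_attempts: int = 2,
) -> List[str]:
    """Run every sub-task not yet recorded in `done`; return names that still failed."""
    failed: List[str] = []
    for name, task in tasks.items():
        if name in done:
            continue  # finished in an earlier run, so do not repeat it
        for attempt in range(1, max_attempts + 1):
            try:
                done[name] = task()  # record the score as a checkpoint
                break
            except RuntimeError:
                if attempt == max_attempts:
                    failed.append(name)
    return failed


# Hypothetical sub-tasks: one stable, one that fails on its first call only.
calls = {"sentiment": 0}


def glue_subtask() -> float:
    return 0.91


def flaky_sentiment_subtask() -> float:
    calls["sentiment"] += 1
    if calls["sentiment"] == 1:
        raise RuntimeError("simulated lost GPU")
    return 0.87


checkpoint: Dict[str, float] = {}
still_failed = run_pipeline(
    {"glue": glue_subtask, "sentiment": flaky_sentiment_subtask}, checkpoint
)
```

Calling `run_pipeline` again with the same `checkpoint` would skip both sub-tasks, which is the property that makes a rerun after a failure cheap. A durable workflow engine would keep that checkpoint in persistent storage rather than in an in-memory dictionary.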

Making running foundation models less costly

In the second talk, Fred Reiss discusses the reality of running foundation models today. Although foundation models can be fine-tuned to handle a myriad of different tasks, they need to be copied and re-tuned for each new task. Put another way, they’re the engine in the vehicle that is the process you want to carry out. But people don’t tend to buy engines — they buy cars. For every copy of a foundation model tweaked to serve a new purpose, such as a model that translates to French and another variation that translates to Mandarin, you have to host a new version of that model. It becomes a giant fleet of vehicles that need to be maintained.

In modern cloud computing setups, loading a foundation model whenever it’s needed is expensive: orders of magnitude more costly than using an already-loaded model to perform inference. As a result, most applications need models to be up and running at all times. Foundation models are often quite large, with each individual copy potentially being a few gigabytes in size, so keeping many models constantly loaded comes with a significant cost. This cost is particularly acute for less-used, but still needed, containers, such as in a natural language processing system that supports 20 languages, where 17 languages aren’t used as frequently as the top three.

Reiss’s team has looked at cutting the cost of loading models to nearly nothing. The solution, called zero-copy model loading, stores the weights of a deep learning model in shared memory so that any process with access to that memory can load the model near-instantaneously.
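The core idea of reading weights straight from shared memory can be sketched with Python’s standard library alone. This is a minimal illustration under stated assumptions, not Reiss’s implementation and not Ray’s object store: the helpers `publish_weights` and `attach_weights` are hypothetical names, and a three-element list of floats stands in for a model’s weights.

```python
from array import array
from multiprocessing import shared_memory

ITEM_SIZE = array("d").itemsize  # bytes per float64 weight


def publish_weights(weights):
    """Write the weights into a new shared-memory segment (the only copy made)."""
    data = array("d", weights)
    nbytes = len(data) * ITEM_SIZE
    segment = shared_memory.SharedMemory(create=True, size=nbytes)
    segment.buf[:nbytes] = data.tobytes()
    return segment, len(data)


def attach_weights(name, count):
    """Attach to an existing segment and view it as floats without copying."""
    segment = shared_memory.SharedMemory(name=name)
    # The segment may be rounded up to a page size, so slice to the exact length.
    view = segment.buf[: count * ITEM_SIZE].cast("d")
    return segment, view


owner, count = publish_weights([0.5, -1.25, 3.0])
reader, weights_view = attach_weights(owner.name, count)
loaded = list(weights_view)  # copied out here only so it can be inspected

# A write through the owner is visible through the reader's view,
# which shows that both handles refer to the same memory.
owner.buf[:ITEM_SIZE] = array("d", [9.0]).tobytes()
first_after_write = weights_view[0]

# Views must be released before the segment is closed.
weights_view.release()
reader.close()
owner.close()
owner.unlink()
```

Here the reader attaches in the same process to keep the example short; a separate process could attach the same way using only the segment name and the element count. Attaching costs roughly the same regardless of how large the weights are, because nothing is deserialized or copied.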

While it would be possible to implement zero-copy model loading on other model-serving engines, Reiss argued that Ray is an especially good platform for this technique. Ray’s shared-memory object store, Plasma, makes it easy to implement zero-copy model loading. Because Ray tasks can load models instantly from the local Plasma shared memory segment, Ray’s scheduler can quickly redirect cluster resources towards whatever model the application currently needs to run the most.

Relying on Ray’s scheduler in this way frees users from having to tune the number of copies of each model their model serving system keeps loaded in memory. And that in turn leads to a dramatically simpler path to a scalable model deployment. In a benchmark study described in the talk, Reiss showed that he could run 12 deep learning models on one machine, where each model can be requested as needed, loaded, and unloaded, resulting in a sevenfold scalability improvement without any tuning.

In both instances, IBM Research is working on new ways to ensure that CodeFlare helps enterprise Ray users bring more parts of their machine-learning pipeline into an easy-to-use, end-to-end system. If you’re interested in learning more about how CodeFlare could help your business streamline its data research, sign up here to apply to take part in our beta program.