IBM and PyTorch change one line of code to massively improve AI model training

IBM Research and PyTorch worked on a way to make checkpointing for AI training considerably more cost-effective.

We’re in the middle of an AI boom, powered by foundation models. Many of these models are huge in size, with training datasets that are even larger. They can have terabytes of data that are used for training models that have billions of parameters. These models can be used for all sorts of purposes, from generating music, to uncovering new molecules and automating massive enterprise processes.

IBM is working on building trustworthy AI foundation models for enterprises that run seamlessly on public and private clouds. We are building a cloud-native, open-source stack for the future of AI, which is powered by PyTorch to help make building AI systems simpler. During model training, a data scientist periodically writes checkpoints to a system’s permanent storage for fault tolerance and help recover from failures. To reduce wasted GPU time, writing checkpoints must be done quickly, usually this is done by writing to a high-speed shared file system like NFS. Perhaps unsurprisingly, as the model increases in size, so does the checkpoint data size. For example, a model with 11 billion parameters (at 32-bit precision) needs 130 GB of storage for a single checkpoint that includes model weights and the optimizer state.

We wanted to see if we could use a cheaper type of storage that would not sacrifice GPU time. A simple switch from using a shared file system to object storage resulted in unexplainable errors (such as NCCL timeouts and silent failures). IBM and PyTorch teams worked together to fix the distributed checkpointing within PyTorch to support object storages that have AWS S3 is Amazon's Simple Storage service. S3 APIs are standards that can be implemented by other object storage providers, such as IBM COS.S3 APIs, like IBM Cloud Object Storage. This required examining the differences between traditional file system APIs and S3FS (object store backed) file system APIs.

The current distributed checkpointing implementation assumes a Portable Operating System Interface (POSIX) compliant file system that guarantees strong read-after-write consistency. In PyTorch FSDP (Fully Sharded Data Parallel), each GPU is responsible for a shard of the model and optimizer state and writes tensors of a shard to a specific directory designated for that GPU. In a shared file system, when an orchestrator (rank0 GPU) creates a directory, it is visible to all the other GPUs immediately. However, in an eventually consistent file system layer like S3FS on top of an object store, this assumption fails.

Given these consistency problems manifest as non-reproducible errors, the teams had to comb through many log files and come up with many hypotheses to identify the root cause. At the end, the fix was to enable each GPU to create a directory if it does not exist in its local view. Since the writes are non-conflicting, S3FS eventual consistency semantics suffice. With this change, distributed checkpointing of PyTorch can support writing to object storages with S3FS APIs.

An alternative approach for writing checkpoints would have been to gather all the weights and optimizer state to a single node’s RAM, which can result in crashes during the all-gather phase, drastically wasting GPU time and serving as a major blocker. Our measurements showed that for an 11 billion-parameter model, it cuts down checkpointing time from more than an hour to just minutes. Also, object storage is one of the most affordable file storage systems there is, so by making the switch, the researchers could save on training costs.

“We’ve been using a race car when a commuter car would work,” Raghu Ganti, a principal researcher at IBM Research and one of the project leads, said of the type of storage the team had been using. “You can do sophisticated model training with the affordability of object storage.”

Combining this milestone with the team’s earlier work to use inexpensive networking equipment and improve memory scheduling, IBM and PyTorch are getting further down the path to where training and running AI models is quicker, more cost-effective, and able to scale to even larger models — all on IBM’s hybrid cloud platform.

Subscribe to our Future Forward newsletter and stay up to date on the latest research news

Subscribe to our newsletter

Notes

Note 1: AWS S3 is Amazon's Simple Storage service. S3 APIs are standards that can be implemented by other object storage providers, such as IBM COS. ↩︎

Can LLMs learn social skills by playing games?
Research
Kim Martineau
23 Jul 2025
- AI
- Generative AI
How IBM’s Kush Varshney became the face of the modern ‘camera man’
Q & A
Kim Martineau
21 Jul 2025
IBM’s new benchmark puts industrial agents to the test
Research
Kim Martineau
15 Jul 2025
How AI could help stretch the life of industrial equipment
Research
Kim Martineau
02 Jul 2025

Notes

Related posts

Can LLMs learn social skills by playing games?

How IBM’s Kush Varshney became the face of the modern ‘camera man’

IBM’s new benchmark puts industrial agents to the test

How AI could help stretch the life of industrial equipment