Technical note

IBM Storage Scale delivers real-world performance: an in-depth analysis

IBM's Blue Vela supercomputer.

The infrastructure required to train large language models (LLMs) has become a central focus in the field of artificial intelligence, largely because of how intricate and resource-intensive the task is at scale. IBM designs AI clusters engineered to handle multiple diverse workloads concurrently, streamlining the stages of the training process, including pre-training, fine-tuning, evaluation, and even inference. This multifaceted approach not only enhances efficiency but also maximizes resource utilization, making it a compelling choice for organizations navigating the complexities of AI development.

In the spring of 2024, IBM Research unveiled the Blue Vela supercomputer, designed and built in partnership with NVIDIA and Dell. Blue Vela was conceived to facilitate the training of the Granite family of models and is based on the NVIDIA H100 GPU Platform. As customer zero, this cluster was also the first to receive IBM Storage Scale System 6000s.

We are excited to announce the latest MLPerf Storage benchmark results, which showcase the capabilities of IBM Storage Scale paired with the Storage Scale System 6000 on cutting-edge AI workloads.

As the parameter count of large language models continues to grow, there is a corresponding increase in checkpoint sizes, which can become cumbersome and challenging to manage. Each checkpoint serves as a recovery point in the training process; should a checkpoint write fail, the job must fall back to the last successful checkpoint, losing all progress made since then. That loss not only represents wasted computational resources but also translates into significant financial costs for organizations training these models at scale.

The Llama models, which range in size from 8 billion to 1 trillion parameters, are a popular family of models in the industry. The checkpointing benchmark exercises the reading and writing of checkpoints during training, a critical step that involves saving the state of a model at regular intervals.
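To make the workload concrete, the sketch below shows the basic checkpoint save and restore pattern in PyTorch, in which the model weights and the optimizer state are persisted together. It is purely illustrative of the pattern the benchmark exercises, not the MLPerf Storage harness itself.

```python
# Illustrative checkpoint save/restore pattern, not the MLPerf Storage harness.
import torch

def save_checkpoint(model, optimizer, step, path):
    # A checkpoint captures both the model weights and the optimizer state,
    # which is why it is much larger than the weights alone.
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path):
    # Restore weights and optimizer state, then resume from the saved step.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]
```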

The results from our benchmarks were impressive: our tests with the Llama 3.1 1T model demonstrated a read bandwidth of 656.7 GiB/s and a write bandwidth of 412.6 GiB/s. To put this into perspective, that translates to roughly 23 seconds to load a model checkpoint and about 37 seconds to save one. The Llama 3 405B model delivered comparable bandwidth, with 624.7 GiB/s on reads and 384.7 GiB/s on writes, or approximately 8.5 seconds to load and 14 seconds to save its smaller checkpoint. These figures underscore not only the speed but also the reliability of our storage solutions under the demanding requirements of AI workloads.
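As a rough consistency check, multiplying the reported bandwidths by the load and save times gives the implied checkpoint sizes. The published times are rounded, so the results below are approximate:

```python
# Back-of-envelope check: implied checkpoint size ≈ bandwidth × time.
GIB = 2**30

def implied_size_tb(bandwidth_gib_s, seconds):
    # Convert GiB/s × s into decimal terabytes.
    return bandwidth_gib_s * GIB * seconds / 1e12

# Llama 3.1 1T: reads at ~656.7 GiB/s for ~23 s, writes at ~412.6 GiB/s for ~37 s
print(implied_size_tb(656.7, 23))   # ~16 TB read
print(implied_size_tb(412.6, 37))   # ~16 TB written
# Llama 3 405B: reads at ~624.7 GiB/s for ~8.5 s, writes at ~384.7 GiB/s for ~14 s
print(implied_size_tb(624.7, 8.5))  # ~5.7 TB read
print(implied_size_tb(384.7, 14))   # ~5.8 TB written
```

Both directions land on roughly the same size for each model, in line with the checkpoint sizes discussed below.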

The model weights themselves account for only a portion of the data in a given checkpoint; the bulk is the optimizer state of the training job at checkpoint time. For example, the checkpoint for the 8 billion parameter model totals 105 GB, of which the optimizer state makes up 90 GB. For the 1 trillion parameter model, the optimizer state accounts for a whopping 13.2 TB of the 15 TB of checkpoint data.
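A back-of-envelope estimate explains those proportions. Assuming bf16 weights (2 bytes per parameter) and Adam-style optimizer state held in fp32 (master weights plus momentum and variance, roughly 12 bytes per parameter), the breakdown lands close to the sizes above. These byte counts are illustrative assumptions, not the benchmark's exact layout:

```python
# Rough checkpoint size estimate, assuming bf16 weights (2 B/param) and
# Adam-style fp32 optimizer state (~12 B/param). Illustrative assumptions only.
def checkpoint_breakdown_gb(params):
    weights_gb = params * 2 / 1e9
    optimizer_gb = params * 12 / 1e9
    return weights_gb, optimizer_gb, weights_gb + optimizer_gb

print(checkpoint_breakdown_gb(8e9))   # (16.0, 96.0, 112.0) GB vs. reported ~105 GB total
print(checkpoint_breakdown_gb(1e12))  # (2000.0, 12000.0, 14000.0) GB, i.e. ~14 TB vs. ~15 TB
```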

IBM's Blue Vela supercomputer.

One of the key highlights of our performance metrics is that we allocated only 4% of the Blue Vela cluster to run these benchmark workloads; the remaining 96% stayed fully utilized by a diverse set of AI research workloads, and we still achieved these results. The ability of IBM Storage Scale to balance resource distribution while delivering top-tier performance was a critical aspect of our design philosophy. By managing these workloads efficiently on a relatively small number of compute nodes, we illustrated the robust scalability of IBM Storage Scale.

Realistically, you would need closer to 512 to 1,024 GPUs to fully fine-tune Llama 3.1 1T efficiently since it is a dense model, and even more for pre-training. But with the recent advancements in mixture of experts, sparsity, and state space models, training large models is becoming feasible with fewer resources.

The implications of these findings extend beyond just performance metrics. They signal IBM's commitment to pushing the boundaries of storage performance tailored for the evolving demands of AI across various sectors. In numerous industries, from finance and health care to autonomous driving and natural language processing, the need for high-performance data storage solutions is escalating. By continuously innovating and optimizing our storage systems, we aim to empower organizations to leverage the potential of AI, enabling them to train more sophisticated models while maintaining high efficiency and performance levels.

Looking ahead, we will continue our pursuit of advancements in AI use cases and benchmarks. Our goal is simple: to provide our customers with the highest levels of performance and efficiency possible. Our ongoing investments in research and development will ensure that we remain at the forefront of this ever-evolving field, as we refine our solutions and expand our offerings to meet the diverse and changing needs of our clients.

Our latest AI cluster, built in partnership with CoreWeave and based on the NVIDIA GB200 Platform, was the largest entry on NVIDIA's newest platform in the MLPerf Training v5.0 benchmark suite, training Llama 3.1 405B to the standard quality target in less than 28 minutes back in May, before the cluster was even fully deployed. Once fully deployed, it will also include a considerably more powerful IBM Storage Scale System 6000 storage system.
