Publication
FAST 2025
Keynote

Insights Gained from Delivering Two Generations of AI Supercomputers and Storage Solutions in IBM Cloud

Abstract

AI supercomputers in public clouds are crucial to the fast, cost-effective creation and deployment of cutting-edge AI models. Demand for powerful cloud-native AI supercomputers is rising with the increasing prevalence of generative AI and foundation models. In these systems, many GPUs work together to train and optimize models and to serve many concurrent applications without disruption. Delivering performance, reliability, and adaptability across varied AI workloads requires a comprehensive solution that integrates hardware, software, and holistic telemetry, so that multiple AI workload types run efficiently and at high performance while the system stays resilient. In this talk, Dr. Seelam will discuss two generations of Vela cloud-native AI systems in IBM Cloud, which form the backbone of IBM's AI endeavors. He will explore the scaling, performance, and high-availability challenges encountered during their development and operation, along with the innovative solutions implemented to address them across compute, network, storage, and other relevant areas. He will also share insights gained from managing these systems with a cloud-native platform for more than two years. Finally, Dr. Seelam will offer his thoughts on future directions for harmonizing hardware and middleware in the design of AI systems.
