Make Kubernetes Networking Ready for World-Class AI and HPC Workloads
While the use of Kubernetes for various services is growing rapidly, it still lags behind in the world of HPC and AI clusters. Part of the reason is the lack of support for advanced features such as the multiple 100G networks available in HPC/AI systems. The vast majority of AI systems at hyperscalers such as IBM Cloud, AWS, Azure, and Oracle Cloud come with two to eight 100G network interfaces on their A100 GPU nodes. By default, however, a Kubernetes pod has only one network interface, while attaching multiple interfaces is often a requirement in these scenarios. Multus unlocks the potential of multi-networking in Kubernetes, but challenges remain in usability, manageability, and scalability. We present Multi-NIC CNI, a new open-source project that makes multiple-interface capability accessible to everyone. This CNI spares users from concerns about environment heterogeneity and from having to acquire CNI-specific knowledge. This talk will introduce the architecture, use cases, and performance of the CNI, then show how it benefits HPC/AI workloads. We will demonstrate the CNI on a large-scale GPU cluster consisting of over 1400 GPUs and two 100G network interfaces that we built in IBM Cloud.