Ilias Iliadis
International Journal On Advances In Networks And Services
Large Language Model (LLM) training workloads share computational characteristics with high-performance computing applications: intensive parallel processing, complex matrix operations, and distributed computation with frequent synchronization, all of which demand specialized hardware to deliver optimal performance.
This talk presents insights from Vela, a cloud-native system architecture introduced in 2021 for LLM training on commercial hardware and open-source software. The Vela architecture combines off-the-shelf hardware, Linux KVM virtualization with PCIe passthrough, and virtualized RDMA over Converged Ethernet (RoCE) networks. The system employs software-defined networking with SR-IOV technology for GPUDirect RDMA, achieving near-bare-metal performance while retaining the benefits of virtualization.
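To make the passthrough approach concrete, the fragment below is a minimal sketch of how a GPU and an SR-IOV virtual function of a RoCE-capable NIC are typically assigned to a KVM guest in a libvirt domain definition. The PCI addresses are placeholders for illustration, not values from the Vela deployment.

```xml
<!-- Hypothetical libvirt domain fragment (inside <devices>):
     PCIe passthrough of one GPU and one SR-IOV NIC virtual function.
     All PCI addresses below are placeholders. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- GPU at host address 0000:17:00.0 -->
    <address domain='0x0000' bus='0x17' slot='0x00' function='0x0'/>
  </source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- SR-IOV virtual function of the RoCE NIC at 0000:3b:02.1 -->
    <address domain='0x0000' bus='0x3b' slot='0x02' function='0x1'/>
  </source>
</hostdev>
```

With `managed='yes'`, libvirt detaches the devices from their host drivers and binds them to VFIO when the guest starts, so the guest sees the GPU and NIC VF as native PCIe devices, which is what enables GPUDirect RDMA paths at near-bare-metal speed.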
Based on multiple data center deployments and iterations, we present two case studies examining what it takes for virtualization-based systems to deliver performance comparable to (a) bare-metal RoCE and (b) bare-metal InfiniBand for LLM training workloads. The discussion focuses on the virtualization challenges, operational experiences, and runtime optimizations required to reach optimal performance in cloud-native training infrastructure.
Alessandro Pomponio
KubeCon + CloudNativeCon NA 2025
Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024
Vadim Elisseev, Robert Firth, et al.
SC 2025