IBM is proud to sponsor the PyTorch Conference 2025 – the world’s premier event dedicated to the framework powering today’s most groundbreaking AI innovations. Connect with AI pioneers, researchers, developers, and startup founders through deep-dive technical sessions, panels, and workshops covering AI from bare metal all the way up to the application and agent layers. Our program features keynotes from visionary AI leaders, interactive sessions on scaling and benchmarking models, and special tracks focusing on AI safety and ethical development.
Whether you’re an experienced ML engineer, researcher, or developer, PyTorch Conference 2025 is your gateway to the future of AI. Join the community that’s creating the AI revolution, not just witnessing it.
Developers:
Join informal discussions, provide feedback, and uncover opportunities to collaborate.
Mert Toslali & Yu Chin Fabian Lim, IBM Research
Training LLMs with online RL methods like GRPO presents a unique challenge: inference is required at every training step. In the standard Hugging Face TRL setup, inference is handled by vLLM running as a separate server on dedicated GPUs, communicating via HTTP. This creates a “ping-pong” inefficiency—training GPUs wait during generation, and inference GPUs wait during training—leading to poor GPU utilization and high cost.
Our talk introduces co-located vLLM, a key optimization that enables training and inference to run on the same GPUs. Built on vLLM’s external_launcher, it allows in-process, torch-compatible execution. We contributed a now-merged PR to TRL that eliminates the need for HTTP calls or separate servers. Our setup supports torchrun and tensor/data parallelism (TP/DP), and scales to training large models (such as 72B-parameter models). It improves training throughput by up to 1.7×, reduces the number of GPUs needed, and is now part of the official TRL repo.
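As a rough illustration of the workflow described above, here is a minimal sketch of a GRPO run that generates rollouts with co-located vLLM. It assumes a recent TRL release in which GRPOConfig exposes use_vllm and a colocate mode (parameter names may differ across versions); the model, dataset, and reward function are placeholders.

    # Minimal GRPO sketch with co-located vLLM rollouts via TRL.
    # Assumes a recent TRL version exposing `use_vllm` and `vllm_mode="colocate"`.
    from datasets import load_dataset
    from trl import GRPOConfig, GRPOTrainer

    dataset = load_dataset("trl-lib/tldr", split="train")

    def reward_len(completions, **kwargs):
        # Toy reward: prefer completions close to 200 characters.
        return [-abs(200 - len(c)) for c in completions]

    config = GRPOConfig(
        output_dir="grpo-colocate-demo",
        use_vllm=True,         # generate rollouts with vLLM instead of HF generate
        vllm_mode="colocate",  # run vLLM in-process on the training GPUs (no HTTP server)
    )

    trainer = GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
        reward_funcs=reward_len,
        args=config,
        train_dataset=dataset,
    )
    trainer.train()

Launched with torchrun, the same script shares each GPU between policy updates and vLLM generation, which is what removes the “ping-pong” idle time described above.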
Andrea Frittoli, IBM
Session: Poster Presentations - Responsible AI & Community
Cong Liu, Google; Carlos Costa, IBM
Session: Poster Presentations - Generative & Large Models
Maroon Ayoub, IBM & Tyler Michael Smith, Red Hat
Session: Poster Presentations - Generative & Large Models
Martin Hickey, IBM & Junchen Jiang, University of Chicago
Session: Poster Presentations - Generative & Large Models
Mehant Kammakomati, IBM Research; Amal Joe R S, IIT Bombay
Session: Poster Presentations - Generative & Large Models
Yidi Wu, Meta & Thomas Ortner, IBM Research Europe
Session: Poster Presentations - PyTorch Core
Sahdev Zala, IBM
Session: Poster Presentations - PyTorch Core
Maroon Ayoub, IBM Research & Cong Liu, Google
As PyTorch-based LLMs scale in complexity and user concurrency, their inference demands diverge across stages. Prefill is compute-heavy; decode is latency-sensitive. In this talk, we introduce a disaggregated serving pattern for PyTorch LLMs using llm-d—a Kubernetes-native, open-source framework co-developed by IBM Research, Google, and Red Hat. We'll walk through how llm-d separates prefill and decode into orchestrated sidecars, improving GPU utilization and QoS alignment. You'll learn how the Gateway API Inference Extension (GIE) enables routing based on load, cache locality, and session affinity. The talk includes real-world benchmarks and a visual demo of llm-d serving PyTorch models with vLLM across heterogeneous hardware on Kubernetes.
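For orientation, a minimal client-side sketch is shown below, assuming the llm-d gateway fronts vLLM workers with the usual OpenAI-compatible endpoint. The gateway address and model name are placeholders; routing between prefill and decode workers happens entirely behind the gateway.

    # Hypothetical client call against an llm-d deployment on Kubernetes.
    # The base_url and model ID are placeholders, not part of llm-d itself.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://llm-d-gateway.example.svc.cluster.local/v1",  # hypothetical gateway address
        api_key="not-needed-for-local-deployments",
    )

    # The gateway decides which prefill/decode workers serve the request;
    # the client only speaks the standard chat-completions protocol.
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # any model served by the deployment
        messages=[{"role": "user", "content": "Summarize disaggregated LLM serving in one sentence."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)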
Burkhard Ringlein, IBM
Today, vLLM (now part of the PyTorch ecosystem) is the de facto industry standard for serving large language models. vLLM is increasingly being adopted in production and runs on NVIDIA GPUs, AMD GPUs, and custom accelerators such as AWS Inferentia.
However, for much of its history, vLLM’s state-of-the-art performance has largely depended on a number of hand-written CUDA or HIP kernels. These kernels are typically carefully optimized for a specific GPU platform and pose a serious obstacle to the portability of vLLM across different hardware.
Leveraging OpenAI Triton, we introduced a Triton backend to vLLM that delivers state-of-the-art performance across GPU platforms from a single code base, without any hand-written CUDA or HIP kernels.
In this talk, we will present our recent advances that achieve state-of-the-art performance on both NVIDIA and AMD GPUs with a single Triton-only code base. We will cover the engineering and science behind this backend, including autotuning for different platforms, system aspects such as the launch overhead of Triton’s just-in-time compiler, and various kernel optimizations.
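To give a flavor of the approach, the toy Triton kernel below (illustrative only, not taken from vLLM) shows the portability property the talk builds on: the same Python-embedded kernel source is JIT-compiled for either NVIDIA or AMD GPUs, so no CUDA or HIP code is written by hand.

    # Toy Triton kernel: one source compiles for NVIDIA (CUDA) and AMD (ROCm) GPUs.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                       # one program instance per block
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                       # guard against out-of-bounds access
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

    if __name__ == "__main__":
        a = torch.randn(4096, device="cuda")              # "cuda" also maps to ROCm on AMD builds of PyTorch
        b = torch.randn(4096, device="cuda")
        assert torch.allclose(add(a, b), a + b)

The production attention and quantization kernels in a Triton-only backend are far more involved, but they follow the same pattern: platform-specific tuning is handled by autotuning configurations rather than separate hand-written kernel sources.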