IBM at PyTorch 2025

  • San Francisco, CA, USA
This event has ended.

About

IBM is proud to sponsor PyTorch Conference 2025 – the world’s premier event dedicated to the framework powering today’s most groundbreaking AI innovations. Connect with AI pioneers, researchers, developers, and startup founders through deep-dive technical sessions, panels, and workshops covering AI from bare metal all the way up to the application and agent layers. Our program features keynotes from visionary AI leaders, interactive sessions on scaling and benchmarking models, and special tracks focused on AI safety and ethical development.

Whether you’re an experienced ML engineer, researcher, or developer, PyTorch Conference 2025 is your gateway to the future of AI. Join the community that’s creating the AI revolution, not just witnessing it.


Why attend

Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.


What's Next?

Join us at IBM Z Day on Nov 12, 2025, 8 AM to 5 PM (ET) – a free, one-day virtual enterprise computing conference for anyone and everyone! Hear the latest about IBM Z and LinuxONE, and join our lineup of global thought leaders who will highlight industry trends and innovation spanning AI, Hybrid Cloud, Quantum-Safe Security, and more.


Agenda

  • Description:

    Speaker: Christian Jacobi, IBM Fellow

    From fraud detection to core banking, AI is reshaping mission-critical systems—see how PyTorch and IBM’s Spyre accelerator bring dataflow to the enterprise.

    AI at enterprise scale isn’t just about building bigger models—it’s about running them with the reliability, security, and performance that mission-critical workloads demand. IBM Fellow Christian Jacobi will share how IBM Z, LinuxONE, Power, and Storage systems are bringing AI directly into business operations, powering everything from fraud detection to RAG pipelines. He will also highlight the Spyre Accelerator—a scalable PCIe card for AI expansion—and show how its integration with PyTorch is enabling the development of secure, efficient, and resilient AI systems at scale.

    Speakers:
    Christian Jacobi, IBM Fellow and CTO, IBM Systems Development
  • Description:

    Visit us at the IBM booth in the exhibit hall to talk to our researchers and see demos of our work.

  • Description:

    In this session, we will share our journey with TorchTitan over the past year and a half, starting from early 2024. During this journey, we went from using TorchTitan as a secondary codebase solely for throughput benchmarking to leveraging it for several internal production trainings; from being an end user to becoming an active contributor within the TorchTitan community.

    Our story will cover why we adopted TorchTitan in our production trainings, what we've accomplished with it, and what lies ahead. Highlights include training an in-house 70B model earlier this year that matches the performance of the LLaMA 3 family - while requiring significantly fewer GPU hours - thanks to the latest features such as FP8 training. We'll also discuss our current work with TorchTitan, including our ongoing MoE training enabled by integrating our fast MoE kernel into TorchTitan, as well as exploring additional MoE kernels with FP8 row-wise and MXFP8, which are currently being developed within the TorchTitan community.

    We’ll also share key lessons learned along the way and explain why we think this is a great community for everyone to explore and contribute to.

    Speaker(s): Linsong Chu & Garrett Goot

  • Description:

    Join an informal discussion, provide feedback, and uncover opportunities to collaborate. Developers:

    • Andrea Frittoli: Open Source Developer Advocate, IBM (CI/CD)
    • Thanh Ha: Engineer, Linux Foundation (CI/CD infrastructure)
    • Jordan Conway: Engineer, Linux Foundation (CI/CD infrastructure)
    • Zhe Zhang: Distinguished Engineer, NVIDIA (CI/CD infrastructure)
    • Eli Uregas: Leads CI/CD infrastructure efforts in collaboration with the PyTorch Foundation
    • Andrey Talman: Engineer (PyTorch releases)
    • Nikita Shulga: Core PyTorch OSS developer and domain expert
    • Anita Katahoire: Technical Program Manager leading PyTorch release activities
    • Yang Wang: Engineer (benchmarking, monitoring)
    • Armen Donigian: Engineering Manager
  • Description:

    Mert Toslali & Yu Chin Fabian Lim, IBM Research 

    Training LLMs with online RL methods like GRPO presents a unique challenge: inference is required at every training step. In the standard Hugging Face TRL setup, inference is handled by vLLM running as a separate server on dedicated GPUs, communicating via HTTP. This creates a “ping-pong” inefficiency—training GPUs wait during generation, and inference GPUs wait during training—leading to poor GPU utilization and high cost.

    Our talk introduces co-located vLLM, a key optimization that enables training and inference to run on the same GPUs. Built on vLLM’s external_launcher, it allows in-process, torch-compatible execution. We contributed a now-merged PR to TRL that eliminates the need for HTTP calls or separate servers. Our setup supports torchrun and TP/DP, and scales to training large models (such as 72B). It improves training throughput by up to 1.7×, reduces the number of GPUs needed, and is now part of the official TRL repo.
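    As a rough illustration of the co-located mode described above: recent TRL releases expose it through the `vllm_mode` setting on `GRPOConfig`. This is a hedged sketch, not a definitive recipe - exact argument names and defaults may differ across TRL versions.

    ```python
    # Sketch: configuring TRL's GRPO trainer to run vLLM co-located with
    # training, rather than as a separate HTTP server on dedicated GPUs.
    # Assumes a TRL release that includes the co-located vLLM mode.
    from trl import GRPOConfig

    config = GRPOConfig(
        output_dir="grpo-colocated",
        use_vllm=True,                    # generate rollouts with vLLM
        vllm_mode="colocate",             # in-process vLLM on the training GPUs
                                          # ("server" would call a standalone vLLM server)
        vllm_gpu_memory_utilization=0.3,  # leave headroom for training tensors
    )
    ```

    Launched with torchrun, each rank then hosts its own in-process vLLM engine (via vLLM's external_launcher executor), so rollout generation and gradient updates share the same GPUs instead of ping-ponging between two partly idle pools.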

  • Description:

    Martin Hickey, IBM & Junchen Jiang, University of Chicago

    Session: Poster Presentations - Generative & Large Models

  • Description:

    Sahdev Zala, IBM

    Session: Poster Presentations - PyTorch Core

  • Description:

    Routing Stateful AI Workloads in Kubernetes: Optimizing PyTorch LLM Inference - Maroon Ayoub, IBM

    Session: Poster Presentations - Generative & Large Models | Exhibit Hall

  • Description:

    Mehant Kammakomati, IBM Research; Amal Joe R S, IIT Bombay

    Session: Poster Presentations - Generative & Large Models

    Authors:
    Amal Joe R S, IBM
    Romit Jain, IBM
  • Description:

    Maroon Ayoub, IBM & Tyler Michael Smith, Red Hat

    Session: Poster Presentations - Generative & Large Models

  • Description:

    Cong Liu, Google; Carlos Costa, IBM

    Session: Poster Presentations - Generative & Large Models

  • Description:

    Andrea Frittoli, IBM

    Session: Poster Presentations - Responsible AI & Community

  • Description:

    Yidi Wu, Meta & Thomas Ortner, IBM Research Europe

    Session: Poster Presentations - PyTorch Core

More events