Marcelo Amaral, Tatsuhiro Chiba, et al.
CLOUD 2022
A long-standing goal in both industry and academia is an LLM inference platform that is portable across hardware architectures, eliminates the need for low-level hand-tuning, and still delivers best-in-class efficiency. In this work, we demonstrate that portable, efficient cross-platform LLM inference is indeed possible and share our experience. We develop a paged attention kernel, the core performance-critical component of many LLM deployments, that builds exclusively on the domain-specific just-in-time compiled language Triton and achieves state-of-the-art performance on both NVIDIA and AMD GPUs. We integrated our work as the so-called "Triton Backend" into vLLM, the de facto standard engine for LLM inference, where it became the default for AMD deployments. We describe our high-level approach, the key algorithmic and system-level improvements, and the parameter auto-tuning required to unlock efficiency, as well as the vLLM-specific and cross-platform changes necessary to raise the performance of a generic Triton attention kernel from 19.7% of the state of the art to 100.7%. Our results highlight how open-source domain-specific languages can be leveraged to unlock model portability across different GPU vendors.
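Paged attention splits the key/value cache into fixed-size pages that a per-sequence block table maps to physical memory, so the kernel gathers each sequence's context page by page while accumulating a numerically stable online softmax. What follows is a minimal illustrative Triton sketch of that idea for the single-query decode step; it is not the paper's kernel, and every name, layout, and parameter here (paged_attention_kernel, PAGE_SIZE, the block-table and cache shapes) is an assumption made for illustration.

import triton
import triton.language as tl

@triton.jit
def paged_attention_kernel(
    out_ptr,          # [num_seqs, num_heads, HEAD_DIM] float32 attention output
    q_ptr,            # [num_seqs, num_heads, HEAD_DIM] one query token per sequence
    k_cache_ptr,      # [num_pages, PAGE_SIZE, num_heads, HEAD_DIM] paged key cache
    v_cache_ptr,      # [num_pages, PAGE_SIZE, num_heads, HEAD_DIM] paged value cache
    block_table_ptr,  # [num_seqs, max_pages] logical page -> physical page index
    seq_lens_ptr,     # [num_seqs] current context length of each sequence
    num_heads, max_pages, scale,
    PAGE_SIZE: tl.constexpr,
    HEAD_DIM: tl.constexpr,
):
    # One program instance per (sequence, head) pair.
    seq = tl.program_id(0)
    head = tl.program_id(1)

    d = tl.arange(0, HEAD_DIM)
    q = tl.load(q_ptr + (seq * num_heads + head) * HEAD_DIM + d)

    seq_len = tl.load(seq_lens_ptr + seq)
    num_used_pages = (seq_len + PAGE_SIZE - 1) // PAGE_SIZE

    # Online-softmax state: running max, running normalizer, weighted V sum.
    m = tl.full((1,), float("-inf"), dtype=tl.float32)
    l = tl.zeros((1,), dtype=tl.float32)
    acc = tl.zeros((HEAD_DIM,), dtype=tl.float32)

    for p in range(0, num_used_pages):
        page = tl.load(block_table_ptr + seq * max_pages + p)
        offs = tl.arange(0, PAGE_SIZE)
        valid = (p * PAGE_SIZE + offs) < seq_len  # mask the ragged last page

        # Gather one physical page of K and V for this head: [PAGE_SIZE, HEAD_DIM].
        kv_off = ((page * PAGE_SIZE + offs[:, None]) * num_heads + head) * HEAD_DIM + d[None, :]
        k = tl.load(k_cache_ptr + kv_off, mask=valid[:, None], other=0.0)
        v = tl.load(v_cache_ptr + kv_off, mask=valid[:, None], other=0.0)

        # Scaled dot-product logits for the tokens stored on this page.
        logits = tl.sum(q[None, :].to(tl.float32) * k.to(tl.float32), axis=1) * scale
        logits = tl.where(valid, logits, float("-inf"))

        # Numerically stable online-softmax update across pages.
        m_new = tl.maximum(m, tl.max(logits, axis=0))
        alpha = tl.exp(m - m_new)
        w = tl.exp(logits - m_new)
        l = l * alpha + tl.sum(w, axis=0)
        acc = acc * alpha + tl.sum(w[:, None] * v.to(tl.float32), axis=0)
        m = m_new

    tl.store(out_ptr + (seq * num_heads + head) * HEAD_DIM + d, acc / l)

A host-side launch over a (num_seqs, num_heads) grid then produces one output token per sequence. In practice, page size, block shapes, warp counts, and memory layout all need per-platform tuning, which is where the parameter auto-tuning described in the abstract comes in.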
Pranjal Gupta, Karan Bhukar, et al.
ICPE 2025
Abhishek Malvankar, Olivier Tardieu
KubeCon EU 2024
Darya Kaviani, Sijun Tan, et al.
RWC 2025