Software: Practice and Experience

Fast matrix multiplication via compiler-only layered data reorganization and intrinsic lowering


Abstract

The resurgence of machine learning has increased the demand for high-performance basic linear algebra subroutines (BLAS), which have long depended on libraries to achieve peak performance on commodity hardware. High-performance BLAS implementations rely on a layered approach consisting of tiling and packing layers, which (re)organize data, and micro kernels, which perform the actual computations. The algorithm for the tiling and packing layers is target independent but is parameterized by the memory hierarchy and register-file size. Creating high-performance micro kernels requires significant development effort to write tailored assembly code for each architecture. This hand-optimization task is further complicated by the recent introduction of matrix engines, such as IBM's POWER10 Matrix Multiply Assist (MMA), Intel's Advanced Matrix eXtensions (AMX), and Arm's Matrix Extensions (ME), that deliver high-performance matrix operations. This article presents a compiler-only alternative to the use of high-performance libraries by incorporating, to the best of our knowledge for the first time, automatic generation of the layered approach into LLVM, a production compiler. The modular design of the algorithm, such as the use of LLVM's matrix-multiply intrinsic as a clear interface between the tiling and packing layers and the micro kernel, makes it easy to retarget the code generation to multiple accelerators. The parameterization of the tiling and packing layers is demonstrated by generating code for the MMA unit in IBM's POWER10. This article also describes an algorithm that lowers the matrix-multiply intrinsic to the MMA unit. The use of intrinsics enables a comprehensive performance study. On processors without hardware matrix engines, tiling and packing alone outperform PLuTo, a widely used polyhedral optimizer, by up to [value omitted] (Intel) for small matrices and more than [value omitted] (POWER9) for large matrices. The performance also approaches that of high-performance libraries: it is only [value omitted] slower than OpenBLAS and on par with Eigen for large matrices. With MMA in POWER10, this solution is, for large matrices, over [value omitted] faster than the vector-extension solution, matches Eigen's performance, and achieves up to [value omitted] of BLAS peak performance.
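To make the layered approach concrete, the following is a minimal C sketch of the structure the abstract describes: a target-independent tiling layer that walks the iteration space in blocks, packing routines that copy each block into contiguous buffers, and a micro kernel that consumes only packed data. All names (`gemm_layered`, `pack_a`, `pack_b`, `micro_kernel`) and the block sizes `MC`/`KC`/`NC` are illustrative assumptions, not the paper's actual code; in the article the block sizes are derived from the target's cache and register-file parameters, and the micro kernel is replaced by LLVM's matrix-multiply intrinsic, later lowered to a hardware engine such as MMA.

```c
#include <stddef.h>

/* Hypothetical block sizes; the compiler pass described in the article
   would derive these from the target's memory hierarchy instead. */
enum { MC = 4, KC = 4, NC = 4 };

/* Packing layer: copy an mc x kc block of row-major A (leading dim lda)
   into a contiguous row-major buffer. */
static void pack_a(const double *A, size_t lda, size_t mc, size_t kc,
                   double *buf) {
    for (size_t i = 0; i < mc; ++i)
        for (size_t p = 0; p < kc; ++p)
            buf[i * kc + p] = A[i * lda + p];
}

/* Packing layer: copy a kc x nc block of row-major B (leading dim ldb)
   into a contiguous column-major buffer, so the micro kernel streams
   both operands with unit stride. */
static void pack_b(const double *B, size_t ldb, size_t kc, size_t nc,
                   double *buf) {
    for (size_t j = 0; j < nc; ++j)
        for (size_t p = 0; p < kc; ++p)
            buf[j * kc + p] = B[p * ldb + j];
}

/* Micro kernel: C += Ap * Bp on packed buffers. This is the piece that
   the article expresses with LLVM's matrix-multiply intrinsic and then
   lowers to a matrix engine. */
static void micro_kernel(const double *Ap, const double *Bp,
                         size_t mc, size_t nc, size_t kc,
                         double *C, size_t ldc) {
    for (size_t i = 0; i < mc; ++i)
        for (size_t j = 0; j < nc; ++j) {
            double acc = 0.0;
            for (size_t p = 0; p < kc; ++p)
                acc += Ap[i * kc + p] * Bp[j * kc + p];
            C[i * ldc + j] += acc;
        }
}

/* Tiling layer: walk C (m x n) = A (m x k) * B (k x n), all row-major,
   in MC x NC x KC blocks. C must be zero-initialized by the caller. */
void gemm_layered(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C) {
    double Ap[MC * KC], Bp[KC * NC];
    for (size_t jc = 0; jc < n; jc += NC) {
        size_t nc = n - jc < NC ? n - jc : NC;
        for (size_t pc = 0; pc < k; pc += KC) {
            size_t kc = k - pc < KC ? k - pc : KC;
            pack_b(&B[pc * n + jc], n, kc, nc, Bp);
            for (size_t ic = 0; ic < m; ic += MC) {
                size_t mc = m - ic < MC ? m - ic : MC;
                pack_a(&A[ic * k + pc], k, mc, kc, Ap);
                micro_kernel(Ap, Bp, mc, nc, kc, &C[ic * n + jc], n);
            }
        }
    }
}
```

The key design point mirrored here is the clean interface between the layers: the tiling and packing code never depends on how the micro kernel is implemented, which is what allows a single target-independent pass to retarget the kernel to different vector extensions or matrix engines.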