Modeling Matrix Engines for Portability and Performance
Matrix engines, also known as matrix-multiplication accel-erators, capable of computing on 2D matrices of various data types are traditionally found only on GPUs. However, they are increasingly being introduced into CPU architectures to support AI/ML computations. Unlike traditional SIMD functional units, these accelerators require both the input and output data to be packed into a specific 2D-data layout that is often dependent on the input and output data types. Due to the large variety of supported data types and architectures, a common abstraction is required to unify these seemingly disparate accelerators and more efficiently produce high-performance code. In this paper, we show that the hardware characteristics of a vast array of different matrix engines can be unified using a single analytical model that casts matrix engines as an accumulation of multiple outer-products (also known as rank-k updates). This allows us to easily and quickly develop high-performance kernels using matrix engines for different architectures. We demonstrate our matrix engine model and its portability by applying it to two distinct architectures. Using our model, we show that high-performance computational kernels and packing routines required for high-performance dense linear algebra libraries can be easily designed. Furthermore, we show that the performance attained by our implementations is around 90-99 % (80-95 % on large problems) of the theoretical peak throughput of the matrix engines.