Analyzing the energy-efficiency of sparse matrix multiplication on heterogeneous systems: A comparative study of GPU, Xeon Phi and FPGA
Hardware accelerators have evolved as the most prominent vehicle to meet the demanding performance and energy-efficiency constraints of modern computer systems. The prevalent type of hardware accelerators in the high-performance computing domain are PCIe attached co-processors to which the CPU can offload compute intensive tasks. In this paper, we analyze the performance, power, and energy-efficiency of such accelerators for sparse matrix multiplication kernels. Improving the efficiency for sparse matrix operations is of eminent importance since they work at the core of graph analytics algorithms which are in turn key to many big data knowledge discovery workloads. Our study involves GPU, Xeon Phi, and FPGA co-processors to embrace the vast majority of hardware accelerators applied in modern HPC systems. In order to compare the devices on the same level of implementation quality we apply vendor optimized libraries for which published results exist. From our experiments we deduce that none of the compared devices generally dominates in terms of energy-efficiency and that the optimal solutions depends on the actual sparse matrix data, data transfer requirements and on the applied efficiency metric. We also show that a combined use of multiple accelerators can further improve the system's performance and efficiency by up to 11% and 18%, respectively.