Blocking optimization techniques for sparse tensor computation
We present a detailed analysis of the sparse matricized tensor times Khatri-Rao product (MTTKRP) kernel that is the key bottleneck in various sparse tensor computations. By using the well-known roofline model and carefully instrumenting the state-of-The-Art MTTKRP code with the pressure point analysis technique, we show that the performance of MTTKRP is highly sensitive to data traffic like other sparse computations. We also identify key performance bottlenecks that diverge from our prior knowledge of sparse MTTKRP on modern processors. We propose to use blocking optimization techniques to address the bottlenecks identified within the MTTKRP computation. By using a combination of two blocking techniques, we achieve more than 3.5x and 2.0x speedup over the stateof-The-Art MTTKRP implementation in the SPLATT library on four real-world data sets and two synthetically generated data sets, respectively. We also employ a new data partitioning technique for the distributed MTTKRP implementation, which provides an additional dimension of scalability. As a result, our implementation exhibits good strong scaling and a 1.6x speedup on 64 nodes for two real-world data sets, when compared to the SPLATT library.