A memory heterogeneity-aware runtime system for bandwidth-sensitive HPC applications
Today's supercomputers are moving towards many-core processors such as the Intel Xeon Phi Knights Landing (KNL) to deliver high compute and memory capacity. Applications that run on such many-core platforms with improved vectorization require high memory bandwidth. To improve performance, architectures like Knights Landing include a high-bandwidth, low-capacity in-package high bandwidth memory (HBM) in addition to the high-capacity, low-bandwidth DDR4. Other architectures, such as Nvidia's Pascal GPU, expose similar stacked DRAM. On nodes with heterogeneous memory types, efficient allocation and data movement can improve performance and save energy in future systems, provided that all data requests are served from the high bandwidth memory. In this paper, we propose a memory heterogeneity-aware runtime system that guides data prefetch and eviction so that data can be accessed at high bandwidth even for applications whose working set does not fit entirely within the high bandwidth memory and whose data must therefore be moved between memory types. We implement a data movement mechanism, managed by the runtime system, that allows applications to run efficiently on architectures with a heterogeneous memory hierarchy with only trivial code changes. We show up to 2x improvement in execution time for Stencil3D and Matrix Multiplication, which are important HPC kernels.
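
To make the idea of runtime-guided prefetch and eviction concrete, the following is a minimal toy sketch and not the implementation described in this paper. It models the high bandwidth memory as a fixed-capacity pool and, before each task runs, brings that task's data blocks into the pool, evicting blocks the task does not need. The names `FastMemoryPool` and `run_tasks`, the least-recently-used eviction order, and the block sizes are all assumptions made for illustration.

```python
from collections import OrderedDict


class FastMemoryPool:
    """Toy model of a small high-bandwidth memory managed by a runtime.

    Hypothetical illustration only: no real data is moved; the pool just
    tracks which named blocks are resident and logs each action.
    """

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        # Block name -> size, ordered from least to most recently used.
        self.resident = OrderedDict()
        self.log = []

    def _evict_until_fits(self, size, pinned):
        # Evict least recently used blocks that the upcoming task does
        # not need (assumed policy) until the new block fits.
        for name in list(self.resident):
            if self.used + size <= self.capacity:
                break
            if name in pinned:
                continue
            self.used -= self.resident.pop(name)
            self.log.append(("evict", name))
        if self.used + size > self.capacity:
            raise MemoryError("task working set exceeds fast memory capacity")

    def prefetch(self, name, size, pinned):
        if name in self.resident:
            # Already in fast memory: only refresh its recency.
            self.resident.move_to_end(name)
            return
        self._evict_until_fits(size, pinned)
        self.resident[name] = size
        self.used += size
        self.log.append(("prefetch", name))


def run_tasks(tasks, block_sizes, capacity_bytes):
    """Prefetch each task's blocks into the pool, then 'run' the task."""
    pool = FastMemoryPool(capacity_bytes)
    for task_name, blocks in tasks:
        pinned = set(blocks)  # blocks the task needs must stay resident
        for block in blocks:
            pool.prefetch(block, block_sizes[block], pinned)
        pool.log.append(("run", task_name))
    return pool.log
```

For example, with three blocks of 4 units each and a pool of 8 units, calling `run_tasks([("t0", ["A", "B"]), ("t1", ["B", "C"])], {"A": 4, "B": 4, "C": 4}, 8)` keeps the shared block `B` resident and evicts only `A` to make room for `C`. A real runtime would additionally overlap these transfers with computation; the sketch omits that.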