Leveraging NVLINK and asynchronous data transfer to scale beyond the memory capacity of GPUs
In this paper we demonstrate the utility of fast GPU to CPU interconnects to weak scale on hierarchical nodes without being limited to problem sizes that fit only in the GPU memory capacity. We show the speedup possible for a new regime of algorithms which traditionally have not benefited from being ported to GPUs because of an insufficient amount of computational work relative to bytes of data that must be transferred (offload intensity). This new capability is demonstrated with an example of our hierarchical GPU port of UMT, the 51K line CORAL benchmark application for Lawrence Livermore National Lab's radiation transport code. By overlapping data transfers and using the NVLINK connection between IBM POWER 8 CPUs and NVIDIA P100 GPUs, we demonstrate a speedup that continues even when scaling the problem size well beyond the memory capacity of the GPUs. Scaling to large local domains per MPI process is a necessary step to solving very large problems, and in the case of UMT, large local domains improve the convergence as the number of MPI ranks are weak scaled.