Publication
IPDPSW 2011
Conference paper

Optimizing MPI collectives using efficient intra-node communication techniques over the Blue Gene/P supercomputer


Abstract

The Blue Gene/P (BG/P) supercomputer consists of thousands of compute nodes interconnected by multiple networks. Of these, a 3D torus equipped with a direct memory access (DMA) engine is the primary network. BG/P also features a collective network that supports hardware-accelerated collective operations such as broadcast and allreduce. One of the operating modes on BG/P is virtual node mode, in which all four cores of a node run as active MPI tasks, performing both inter-node and intra-node communication. This paper proposes software techniques to enhance the MPI collective communication primitives MPI_Bcast and MPI_Allreduce in virtual node mode by using the cache-coherent memory subsystem as the communication medium within the node. The paper describes techniques that leverage atomic operations to design concurrent data structures, such as broadcast-FIFOs, for efficient collectives. Such mechanisms are important because core counts are expected to rise, and these data structures make programming both easier and more efficient. We also demonstrate the utility of shared address space techniques for MPI collectives, wherein a process can access a peer's memory through specialized system calls. Apart from reducing copy costs, such techniques allow for seamless integration of network protocols with intra-node communication methods. We propose intra-node extensions to multi-color network algorithms for collectives using lightweight synchronization structures and atomic operations. Further, we demonstrate that shared address space techniques allow for good load balancing and are critical for efficiently using the hardware collective network on BG/P. Compared to current approaches on the 3D torus, our optimizations improve performance by up to nearly 3x for MPI_Bcast and by 33% for MPI_Allreduce in virtual node mode. We also see improvements of up to 44% for MPI_Bcast using the collective tree network. © 2011 IEEE.
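
To make the broadcast-FIFO idea concrete, the sketch below shows a minimal single-producer, multi-consumer ring built with C11 atomics, in which the root core of a node publishes chunks and the other cores copy them out. The type and function names (bfifo_t, bfifo_broadcast, bfifo_receive), the slot and chunk sizes, and the busy-wait synchronization are illustrative assumptions, not the paper's implementation; the mapping of the structure into the shared memory of all four processes on a node is also omitted.

/* Minimal sketch of a shared-memory broadcast-FIFO using C11 atomics.
 * Names, sizes, and the polling strategy are illustrative assumptions;
 * the shared region holding bfifo_t is assumed to be zero-initialized
 * and mapped into every process on the node. */
#include <stdatomic.h>
#include <string.h>

#define BFIFO_SLOTS 16     /* ring slots (power of two)        */
#define BFIFO_CHUNK 4096   /* bytes broadcast per slot         */

typedef struct {
    char        data[BFIFO_CHUNK];
    atomic_uint readers_done;      /* consumers that copied this slot */
} bfifo_slot_t;

typedef struct {
    bfifo_slot_t slots[BFIFO_SLOTS];
    atomic_ulong head;             /* slots published by the root     */
    unsigned     nconsumers;       /* non-root cores on the node      */
} bfifo_t;

/* Root core: publish one chunk (len <= BFIFO_CHUNK). */
static void bfifo_broadcast(bfifo_t *f, const void *buf, size_t len)
{
    unsigned long h = atomic_load_explicit(&f->head, memory_order_relaxed);
    bfifo_slot_t *s = &f->slots[h % BFIFO_SLOTS];
    unsigned expected = (h < BFIFO_SLOTS) ? 0u : f->nconsumers;

    /* Reuse the slot only after every consumer has copied its previous
     * contents (first lap: slot is untouched, counter is still zero). */
    while (atomic_load_explicit(&s->readers_done, memory_order_acquire) != expected)
        ;  /* spin; a real implementation would back off */

    memcpy(s->data, buf, len);
    atomic_store_explicit(&s->readers_done, 0u, memory_order_relaxed);

    /* Release store: consumers that observe the new head also see the data. */
    atomic_store_explicit(&f->head, h + 1, memory_order_release);
}

/* Non-root core: copy out the next chunk; *next counts slots consumed so far. */
static void bfifo_receive(bfifo_t *f, unsigned long *next, void *buf, size_t len)
{
    bfifo_slot_t *s = &f->slots[*next % BFIFO_SLOTS];

    /* Wait until the root has published slot number *next. */
    while (atomic_load_explicit(&f->head, memory_order_acquire) <= *next)
        ;  /* spin */

    memcpy(buf, s->data, len);

    /* Signal the root that this consumer is done with the slot. */
    atomic_fetch_add_explicit(&s->readers_done, 1u, memory_order_release);
    (*next)++;
}

In an actual collective, the root would publish a large message as a pipeline of such chunks so that intra-node copies overlap with progress on the torus or collective network, which is the kind of integration of network protocols with intra-node communication the abstract refers to.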