About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
SBAC-PAD 2012
Conference paper
Network endpoints for clusters of SMPs
Abstract
Modern large scale parallel machines feature an increasingly deep hierarchy of interconnections. Individual processing cores employ simultaneous multithreading (SMT) to better exploit functional units, multiple coherent processors are collocated in a node to better exploit links to cache, memory and network (SMP), and multiple nodes are interconnected by specialized low latency/high speed networks. Current trends indicate ever wider SMP nodes in the future. To service these nodes, modern high performance network devices (including Infiniband and all of IBM's recent offerings) offer the ability to sub-divide the network devices' resources among the processing threads. System software, however, lags in exploiting these capabilities, leaving users of e.g., MPI[14], UPC[19] in a bind, requiring complex and fragile workarounds in user programs. In this paper we discuss our implementation of endpoints, the software paradigm central to the IBM PAMI messaging library [3]. A PAMI endpoint is an expression in software of a slice of the network device. System software can service endpoints without serializing the many threads on an SMP by forcing them through a critical section. In the paper we describe the basic guarantees offered by PAMI to the programmer, and how these can be used to enable efficient implementations of high level libraries and programming languages like UPC. We evaluate the efficiency of our implementation on a novel P7IHsystem with up to 4096 cores, running micro benchmarks designed to find performance deficiencies in the endpoints implementation of both point-to-point and collective functions. © 2012 IEEE.