Publication
SC 2013
Conference paper
Maximizing the performance of irregular applications on multithreaded, NUMA systems
Abstract
In modern shared-memory systems, the communication latency and available resources for a group of logical processors are determined by their relative position in the hierarchy of chips, cores, and hardware threads. Thus, the performance of multithreaded applications varies with the mapping of software threads to logical processors. In our study, we observe huge variation in application performance under different mappings. Moreover, applications with irregular access patterns perform poorly under the default mapping. We maximize application performance by balancing communication overhead and available resources. On NUMA systems, when the logical processors span multiple chips, remote access overhead dominates the execution time of irregular applications and cannot be reduced by mapping alone. In addition to new data replication and distribution optimizations, we improve geographical locality by matching the access pattern to the data layout. We further propose a locality-centric optimization that simultaneously reduces remote accesses and improves cache performance. Our approach achieves better performance than prior NUMA-specific techniques. © 2013 ACM.
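To make the idea of a thread-to-processor mapping concrete, the following is a minimal sketch and not the paper's method. It models a machine as a hierarchy of chips, cores, and hardware threads and compares two illustrative placement policies. The policy names "compact" and "scatter", the function names, and the example topology (2 chips, 4 cores per chip, 2 hardware threads per core) are all assumptions introduced here for illustration.

```python
from itertools import product


def logical_processors(chips, cores_per_chip, threads_per_core):
    """Enumerate logical processors as (chip, core, hw_thread) tuples."""
    return list(product(range(chips), range(cores_per_chip), range(threads_per_core)))


def compact_mapping(n_threads, chips, cores_per_chip, threads_per_core):
    """Hypothetical policy: fill a core's hardware threads, then the chip's
    cores, before moving to the next chip (keeps threads close together)."""
    procs = logical_processors(chips, cores_per_chip, threads_per_core)
    if n_threads > len(procs):
        raise ValueError("more software threads than logical processors")
    return procs[:n_threads]


def scatter_mapping(n_threads, chips, cores_per_chip, threads_per_core):
    """Hypothetical policy: alternate across chips first, then cores, then
    hardware threads (gives each thread more private resources)."""
    procs = logical_processors(chips, cores_per_chip, threads_per_core)
    if n_threads > len(procs):
        raise ValueError("more software threads than logical processors")
    # Sort so that the chip index varies fastest and the hw-thread index slowest.
    procs.sort(key=lambda p: (p[2], p[1], p[0]))
    return procs[:n_threads]


def chips_used(mapping):
    """Number of distinct chips a mapping touches."""
    return len({chip for chip, _core, _thread in mapping})


if __name__ == "__main__":
    for name, policy in (("compact", compact_mapping), ("scatter", scatter_mapping)):
        mapping = policy(4, chips=2, cores_per_chip=4, threads_per_core=2)
        print(name, mapping, "chips used:", chips_used(mapping))
```

Under these assumptions, the compact placement keeps four software threads on one chip, sharing cores and caches, while the scatter placement spreads them over both chips. This is the trade-off between communication overhead and available resources that the abstract describes. On Linux, a chosen placement could then be applied with a call such as `os.sched_setaffinity`, given a translation from these tuples to OS CPU ids, which is platform-specific and not shown.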