Performance optimization of load imbalanced workloads in large scale Dragonfly systems

Bogdan Prisacari; German Rodriguez; Cyriel Minkenberg; Marina Garcia; Enrique Vallejo; Ramon Beivide

doi:10.1109/HPSR.2015.7483107

HPSR 2015

Conference paper

01 Jun 2016

Abstract

Dragonfly topologies are among the most promising interconnect designs for enabling large, potentially exascale compute systems, particularly those expected to run workloads that are sensitive to system diameter and end-to-end latency. They are cost-effective designs with a very low diameter and close-to-optimal performance for workloads that induce a balanced load across the network. However, these benefits are offset by reduced path diversity, which leaves Dragonflies vulnerable to certain adversarial traffic patterns. The performance of such workloads can be significantly improved using indirect routing approaches. However, the indirect routing approach most commonly used today is in turn significantly vulnerable to a subset of these traffic patterns, for reasons that have not, until now, been entirely understood. In exploring this vulnerability, we provide a theoretical justification, based on inherent properties of the Dragonfly topology, of why performance degrades. Furthermore, we isolate what specifically in the structure of a traffic pattern makes it a worst case in this context, and are thus able to characterize the precise workload subset that will experience poor performance. Building on this understanding of the interaction that causes sub-optimal behavior, we then show how simple changes to either the routing strategy or the process-to-node assignment can bring performance back close to ideal levels. Finally, we not only provide a theoretical justification for our performance models, but also validate them via comprehensive simulation-based studies of systems with up to 16,512 nodes.
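
As a rough illustration of the system scale mentioned in the abstract, the sketch below computes the size of a maximal Dragonfly from three parameters. The parameter names (p nodes per router, a routers per group, h global links per router) and the balanced sizing rule a = 2p = 2h are assumptions borrowed from the common Dragonfly notation; the abstract itself does not state which configuration was simulated. Under those assumptions, h = 8 happens to give exactly 16,512 nodes.

```python
def dragonfly_size(p: int, a: int, h: int) -> dict:
    """Size of a maximal Dragonfly (assumed notation, not from the abstract).

    p: compute nodes attached to each router
    a: routers per group (fully connected within the group)
    h: global (inter-group) links per router
    """
    # Each group has a * h global links; with one link to every other
    # group, the largest system has a * h + 1 groups.
    groups = a * h + 1
    routers = a * groups
    nodes = p * routers
    return {"groups": groups, "routers": routers, "nodes": nodes}


def balanced_dragonfly(h: int) -> dict:
    """Apply the commonly cited balanced sizing a = 2p = 2h (an assumption here)."""
    return dragonfly_size(p=h, a=2 * h, h=h)


if __name__ == "__main__":
    # h = 8 gives 129 groups of 16 routers, 8 nodes per router.
    print(balanced_dragonfly(8))
```

Any pair of groups in such a maximal system is joined by a single global link, which is one way to picture the reduced path diversity the abstract refers to.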
