Optimizing distributed architectures to improve performance on checkpointing applications
Abstract
Satisfying the global throughput targets of each application in High Performance Computing (HPC) systems is a difficult task because a large number of architectural configuration choices have a considerable impact on overall system performance, such as the number of storage servers, the features of the communication links, and the number of CPU cores per node. In this paper we present a thorough comparative performance study of scaling up HPC cluster architectures using a checkpointing application model. The study focuses specifically on multi-core HPC clusters, and the scaling process is oriented towards the three main resources: computing power, communications, and storage. The main goal of this work is to evaluate and analyze how both scalability and the bottlenecks present in different multi-core HPC architectures evolve under different architectural configurations. To achieve this goal, a set of simulation experiments has been carried out using a simulation framework called SIMCAN, which is specifically designed for modeling and simulating HPC architectures. The results show that computing power is well served by multi-core processors, while the problems lie in the storage and the communication channels, with the storage network being the main bottleneck. © 2011 IEEE.