Techniques to improve the scalability of collective checkpointing at large scale

Bogdan Nicolae

doi:10.1109/HPCSim.2015.7237113

HPCS 2015

Conference paper

02 Sep 2015

Abstract

Scientific and data-intensive computing have matured over the last couple of years in all fields of science and industry. Their rapid increase in complexity and scale has prompted ongoing efforts dedicated to reaching exascale infrastructure capability by the end of the decade. However, advances in this context are not homogeneous: I/O capabilities in terms of networking and storage are lagging behind computational power and are often considered a major limitation that persists even at petascale [1].
