Cooperative checkpointing: A robust approach to large-scale systems reliability

Adam J. Oliner; Larry Rudolph; Ramendra K. Sahoo

doi:10.1145/1183401.1183406

ICS 2006

Conference paper

01 Dec 2006

Abstract

Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing. Copyright 2006 ACM.
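The core idea, that the runtime may skip a checkpoint the application has requested, can be illustrated with a minimal Python sketch. The `Gatekeeper` class, the `run` helper, and the expected-loss rule are all hypothetical names and assumptions chosen for illustration; the abstract does not specify the decision policy, so this should not be read as the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Hypothetical runtime component that grants or skips requested checkpoints."""

    checkpoint_cost: float  # seconds needed to write one checkpoint (assumed known)

    def should_checkpoint(self, work_at_risk: float, failure_prob: float) -> bool:
        # Illustrative rule (an assumption, not the paper's policy):
        # take the checkpoint only if the expected loss from skipping it
        # is at least the cost of writing it.
        return failure_prob * work_at_risk >= self.checkpoint_cost


def run(gatekeeper: Gatekeeper, interval: float, failure_probs: list) -> list:
    """Replay a sequence of checkpoint requests and return the decisions made."""
    decisions = []
    work_at_risk = 0.0
    for p in failure_probs:
        work_at_risk += interval  # work completed since the last saved state
        granted = gatekeeper.should_checkpoint(work_at_risk, p)
        decisions.append(granted)
        if granted:
            work_at_risk = 0.0  # state is saved, so nothing remains at risk
    return decisions
```

With a 60-second checkpoint cost, 600-second request intervals, and estimated failure probabilities of 0.01, 0.01, 0.5, and 0.01, this sketch skips the first two requests, grants the third once the failure estimate rises, and skips the fourth. A periodic scheme would have performed all four.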
