International Journal of Modelling and Simulation

Survey of backward error recovery techniques for multicomputers based on checkpointing and rollback

Backward error recovery, based on checkpointing and rollback, is often used to implement fault tolerance in multicomputer systems. During failure-free operation, process states are saved at regular intervals; after a fault is detected, the system is rolled back to a previously saved state. Four classes of techniques can be distinguished: semiautomatic techniques, message logging, coordinated checkpointing, and hybrid techniques. The authors survey these alternatives and discuss the overhead each may involve, allowing the user to choose an optimal checkpointing and rollback technique for given facilities and applications.
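
The basic cycle described above (save state periodically, roll back once a fault is detected, redo the lost work) can be illustrated with a minimal single-process sketch. It is not taken from the surveyed techniques: the class `CheckpointedProcess`, the function `run_with_recovery`, the in-memory checkpoint, and the injected fault are all hypothetical names and simplifications chosen for illustration. A real multicomputer system would write checkpoints to stable storage and would also have to handle messages exchanged between processes.

```python
import copy


class CheckpointedProcess:
    """Toy process whose state is a dict (hypothetical, for illustration only)."""

    def __init__(self, checkpoint_interval=3):
        self.state = {"step": 0, "total": 0}
        self.checkpoint_interval = checkpoint_interval
        # Assumption: the checkpoint lives in memory; a real system
        # would use stable storage that survives the fault.
        self._checkpoint = copy.deepcopy(self.state)

    def do_step(self, value):
        """Perform one unit of work, checkpointing every N steps."""
        self.state["step"] += 1
        self.state["total"] += value
        if self.state["step"] % self.checkpoint_interval == 0:
            self.take_checkpoint()

    def take_checkpoint(self):
        """Save a private copy of the current state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Restore the last saved state and return its step number."""
        self.state = copy.deepcopy(self._checkpoint)
        return self.state["step"]


def run_with_recovery(values, fault_at_step, checkpoint_interval=3):
    """Process all values; a single injected fault is detected just
    before step `fault_at_step` runs and triggers one rollback."""
    proc = CheckpointedProcess(checkpoint_interval)
    fault_pending = True
    i = 0
    while i < len(values):
        if fault_pending and i + 1 == fault_at_step:
            fault_pending = False
            # Resume from the checkpointed step; work done since the
            # checkpoint is lost and must be redone.
            i = proc.rollback()
            continue
        proc.do_step(values[i])
        i += 1
    return proc.state


if __name__ == "__main__":
    print(run_with_recovery([1, 2, 3, 4, 5], fault_at_step=5))
```

In this sketch the work performed between the last checkpoint and the fault is repeated after the rollback. This shows the trade-off behind the overhead mentioned in the abstract: a shorter checkpoint interval costs more during failure-free operation but loses less work on recovery.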