Journal of the ACM

Optimal design and use of retry in fault-tolerant computer systems

Download paper


In this paper, a new method is presented for (i) determining an optimal retry policy and (ii) using retry for fault characterization, which is defined as classification of the fault type and determination of fault durations. First, an optimal retry policy is derived for a given fault characteristic, which determines the maximum allowable retry durations so as to minimize the total task completion time. Then, the combined fault characterization and retry decision, in which the characteristic of a fault is estimated simultaneously with the determination of the optimal retry policy, are carried out. Two solution approaches are developed: one is based on point estimation and the other on Bayes sequential decision analysis. Numerical examples are presented in which all the durations associated with faults (i.e., active, benign, and interfailure durations) have monotone hazard rate functions (e.g., exponential Weibull and gamma distributions). These are standard distributions commonly used for modeling and analyses of faults. © 1988, ACM. All rights reserved.



Journal of the ACM