Publication
Journal of the ACM
Paper
Optimal design and use of retry in fault-tolerant computer systems
Abstract
This paper presents a new method for (i) determining an optimal retry policy and (ii) using retry for fault characterization, defined as classifying the fault type and determining fault durations. First, an optimal retry policy is derived for a given fault characteristic; the policy sets the maximum allowable retry durations so as to minimize the total task completion time. Then, combined fault characterization and retry decision is carried out, in which the characteristic of a fault is estimated simultaneously with the determination of the optimal retry policy. Two solution approaches are developed: one based on point estimation and the other on Bayes sequential decision analysis. Numerical examples are presented in which all the durations associated with faults (i.e., active, benign, and interfailure durations) have monotone hazard rate functions (e.g., exponential, Weibull, and gamma distributions). These are standard distributions commonly used for modeling and analysis of faults. © 1988, ACM. All rights reserved.
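To make the idea of a maximum allowable retry duration concrete, here is a minimal sketch under a deliberately simplified toy model. It is not the paper's formulation. The assumptions are all hypothetical: a fault is permanent with probability `p_permanent`; otherwise its active duration is exponential with rate `mu`; and abandoning retry costs a fixed `fallback_cost` (for example, a rollback or restart). The names `expected_completion_time` and `optimal_retry_duration` are illustrative only.

```python
import math


def expected_completion_time(r, p_permanent, mu, fallback_cost):
    """Expected delay when retrying for at most r time units (toy model).

    Assumptions (not from the paper): a transient fault stays active for an
    Exponential(mu) duration; a permanent fault never clears; giving up
    after r time units costs an extra fallback_cost.
    """
    # Transient fault that clears before r: integral of t*mu*exp(-mu*t) on [0, r].
    cleared = (1.0 - math.exp(-mu * r) * (1.0 + mu * r)) / mu
    # Transient fault still active at r: pay r plus the fallback.
    gave_up = math.exp(-mu * r) * (r + fallback_cost)
    transient = cleared + gave_up
    # Permanent fault: retry always fails, so pay r plus the fallback.
    permanent = r + fallback_cost
    return (1.0 - p_permanent) * transient + p_permanent * permanent


def optimal_retry_duration(p_permanent, mu, fallback_cost):
    """Closed-form minimizer of expected_completion_time over r >= 0.

    Setting the derivative (1 - p) * exp(-mu * r) * (1 - mu * C) + p to zero
    gives r* = ln((1 - p) * (mu * C - 1) / p) / mu when that value is positive.
    """
    gain = mu * fallback_cost - 1.0
    if gain <= 0.0 or p_permanent >= 1.0:
        return 0.0  # Retrying never pays off in this toy model.
    if p_permanent <= 0.0:
        return math.inf  # Every fault is transient: keep retrying.
    ratio = (1.0 - p_permanent) * gain / p_permanent
    return max(0.0, math.log(ratio) / mu)


if __name__ == "__main__":
    r_star = optimal_retry_duration(p_permanent=0.1, mu=2.0, fallback_cost=5.0)
    print("retry limit:", r_star)
    print("expected delay:", expected_completion_time(r_star, 0.1, 2.0, 5.0))
```

In this toy setting the retry limit grows as permanent faults become rarer or as the fallback becomes more expensive, and it drops to zero once `mu * fallback_cost <= 1`. The abstract's setting is more general: it covers Weibull and gamma durations and estimates the fault characteristic while retrying.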