Estimating system availability and reliability

Ambuj Goyal

WSC 1989

Conference paper

01 Dec 1989

Abstract

Methods for constructing and solving large Markov chain models of computer system availability and reliability are addressed. A set of powerful high-level modeling constructs is discussed that can be used to represent the failure and repair behavior of the components that constitute a system, including important component interactions. If time-independent failure and repair rates are assumed, then a time-homogeneous continuous-time Markov chain can be constructed automatically from the modeling constructs used to describe the system. Since the number of states in the Markov chain grows exponentially with the number of components modeled, simulation appears to be a practical way to solve models of large systems. However, standard simulation requires very long runs to estimate availability and reliability measures, because system failure is a rare event. Therefore, variance reduction techniques that can aid in computing rare-event probabilities quickly are of interest. The importance sampling technique has been found to be most useful. The modeling language and the simulation methods discussed have been implemented in a program package called the System Availability Estimator (SAVE).
