Publication
DSN 2008
Conference paper

Evaluating availability under quasi-heavy-tailed repair times


Abstract

The time required to recover from failures has a great impact on the availability of Information Technology (IT) systems. We define quasi-heavy-tailed distributions as those probability distributions for which the time series graph of the sample mean shows intermittent jumps in a given period. We find that the distribution of repair time is quasi-heavy-tailed for three IT systems: an in-house system hosted by IBM, a high performance computing system at the Los Alamos National Laboratory, and a distributed memory computer at the National Energy Research Scientific Computing Center. This means that the mean time to repair estimated from incidents observed within a certain period could change dramatically if we continue to observe incidents for another period. In other words, the estimated mean time to repair fluctuates widely over time. As a result, classical metrics based on the mean time to repair are not optimal for evaluating the availability of these systems. We propose to evaluate the availability of IT systems with the T-year return value, estimated based on extreme value theory. The T-year return value is the value that the repair time exceeds on average once every T years. We find that the T-year return value is a sound metric of the availability of the three IT systems. © 2008 IEEE.
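
The two quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not say which extreme value model or fitting method was used, so the sketch assumes a generalized Pareto tail above a threshold (peaks over threshold), with parameters that have already been fitted elsewhere. The function names `running_mean` and `gpd_return_level`, and the parameters `u`, `sigma`, `xi`, and `exceedances_per_year`, are hypothetical.

```python
import math


def running_mean(samples):
    """Return the sample mean after each successive observation.

    Plotting this sequence against the observation index gives the
    time series graph of the sample mean described in the abstract.
    """
    total = 0.0
    means = []
    for n, x in enumerate(samples, start=1):
        total += x
        means.append(total / n)
    return means


def gpd_return_level(u, sigma, xi, exceedances_per_year, t_years):
    """Level exceeded on average once every t_years.

    Assumption (not from the paper): repair times above the threshold u
    follow a generalized Pareto distribution with scale sigma and shape
    xi, and exceedances of u occur exceedances_per_year times per year.
    """
    m = exceedances_per_year * t_years  # expected exceedances of u in t_years
    if abs(xi) < 1e-12:
        # Limit of the formula as xi -> 0 (exponential tail).
        return u + sigma * math.log(m)
    return u + (sigma / xi) * (m ** xi - 1.0)
```

A single very long repair shows up as a jump in `running_mean`, which is the behaviour the abstract calls quasi-heavy-tailed. The return level, by contrast, depends on the fitted tail parameters rather than on the sample mean.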
