Publication
ISPASS 2008
Conference paper

Metrics for architecture-level lifetime reliability analysis

View publication

Abstract

This work concerns metrics for evaluating microarchitectural enhancements to improve processor lifetime reliability. A commonly reported reliability metric is mean time to failure (MTTF). Although the MTTF metric is simpler to evaluate, it does not provide information on the reliability characteristics during the relatively short operational life of commodity processors. An alternate metric is nTTF, which represents the time to failure of n% of the processor population. nTTF is a more informative metric for the (short) portion of the lifetime that is relevant to the end-user, but determining it requires knowledge of the distribution of processor failure times which is generally hard to obtain. The goals of this paper are (1) to determine if the choice of metric has a quantitative impact on architecture-level reliability analysis and modern superscalar processor designs and (2) to build a fundamental understanding of why and when MTTF- and nTTF-driven analysis result in different designs. We show through an in-depth analysis that, in general, the nTTF metric differs significantly from the MTTF metric, and using MTTF as a proxy for nTTF leads to sub-optimal designs. Additionally, our analysis introduces the concept of relative vulnerability factor (RVF) for different processor components to guide reliability-aware design. We show that the difference between nTTF- and MTTF-driven design largely occurs because the relative vulnerabilities of the processor components change over the processor lifetime, making the optimal design choice dependent on the amount of time the processor is expected to be used. © 2008 IEEE.

Date

Publication

ISPASS 2008

Authors

Topics

Share