A Rigorous Approach to Fault-Tolerant Programming
Abstract
The design of programs that are tolerant of hardware fault occurrences and processor crashes is investigated. Using a stable storage management system as a running example, a new approach is suggested for specifying, understanding, and verifying the correctness of faulttolerant software. The approach extends previously developed axiomatic reasoning methods to the design of fault-tolerant systems by modeling faults as being operations that are performed at random time intervals on any computing system by the system's adverse environment. A clear separation is made between the notions of software correctness and system reliability in the face of hardware malfunction. Fault hypotheses are specified axiomatically and design correctness is proven by using a programming logic extended with fault axioms and rules. Stochastic modeling is employed to verify reliability/availability system properties. The combined correctness and reliability verifications establish that, under given fault and reliability hypotheses, a system behaves according to its functional specifications with a probability greater than that required by its reliability specifications. Copyright © 1985 by The Institute of Electrical and Electronics Engineers, Inc.