Publication
IEEE TC
Paper
Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor
Abstract
Hypercube multiprocessors have recently offered a cost-effective and feasible approach to supercomputing through processor-level parallelism: a large number of low-cost processors, each with local memory, are directly connected and communicate by message passing instead of shared variables. This paper discusses the design of a fault-tolerant hypercube multiprocessor architecture. Most recently proposed fault-tolerance schemes for parallel architectures mainly address how to reconfigure the architecture once a faulty processor has been identified. These schemes assume the existence of an off-line diagnosis strategy that locates the faulty processor. We propose instead to detect and locate faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. We have implemented system-level error detection mechanisms for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: 1) matrix multiplication, 2) Gaussian elimination, and 3) fast Fourier transform. Schemes for other applications are under development. We have performed extensive studies of the error coverage of our system-level error detection schemes in the presence of finite-precision arithmetic, which affects our system-level encodings. Finally, the paper proposes two reconfiguration schemes that allow faulty processors to be isolated and replaced with spare processors. These reconfiguration schemes are integrated with the error detection schemes to form a truly fault-tolerant hypercube multiprocessor. © 1990 IEEE
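The abstract does not spell out the encodings used, so the following is only a minimal sketch of the general idea of algorithm-based error detection for matrix multiplication. It assumes the classic row/column checksum encoding, runs on a single machine rather than a hypercube, and is not the paper's actual scheme. All function names (`matmul`, `add_checksum_row`, `add_checksum_column`, `find_inconsistencies`) and the tolerance value are hypothetical. The tolerance is there because, in finite-precision arithmetic, a recomputed checksum generally matches the stored one only approximately.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_checksum_row(a):
    """Return a copy of a with an extra row holding each column's sum."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def add_checksum_column(b):
    """Return a copy of b with an extra entry per row holding that row's sum."""
    return [row + [sum(row)] for row in b]


def find_inconsistencies(c_full, tol=1e-9):
    """Compare recomputed sums of the data part against the stored checksums.

    c_full is the product of a checksum-row-encoded matrix and a
    checksum-column-encoded matrix, so its last row and last column
    should hold the column sums and row sums of the data part.
    Returns (bad_rows, bad_cols): indices whose checksum disagrees by
    more than tol, scaled by the largest magnitude in c_full
    (an assumed, illustrative threshold for rounding error).
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    scale = max(1.0, max(abs(x) for row in c_full for x in row))
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol * scale]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j])
                > tol * scale]
    return bad_rows, bad_cols


if __name__ == "__main__":
    a = [[1.5, 2.0], [3.0, 4.25]]
    b = [[0.5, 1.0], [2.0, -1.0]]
    c_full = matmul(add_checksum_row(a), add_checksum_column(b))
    print(find_inconsistencies(c_full))   # fault-free product
    c_full[1][0] += 0.5                   # simulate one corrupted element
    print(find_inconsistencies(c_full))   # reports the affected row and column
```

In this sketch a single corrupted element shows up as exactly one inconsistent row and one inconsistent column, whose intersection locates the error. How such a location maps back to a specific faulty processor depends on how the matrices are partitioned across nodes, which the abstract does not describe.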