Reliability challenges and system performance at the architecture level
Abstract
Reliability challenges and system performance at the architecture level are discussed. In modern computer systems, power and energy are the primary design constraints, which has increased the use of inline concurrent error detection (CED) techniques, in both hardware and software, to achieve reliability comparable to that of modular redundancy. Modern processors use a variety of CED techniques, along with parity checking, to detect errors in data paths and in some register files. The introduction of hybrid systems with accelerators, together with the wide use of low-power commodity off-the-shelf (COTS) components, enables software-level techniques to provide reliability efficiently. In memory subsystems, errors can be detected through parity checking or through error-correcting codes (ECC), which can also correct them. Memory scrubbing corrects single-bit errors in the background while memory is idle, preventing them from accumulating into multibit errors. Architectural techniques and mechanisms must be incorporated into the design process, both for ease of design and to reduce the cost of building robust systems.
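As an illustration of the scrubbing idea summarized above, the following is a minimal sketch, not a description of any particular memory controller. It assumes a toy Hamming(7,4) single-error-correcting code, whereas real ECC memory typically uses wider codes that also detect double-bit errors. The names `encode`, `syndrome`, `decode`, and `scrub` are hypothetical and chosen only for this example.

```python
def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                                 # index 0 is unused
    code[3], code[5], code[6], code[7] = d         # data bits
    code[1] = code[3] ^ code[5] ^ code[7]          # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]          # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]          # parity over positions 4,5,6,7
    return sum(code[p] << (p - 1) for p in range(1, 8))


def syndrome(word):
    """Return 0 for a clean word, else the 1-based position of a single flipped bit."""
    s = 0
    for p in range(1, 8):
        if (word >> (p - 1)) & 1:
            s ^= p
    return s


def decode(word):
    """Extract the 4 data bits from a codeword (no correction is applied here)."""
    bits = [(word >> (p - 1)) & 1 for p in (3, 5, 6, 7)]
    return sum(b << i for i, b in enumerate(bits))


def scrub(memory):
    """Walk every word, repair single-bit errors in place, return the repair count."""
    corrected = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            memory[addr] = word ^ (1 << (s - 1))   # flip the faulty bit back
            corrected += 1
    return corrected


if __name__ == "__main__":
    memory = [encode(n) for n in (0b1011, 0b0110)]
    memory[0] ^= 1 << 4          # first transient fault in word 0
    scrub(memory)                # background pass repairs it
    memory[0] ^= 1 << 0          # a later, independent fault in the same word
    scrub(memory)
    print(decode(memory[0]) == 0b1011)
```

In this sketch the two faults hit the same word at different times. Because a scrub pass runs between them, each is still a correctable single-bit error. Had both bits flipped before any pass, the single-error-correcting code would miscorrect the word, which is the multibit case that scrubbing is meant to prevent.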