Publication
ASP-DAC 2014
Conference paper
Soft error resiliency characterization on IBM BlueGene/Q processor
Abstract
The principal goal of this project is to understand the soft error resiliency of BG/Q hardware when running high-performance applications. Based on the failures we encountered in the beaming experiment, we projected a mean time between failures (MTBF) for a 20 PetaFLOP, 96-rack system running a comparable workload mix. The expected MTBF for detectable-and-uncorrectable failures is calculated to be 51 days at sea level. The worst-case expected detectable-and-uncorrectable MTBF is calculated to be 35 days, assuming the system runs QCD with 20% utilization of the messaging unit (MU). These are outstanding results for a machine of this magnitude. The beaming experiment and projected MTBFs significantly validate the necessity of including autonomous hardware detection and recovery techniques at the cost of design effort, silicon area and power.
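To give a feel for what the system-level figures imply, the sketch below converts the two quoted MTBFs into per-rack values. It is not the paper's projection method: it assumes that racks fail independently with equal, constant failure rates, so that system MTBF scales inversely with rack count. The function names and the 48-rack case are hypothetical and used only for illustration.

```python
def per_unit_mtbf(system_mtbf_days: float, units: int) -> float:
    """MTBF of a single unit, given the MTBF of a system of `units` units.

    Assumption (not from the paper): units fail independently with equal,
    constant failure rates, so failure rates add across units.
    """
    return system_mtbf_days * units


def system_mtbf(unit_mtbf_days: float, units: int) -> float:
    """MTBF of a system of `units` units under the same assumption."""
    return unit_mtbf_days / units


RACKS = 96  # system size quoted in the abstract

# Figures quoted in the abstract (days, detectable-and-uncorrectable failures).
typical_rack = per_unit_mtbf(51.0, RACKS)  # sea level, comparable workload mix
worst_rack = per_unit_mtbf(35.0, RACKS)    # QCD with 20% MU utilization

print(f"Implied per-rack MTBF (typical): {typical_rack:.0f} days")
print(f"Implied per-rack MTBF (worst case): {worst_rack:.0f} days")

# Hypothetical 48-rack system, for illustration only.
print(f"48-rack system MTBF (typical): {system_mtbf(typical_rack, 48):.0f} days")
```

Under this simple scaling, halving the number of racks doubles the expected time between detectable-and-uncorrectable failures. Real systems may deviate from it if failures are correlated or if the workload changes with system size.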