ASP-DAC 2014
Conference paper

Soft error resiliency characterization on IBM BlueGene/Q processor

The principal goal of this project is to understand the soft error resiliency of BG/Q hardware when running high-performance applications. Based on the failures we encountered in the beaming experiment, we projected a mean time between failures (MTBF) for a 20-petaFLOP, 96-rack system running a comparable workload mix. The expected MTBF for detectable-and-uncorrectable failures is calculated to be 51 days at sea level. The worst-case expected detectable-and-uncorrectable MTBF is calculated to be 35 days, assuming the system runs QCD with 20% utilization of the messaging unit (MU). These are outstanding results for a machine of this magnitude. The beaming experiment and the projected MTBFs strongly validate the need to include autonomous hardware detection and recovery techniques, at the cost of design effort, silicon area, and power.
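
The abstract does not give the projection formula, so the following is only a minimal sketch of how a failure count from an accelerated beam test is commonly scaled to a full-system MTBF. It assumes a constant failure rate that scales linearly with particle flux and with the number of chips. The function name, its parameters, and every number in the usage line are hypothetical and are not taken from the paper.

```python
def projected_mtbf_days(failures, beam_hours, acceleration_factor,
                        chips_under_beam, chips_in_system):
    """Project a system-level MTBF (in days) from an accelerated beam test.

    Assumptions (not from the paper): failures arrive at a constant rate,
    the rate is proportional to particle flux, and chips fail independently.
    """
    if failures <= 0:
        raise ValueError("at least one observed failure is needed for a point estimate")
    # Failure rate per chip-hour while exposed to the beam.
    beam_rate = failures / (beam_hours * chips_under_beam)
    # Scale down to the natural (for example, sea-level) flux.
    field_rate_per_chip = beam_rate / acceleration_factor
    # Independent chips: system rate is the sum of per-chip rates.
    system_rate = field_rate_per_chip * chips_in_system
    # MTBF is the reciprocal of the rate; convert hours to days.
    return 1.0 / system_rate / 24.0


# Hypothetical inputs, chosen only to exercise the arithmetic.
example = projected_mtbf_days(failures=10, beam_hours=5.0,
                              acceleration_factor=1e8,
                              chips_under_beam=1, chips_in_system=100_000)
print(f"projected MTBF: {example:.1f} days")
```

Under these assumptions the projected MTBF is inversely proportional to the number of chips, which is why a per-chip failure rate that looks negligible still matters at the scale of 96 racks.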