Soft error resiliency characterization on IBM BlueGene/Q processor

Chen-Yong Cher; K. Paul Muller; Ruud A. Haring; David L. Satterfield; Thomas E. Musta; Thomas M. Gooding; Kristan D. Davis; Marc B. Dombrowa; Gerard V. Kopcsay; Robert M. Senger; Yutaka Sugawara; Krishnan Sugavanam

doi:10.1109/ASPDAC.2014.6742920

ASP-DAC 2014

Conference paper

27 Mar 2014

Abstract

The principal goal of this project is to understand the soft error resiliency of BG/Q hardware when running high-performance applications. Based on the failures we encountered in the beaming experiment, we projected a mean-time-between-failure (MTBF) for a 20 PetaFLOP, 96-rack system running a comparable workload mix. The expected MTBF for detectable-and-uncorrectable failures is calculated to be 51 days at sea level. The worst-case expected detectable-and-uncorrectable MTBF is calculated to be 35 days, assuming the system runs QCD with 20% utilization of the messaging unit (MU). These are outstanding results for a machine of this magnitude. The beaming experiment and projected MTBFs strongly validate the necessity of including autonomous hardware detection and recovery techniques at the cost of design effort, silicon area and power.
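
To put the projected figures in per-rack terms, the following is a minimal sketch, not the paper's projection method. It assumes that failures are independent across racks and occur at a constant rate, so that system MTBF scales inversely with rack count. The function names are hypothetical; only the 96-rack count and the 51-day and 35-day MTBFs come from the abstract.

```python
# Illustrative only: assumes independent, constant-rate failures across
# identical racks, so that system MTBF = per-rack MTBF / number of racks.
# This scaling rule is an assumption, not a result stated in the abstract.

RACKS = 96  # system size quoted in the abstract


def implied_rack_mtbf(system_days: float, racks: int) -> float:
    """Per-rack MTBF (days) implied by a system-level MTBF."""
    return system_days * racks


def scaled_system_mtbf(rack_days: float, racks: int) -> float:
    """System-level MTBF (days) for a machine built from `racks` racks."""
    return rack_days / racks


expected_rack = implied_rack_mtbf(51, RACKS)    # sea-level workload mix
worst_case_rack = implied_rack_mtbf(35, RACKS)  # QCD, 20% MU utilization

print(f"Implied per-rack MTBF (expected):   {expected_rack:.0f} days")
print(f"Implied per-rack MTBF (worst case): {worst_case_rack:.0f} days")
print(f"48-rack system MTBF (expected):     "
      f"{scaled_system_mtbf(expected_rack, 48):.0f} days")
```

Under these assumptions, halving the rack count doubles the system-level MTBF. Real systems may deviate from this if failures are correlated or if the workload mix changes with scale.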
