Asymmetric Resilience for Accelerator-Rich Systems

Jingwen Leng; Alper Buyuktosunoglu; Ramon Bertran; Pradip Bose; Vijay Janapa Reddi

doi:10.1109/LCA.2019.2917898

IEEE Computer Architecture Letters

Paper

01 Jan 2019

Abstract

Accelerators are becoming popular owing to their exceptional performance and power efficiency. However, researchers have yet to pay close attention to their reliability, a key concern as technology scaling makes reliable systems harder to build. A straightforward way to make an accelerator reliable is to design it from the ground up to be reliable by itself. However, such a myopic view of the system, in which each accelerator is designed in isolation, is unsustainable as the number of accelerators integrated into SoCs continues to rise. To address this challenge, we propose a paradigm called asymmetric resilience that avoids accelerator-specific reliability design. Instead, its core principle is to build the reliable heterogeneous system around the CPU architecture. We explain the implications of architecting such a system and the modifications a heterogeneous system needs in order to adopt this approach. As an example, we demonstrate how asymmetric resilience can handle GPU execution errors using the CPU with minimal overhead. The general principles can be extended to other accelerators.
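As a rough illustration of the CPU-centered idea, the following sketch shows one way a host could take charge of recovery when an accelerator reports an execution error. It is not the mechanism from the paper: it assumes the accelerator can only detect faults (modeled here by a hypothetical `AcceleratorError` exception), that the CPU keeps a trusted copy of the kernel's input, and that a failed kernel can simply be re-launched on a fresh copy. The names `run_with_cpu_recovery` and `make_flaky_kernel` are invented for this example.

```python
import copy
from typing import Any, Callable, Dict, Tuple


class AcceleratorError(Exception):
    """Hypothetical signal: the accelerator detected an execution error."""


def run_with_cpu_recovery(
    kernel: Callable[[Dict[str, Any]], Dict[str, Any]],
    state: Dict[str, Any],
    max_retries: int = 3,
) -> Tuple[Dict[str, Any], int]:
    """Run `kernel` with detection on the accelerator and recovery on the CPU.

    The CPU-held `state` acts as the checkpoint and is never mutated.
    Returns the kernel result and the number of launches that were needed.
    """
    for attempt in range(1, max_retries + 2):
        # Hand the accelerator a private working copy of the checkpoint.
        working_copy = copy.deepcopy(state)
        try:
            return kernel(working_copy), attempt
        except AcceleratorError:
            # Possibly corrupted working copy is discarded; the CPU retries.
            continue
    raise RuntimeError("accelerator kept failing after CPU-driven retries")


def make_flaky_kernel(failures: int) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a stand-in kernel that reports a fault on its first `failures` launches."""
    remaining = [failures]

    def kernel(data: Dict[str, Any]) -> Dict[str, Any]:
        # Mutate first, so a fault leaves partially updated state behind.
        data["x"] = data["x"] * 2
        if remaining[0] > 0:
            remaining[0] -= 1
            raise AcceleratorError("fault detected during kernel execution")
        return data

    return kernel


if __name__ == "__main__":
    checkpoint = {"x": 21}
    result, launches = run_with_cpu_recovery(make_flaky_kernel(2), checkpoint)
    print(result, launches, checkpoint)
```

In this sketch the accelerator side carries no recovery logic at all, which is the asymmetry the abstract describes. The cost of recovery is the checkpoint copy plus any re-launches, both paid on the CPU side.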