Conference paper

The resilience wall: Cross-layer solution strategies


Resilience to hardware failures is a key challenge for a large class of future computing systems that are constrained by the so-called power wall: from embedded systems to supercomputers. Today's mainstream computing systems typically assume that transistors and interconnects operate correctly throughout the useful system lifetime. With enormous complexity and significantly increased vulnerability to failures compared to the past, future system designs cannot rely on such assumptions. At the same time, there is explosive growth in our dependency on such systems. To overcome this outstanding challenge, this paper advocates and examines a cross-layer resilience approach. Two major components of this approach are: (1) system- and software-level effects of circuit-level faults are considered from the early stages of system design; and (2) resilience techniques are implemented across multiple layers of the system stack, from circuit and architecture levels to runtime and applications, such that they work together to achieve the required degrees of resilience in a highly energy-efficient manner. Illustrative examples that demonstrate key aspects of cross-layer resilience are discussed. © 2014 IEEE.