Characterization and exploration of latch checkers for efficient RAS protection

Karthik Swaminathan; Ramon Bertran; Douglas Balazich; Alper Buyuktosunoglu; Arvind Haran; Sean Carey; Karl Anderson; Hans Jacobson; Matthias Pflanz; Pradip Bose

DSN 2023

Conference paper

27 Jun 2023

Abstract

Reliability has been, and continues to be, a key consideration in the design of the IBM Z mainframe processors, and has resulted in industry-leading performance with little-to-no downtime. In this paper, we analyze the various hardware reliability mechanisms that make the processor resilient to transient errors, and the checker architecture that enables their detection and correction. We characterize the error checking logic in the processor based on a detailed analysis of the actual design. Based on hardware measurements on a real Z processor, we then determine which error checkers are critical from a timing standpoint when the supply voltage is scaled. We propose algorithms that optimize checker selection without affecting the RAS coverage or the detection of errors caused by both SER and voltage scaling. Finally, we examine further potential optimizations of checkers based on the logic utilization in representative benchmarks.