"Memory loss" in commodity hardware? Predicting DIMM failures with machine learning

Ioana Giurgiu; Dorothea Wiesmann; John Bird

doi:10.1145/3078468.3078486

SYSTOR 2017

Conference paper

22 May 2017

Abstract

Failures of memory modules have long been a concern, as they are costly in terms of both hardware replacement and service disruption. These failures can be preceded by correctable (soft) and then uncorrectable (hard) errors, which accumulate over time. Valuable large-scale studies of DIMM errors in the wild [2, 1] analyze hard and soft errors in depth, along with their correlations with specific sensors. However, little has been reported on how these findings could be used to automatically predict future DIMM failures. We show that, by understanding which factors drive such failures, we can build predictive models with off-the-shelf machine learning techniques that predict DIMM failures ahead of time with high accuracy. Such models not only provide early signs of failure but also allow administrators to proactively replace at-risk DIMMs weeks in advance, thus avoiding "memory loss" in their commodity hardware.
