Reliability assurance of RAID storage systems for a wide range of latent sector errors
Abstract
Low-cost disk drives, which are increasingly being adopted in today's data storage systems, offer higher capacity but lower reliability, which leads to more frequent rebuilds and to a higher risk of unrecoverable or latent media errors. An intra-disk redundancy scheme has been proposed to cope with such errors and enhance the reliability of RAID systems. Empirical field results recently reported in the literature, however, suggest that unrecoverable media errors occur more often than the data sheet specifications provided by the disk manufacturers indicate. Our results demonstrate that the reliability improvement due to intra-disk redundancy is adversely affected by the increase in the number of unrecoverable errors. We also demonstrate that, by revising the parameter choice of the intra-disk redundancy scheme, we can obtain essentially the same reliability as that of a system operating without unrecoverable sector errors. The I/O and throughput performance are evaluated by means of analysis and event-driven simulations. The effects of the spatial locality of errors and of the error-burst length distribution on the system reliability are also investigated. © 2008 IEEE.