About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
DSN 2005
Conference paper
Probabilistic QoS guarantees for supercomputing systems
Abstract
Supercomputing systems must be able to reliably and efficiently complete their assigned workloads, even in the presence of failures. This paper proposes a system that allows the system and users to negotiate a mutually desirable risk strategy; in order to accomplish this, the system makes probabilistic guarantees on quality of service (QoS), of the form, "Job j can be completed by deadline d with probability p." In order to make such guarantees, the system uses event prediction (forecasting) in conjunction with fault-aware job scheduling and cooperative check-pointing strategies. Using job logs and failure traces from actual high performance computing systems, we employ trace-based simulations to assess the effects of the prediction accuracy (a) and user risk strategy (U) on a variety of performance metrics. Compared to a system that does not use event prediction, a high forecasting accuracy resulted in QoS and utilization improvements of as much as 6%, along with an 89% reduction in the amount of lost work. Therefore, our results show that a system that makes probabilistic QoS guarantees using a market-based scheduling approach can increase both system performance and reliability. © 2005 IEEE.