Publication
IWQoS 2015
Conference paper
Catching failures of failures at big-data clusters: A two-level neural network approach
Abstract
Big-data applications are becoming the core of today's business operations, featuring complex data structures and high task fan-out. According to the publicly available Google trace, more than 40% of big-data jobs do not reach successful completion. Interestingly, a significant portion of the tasks of such failed jobs undergo multiple types of repetitive failed executions and consume a non-negligible amount of resources. To conserve resources in big-data clusters, it is imperative to capture such failed tasks of failed jobs, a very challenging problem due to the multiple types of failures associated with tasks and the highly uneven task distribution. In this paper, we develop an online two-level Neural Network (NN) model that can accurately untangle the complex dependencies among tasks and jobs and predict their execution classes in an extremely dynamic and heterogeneous system. Our proposed NN model first predicts the job class and then one of three classes of failed tasks of failed jobs, based on a sliding learning window. Furthermore, we develop resource conservation policies that terminate failed tasks of failed jobs after a grace period derived from prediction confidences and task execution times. Evaluating our approach on a Google cluster trace, we are able to accurately capture failures of failures at big-data clusters, limit false negatives to 1%, and efficiently save system resources, achieving significant reductions of CPU, memory, and disk consumption of up to 49%.
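To make the two-level idea concrete, the following is a minimal sketch of such a pipeline: a first classifier predicts whether a job will fail, a second classifies the failure class of tasks belonging to predicted-failed jobs, both trained on a sliding window of recently completed executions, with a grace period scaled by prediction confidence. The feature sets, window size, network sizes, decision threshold, and grace-period formula are illustrative assumptions, not the paper's exact design, and off-the-shelf scikit-learn MLPs stand in for the authors' neural networks.

```python
# Hedged sketch of a two-level job/task failure predictor with a
# confidence-based grace period. All concrete choices (features, window
# size, thresholds, class labels 0..2) are assumptions for illustration.
from collections import deque
import numpy as np
from sklearn.neural_network import MLPClassifier

WINDOW = 5000  # sliding learning window over recent executions (assumed size)

class TwoLevelPredictor:
    def __init__(self):
        self.job_nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300)
        self.task_nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300)
        # Labelled history: (feature_vector, label) pairs kept in a sliding window.
        self.job_win = deque(maxlen=WINDOW)   # job label: 0 = succeeded, 1 = failed
        self.task_win = deque(maxlen=WINDOW)  # task label: failure class 0, 1, or 2

    def observe(self, job_feat, job_label, task_feat=None, task_label=None):
        # Record a finished job; tasks of failed jobs also feed the second level.
        self.job_win.append((job_feat, job_label))
        if task_feat is not None:
            self.task_win.append((task_feat, task_label))

    def retrain(self):
        # Refit both levels on the current window.
        Xj, yj = map(np.array, zip(*self.job_win))
        self.job_nn.fit(Xj, yj)
        Xt, yt = map(np.array, zip(*self.task_win))
        self.task_nn.fit(Xt, yt)

    def predict(self, job_feat, task_feat):
        # Level 1: probability that the job fails.
        p_fail = self.job_nn.predict_proba([job_feat])[0][1]
        if p_fail < 0.5:                      # assumed decision threshold
            return "job_ok", p_fail, None
        # Level 2: most likely of the three task-failure classes.
        probs = self.task_nn.predict_proba([task_feat])[0]
        cls = int(np.argmax(probs))
        return "job_fail", p_fail, (cls, float(probs[cls]))

def grace_period(confidence, mean_task_runtime):
    # Assumed policy: lower prediction confidence buys the task a longer
    # grace period before it is terminated to conserve resources.
    return (1.0 - confidence) * mean_task_runtime
```

The design intent mirrors the abstract: tying the grace period to prediction confidence and observed task execution times means low-confidence predictions are given more time before termination, which is one way to keep the false-negative rate low while still reclaiming resources from tasks that are very likely to fail.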