H2O: A hybrid and hierarchical outlier detection method for large scale data protection

Quan Zhang; Mu Qiao; Ramani Routray; Weisong Shi

doi:10.1109/BigData.2016.7840715

Big Data 2016

Conference paper

05 Dec 2016

H₂O: A hybrid and hierarchical outlier detection method for large scale data protection

View publication

Abstract

Data protection is the process of backing up data in case of a data loss event. It is one of the most critical routine activities for every organization. Detecting abnormal backup jobs is important to prevent data protection failures and ensure the service quality. Given the large scale backup endpoints and the variety of backup jobs, from a backup-as-a-service provider viewpoint, we need a scalable and flexible outlier detection method that can model a huge number of objects and well capture their diverse patterns. In this paper, we introduce H2O, a novel hybrid and hierarchical method to detect outliers from millions of backup jobs for large scale data protection. Our method automatically selects an ensemble of outlier detection models for each multivariate time series composed by the backup metrics collected for each backup endpoint by learning their exhibited characteristics. Interactions among multiple variables are considered to better detect true outliers and reduce false positives. In particular, a new seasonal-trend decomposition based outlier detection method is developed, considering the interactions among variables in the form of common trends, which is robust to the presence of outliers in the training data. The model selection process is hierarchical, following a global to local fashion. The final outlier is determined through an ensemble learning by multiple models. Built on top of Apache Spark, H2O has been deployed to detect outliers in a large and complex data protection environment with more than 600,000 backup endpoints and 3 million daily backup jobs. To the best of our knowledge, this is the first work that selects and constructs large scale outlier detection models for multivariate time series on big data platforms.

Conference paper