Log-based Abnormal Task Detection and Root Cause Analysis for Spark
Application delays caused by abnormal tasks are a common problem in big data computing frameworks. An abnormal task in Spark, which may run slowly without producing error or warning logs, not only reduces the performance of the node it runs on, but also affects the efficiency of other nodes. Spark log files report neither the root causes of abnormal tasks nor where and when abnormal scenarios happen. Although Spark provides a 'speculation' mechanism to detect straggler tasks, it can only detect stragglers at the tail of each stage. Because the root causes of abnormalities are complicated, there are no effective ways to detect them. This paper proposes an approach that detects abnormalities and analyzes their root causes using Spark log files. Unlike common online monitoring or analysis tools, our approach is a purely offline method that can analyze abnormalities accurately. Our approach consists of four steps. First, a parser preprocesses raw log files to generate structured log data. Second, in each stage of a Spark application, we choose features related to the execution time and data locality of each task, as well as the memory usage and garbage collection of each node. Third, based on the selected features, we detect where and when abnormalities happen. Finally, we analyze the problems using weighted factors to decide the probability of each root cause. In this paper, we consider four potential root causes of abnormalities: CPU, memory, network, and disk. The proposed method has been tested on real-world Spark benchmarks. To simulate various root-cause scenarios, we conducted interference injections related to CPU, memory, network, and disk. Our experimental results show that the proposed approach is accurate in detecting abnormal tasks as well as in finding their root causes.
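To make the third and fourth steps concrete, the following is a minimal Python sketch, not the method evaluated in this paper. It assumes that the parser has already produced per-task durations for one stage and per-node features normalized to [0, 1]. The median-based threshold, the function names, the feature names, and all weights are hypothetical choices made for illustration only.

```python
from statistics import median


def detect_abnormal_tasks(durations, factor=1.5):
    """Return ids of tasks whose duration exceeds factor * stage median.

    durations: dict mapping task id -> duration in seconds (one stage).
    factor: hypothetical threshold multiplier, not taken from the paper.
    """
    if not durations:
        return []
    threshold = factor * median(durations.values())
    return sorted(t for t, d in durations.items() if d > threshold)


def root_cause_probabilities(features, weights):
    """Turn weighted feature scores into one probability per root cause.

    features: dict feature name -> value in [0, 1] (assumed normalized).
    weights: dict cause -> dict feature name -> non-negative weight.
    Both the feature names and the weights are illustrative assumptions.
    """
    scores = {
        cause: sum(w * features.get(name, 0.0) for name, w in fw.items())
        for cause, fw in weights.items()
    }
    total = sum(scores.values())
    if total == 0:
        # No evidence for any cause: fall back to a uniform distribution.
        return {cause: 1.0 / len(scores) for cause in scores}
    return {cause: s / total for cause, s in scores.items()}


if __name__ == "__main__":
    # Hypothetical task durations (seconds) for a single stage.
    stage = {"t0": 2.0, "t1": 2.2, "t2": 1.9, "t3": 6.5}
    print(detect_abnormal_tasks(stage))

    # Hypothetical normalized features and weights for the four causes.
    features = {"gc_ratio": 0.8, "mem_usage": 0.7, "non_local_ratio": 0.1}
    weights = {
        "cpu": {"gc_ratio": 0.2},
        "memory": {"gc_ratio": 0.5, "mem_usage": 0.5},
        "network": {"non_local_ratio": 1.0},
        "disk": {"non_local_ratio": 0.3},
    }
    print(root_cause_probabilities(features, weights))
```

Comparing each task against the median of its own stage, rather than against a global threshold, reflects the per-stage feature selection described above. Dividing each weighted score by the total is one simple way to obtain values that sum to one across the four candidate causes.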