LADRA: Log-based abnormal task detection and root-cause analysis in big data processing with Spark
As big data processing is adopted across many domains, the massive amounts of generated data increasingly rely on parallel computing platforms for analysis, among which Spark is one of the most widely used frameworks. Abnormal tasks in Spark may cause significant performance degradation, and detecting them and diagnosing their root causes is extremely challenging. To that end, we propose a tool, named LADRA, for log-based abnormal task detection and root-cause analysis using Spark logs. In LADRA, a log parser first converts raw log files into structured data and extracts features. Then, a detection method identifies where and when abnormal tasks occur. To analyze root causes, we further extract pre-defined factors based on these features. Finally, we leverage a General Regression Neural Network (GRNN) to identify the root causes of abnormal tasks. The likelihood of each reported root cause is presented to users according to the factors weighted by the GRNN. LADRA is an off-line tool that can accurately analyze abnormality without extra monitoring overhead. Four potential root causes are considered: CPU, memory, network, and disk I/O. We have tested LADRA on top of three Spark benchmarks by injecting the aforementioned root causes. Experimental results show that the proposed approach is more accurate in root-cause analysis than existing methods.
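To make the final step more concrete, the following is a minimal sketch of how a GRNN can turn a factor vector into per-cause likelihood scores. It is not LADRA's implementation: the factor values, the training samples, the smoothing parameter `sigma`, and the function name `grnn_predict` are all hypothetical and chosen only for illustration. The sketch relies on the standard GRNN formulation, in which the prediction is a Gaussian-kernel-weighted average of the training targets.

```python
import math

# The four root causes considered in the abstract.
CAUSES = ["CPU", "memory", "network", "disk I/O"]


def grnn_predict(train_x, train_y, query, sigma=0.5):
    """Return the kernel-weighted average of the training targets.

    train_x: list of factor vectors (hypothetical values)
    train_y: list of target vectors, here one-hot over CAUSES
    query:   factor vector of the abnormal task to diagnose
    sigma:   Gaussian smoothing parameter (assumed value)
    """
    # Pattern layer: one Gaussian weight per training sample.
    weights = []
    for x in train_x:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))

    total = sum(weights)
    n_out = len(train_y[0])
    if total == 0.0:
        # Every weight underflowed: fall back to a uniform score.
        return [1.0 / n_out] * n_out

    # Summation and output layers: normalized weighted sum per cause.
    return [
        sum(w * y[k] for w, y in zip(weights, train_y)) / total
        for k in range(n_out)
    ]


# Hypothetical training set: one factor vector per injected root cause.
train_x = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.9],
]
train_y = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Hypothetical factor vector extracted for one abnormal task.
query = [0.8, 0.2, 0.1, 0.1]
scores = grnn_predict(train_x, train_y, query)
ranked = sorted(zip(CAUSES, scores), key=lambda pair: pair[1], reverse=True)
for cause, score in ranked:
    print(f"{cause}: {score:.3f}")
```

Because the targets are one-hot, the scores sum to one and can be read as relative likelihoods. A GRNN has no iterative training phase, so the only parameter to tune in this sketch is `sigma`.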