Utilizing Persistence for Post Facto Suppression of Invalid Anomalies Using System Logs
The robustness and availability of cloud services are becoming increasingly important as more applications migrate to the cloud. The operations landscape today is more complex, than ever. Site reliability engineers (SREs) are expected to handle more incidents than ever before with shorter service-level agreements (SLAs). By exploiting log, tracing, metric, and network data, Artificial Intelligence for IT Operations (AIOps) enables detection of faults and anomalous issues of services. A wide variety of anomaly detection techniques have been incorporated in various AIOps platforms (e.g. PCA and autoencoder), but they all suffer from false positives. In this paper, we propose an unsupervised approach for persistent anomaly detection on top of the traditional anomaly detection approaches, with the goal of reducing false positives and providing more trustworthy alerting signals. We test our method on both simulated and real-world datasets. Our technique reduces false positive anomalies by at least 28%, resulting in more reliable and trustworthy notifications.