Hybrid Anomaly Detection and Prioritization for Network Logs at Cloud Scale
Monitoring the health of large-scale systems requires significant manual effort, usually through the continuous curation of alerting rules based on keywords, thresholds and regular expressions, which might generate a flood of mostly irrelevant alerts and obscure the actual information operators would like to see. Existing approaches try to improve the observability of systems by intelligently detecting anomalous situations. Such solutions surface anomalies that are statistically significant, but may not represent events that reliability engineers consider relevant. We propose ADEPTUS, a practical approach for detection of relevant health issues in an established system. ADEPTUS combines statistics and unsupervised learning to detect anomalies with supervised learning and heuristics to determine which of the detected anomalies are likely to be relevant to the Site Reliability Engineers (SREs). ADEPTUS overcomes the labor-intensive prerequisite of obtaining anomaly labels for supervised learning by automatically extracting information from historic alerts and incident tickets. We leverage ADEPTUS for observability in the network infrastructure of IBM Cloud. We perform an extensive real-world evaluation on 10 months of logs generated by tens of thousands of network devices across 11 data centers and demonstrate that ADEPTUS achieves higher alerting accuracy than the rule-based log alerting solution, curated by domain experts, used by SREs daily.