Metadata-based retrieval for resolution recommendation in AIOps
Abstract
For a cloud service provider, the goal is to proactively identify signals that can help reduce outages and/or reduce the mean-time-to-detect and mean-time-to-resolve. After an incident is reported, the Site Reliability Engineers diagnose the fault and search for a resolution by formulating a textual query to find similar historical incidents - this approach is called text-based retrieval. However, it has been observed that the formulated queries are inadequate and short. An alternate approach, presented in this paper, integrates information spread across heterogeneous and siloed datasets, as a ready-to-use knowledge base for metadata-based resolution retrieval. Additionally, it exploits historical problem context for building metadata prediction models which are used at run-time for automatically formulating queries from log anomalies detected by the Log Anomaly Detection module. The query, thus formed, is run against the metadata-based index, unlike the text-based index in text retrieval, resulting in superior performance, in terms of relevancy of the resolution documents retrieved. Through experiments on web application server applications deployed on the cloud, we show the efficacy of metadata-based retrieval, which not only returns targeted results as compared to text-based retrieval but also the relevant resolution document appear amongst the top 3 positions for 60% of the queries.