Publication
NeurIPS 2023
Demo paper
Detection, Diagnosis and Remediation for IT Incidents powered by Generative AI
Abstract
The rapidly increasing complexity of modern IT in multicloud environments poses unprecedented management challenges for Site Reliability Engineers (SREs), who must meet Service Level Objectives (SLOs) and keep systems up and running effectively. To put this in perspective, an availability SLO of 99.99% allows for about 4.3 minutes of downtime per month, hardly something that can be attained by simply reacting to incidents. In this demo, we introduce our approach to this challenge: transforming ITOps from reactive to proactive by leveraging large language models (LLMs) and advanced AI capabilities. The main goal of our work is to automate, as much as possible, the implementation of resolutions for upcoming IT issues before they turn into outages. Our demo consists of four steps: (1) Issue Detection, where we have developed an unsupervised methodology for detecting issues via an ensemble of various anomaly detectors. We compare our methods with the state-of-the-art techniques implemented in the Salesforce Merlion library; (2) Issue Diagnosis, where we have developed a language-model-based representation of log data and built an AI system for probable cause identification using novel causal analysis and reinforcement learning, complemented with LLM-based summarization techniques that make diagnosis results easier to consume for SREs and for downstream issue resolution analytics. We compare our methods with the state-of-the-art techniques implemented in the Salesforce PyRCA library; (3) Action Recommendation, which leverages state-of-the-art generative AI techniques to produce actionable recommendations; and (4) Automation, where action recommendation outputs are transformed into code that can be executed to resolve the incidents.
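The downtime figure quoted for a 99.99% availability SLO can be checked with a short calculation. The sketch below is illustrative only and is not part of the demo: the function name `downtime_budget_minutes` is hypothetical, and it assumes a 30-day month, which the abstract does not specify.

```python
def downtime_budget_minutes(slo_percent: float, days: float = 30.0) -> float:
    """Minutes of downtime permitted over `days` at the given availability SLO.

    Assumption: a "month" is taken to be 30 days unless `days` says otherwise.
    """
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    # Compare a few common availability targets side by side.
    for slo in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(slo)
        print(f"{slo}% availability -> {budget:.2f} minutes per 30-day month")
```

Under the 30-day assumption, 99.99% works out to 43,200 × 0.0001 = 4.32 minutes, consistent with the roughly 4.3 minutes cited above. Each additional "nine" shrinks the budget by a factor of ten.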