Publication
SSE 2024
Short paper

Efficient Incident Summarization in ITOps: Leveraging Entity-based Grouping

Abstract

Service management and monitoring are an essential part of the software industry. During an outage, all the relevant information must be gathered and presented, in the form of an incident, to the Site Reliability Engineer (SRE) for correct diagnosis. However, as the volume of incident data grows, it becomes difficult for the SRE to find the relevant information at a glance. In this paper, we address the problem of summarizing an incident and presenting the important facts clearly so that the SRE can take them in quickly. We group the information present in the incident and use a Large Language Model (LLM) to summarize each group separately. The grouping is based on the entities/resources affected by the incident. We also address the issue of LLM hallucination in this context and design our pipeline to minimize the generation of false symptoms or facts in the summary. Our proposed method is being incorporated as a real-time feature in one of the popular tools currently on the market.
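The grouping-then-summarizing idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's pipeline: the event fields `entity` and `message`, the function names, and the `stub_summarizer` placeholder (which stands in for an LLM call and only restates its input) are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_entity(events: Iterable[dict]) -> Dict[str, List[str]]:
    """Bucket event messages under the entity/resource they refer to.

    Assumes each event is a dict with hypothetical keys "entity" and "message".
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for event in events:
        groups[event["entity"]].append(event["message"])
    return dict(groups)


def summarize_incident(
    events: Iterable[dict],
    summarize: Callable[[str, List[str]], str],
) -> Dict[str, str]:
    """Summarize each entity group separately, one call per group."""
    return {
        entity: summarize(entity, messages)
        for entity, messages in group_by_entity(events).items()
    }


def stub_summarizer(entity: str, messages: List[str]) -> str:
    # Placeholder for an LLM call (assumption). It only restates facts
    # already present in the group, so it cannot introduce new symptoms.
    return f"{entity}: {len(messages)} event(s); first: {messages[0]}"


if __name__ == "__main__":
    incident = [
        {"entity": "db-1", "message": "connection pool exhausted"},
        {"entity": "api-gw", "message": "5xx rate above threshold"},
        {"entity": "db-1", "message": "replica lag high"},
    ]
    for entity, summary in summarize_incident(incident, stub_summarizer).items():
        print(summary)
```

Passing the summarizer in as a callable keeps the grouping step independent of any particular model, and giving each call only one entity's messages limits the context from which a summary can draw.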