PolyGraph: System for dynamic reduction of false alerts in large-scale IT service delivery environments

Sangkyum Kim; Winnie Cheng; Shang Guo; Laura Luan; Daniela Rosu; Abhijit Bose

USENIX ATC 2011

Conference paper

15 Jun 2011

Abstract

In order to avoid critical SLA violations, service providers use monitoring technology to automatically identify relevant events in the performance of managed components and forward them as incident tickets to be resolved by system administrators (SAs) before a critical failure occurs. For optimal cost and performance, monitoring policies must be finely tuned to the behavior of the managed components, so that SAs are not engaged to investigate false alerts. Existing approaches to tuning monitoring policies rely heavily on highly skilled SA work, with high costs and long completion times. PolyGraph is a novel architecture for automated tuning of monitoring policies to reduce false alerts. PolyGraph integrates multiple types of service management data into an active-learning approach to the automated generation of new monitoring policies. SAs need to be involved only in the verification of policies with low projected scores. Experiments with a trace of 60K monitoring events from a large IT service delivery infrastructure compare methods for threshold adjustment in alert policy predicates with respect to their potential for false alert reduction. Select methods reduce false alerts by up to 50% compared to baseline methods.
