About cookies on this site Our websites require some cookies to function properly (required). In addition, other cookies may be used with your consent to analyze site usage, improve the user experience and for advertising. For more information, please review your options. By visiting our website, you agree to our processing of information as described in IBM’sprivacy statement. To provide a smooth navigation, your cookie preferences will be shared across the IBM web domains listed here.
Publication
ICSOC 2020
Workshop paper
Localization of Operational Faults in Cloud Applications by Mining Causal Dependencies in Logs Using Golden Signals
Abstract
Cloud based microservice architecture has become a powerful mechanism in helping organizations to scale operations by accelerating the pace of change at minimal cost. With cloud based applications being accessed from diverse geographies, there is a need for round-the-clock monitoring of faults to prevent or to limit the impact of outages. Pinpointing source(s) of faults in cloud applications is a challenging problem due to complex interdependencies between applications, middleware, and hardware infrastructure all of which may be subject to frequent and dynamic updates. In this paper, we propose a light-weight fault localization technique, which can reduce human effort and dependency on domain knowledge for localizing observable operational faults. We model multivariate error-rate time series using minimal runtime logs to infer causal relationship among the golden signal errors (error rates) and micro-service errors to discover ranked list of possible faulty components. Our experimental results show that our system can localize operational faults with high accuracy (F1 = 88.4%) underscoring the effectiveness of using golden signal error rates in fault localization.