Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024
Cloud-based microservice architectures have become a powerful mechanism for helping organizations scale operations by accelerating the pace of change at minimal cost. Because cloud-based applications are accessed from diverse geographies, faults must be monitored around the clock to prevent outages or limit their impact. Pinpointing the source(s) of faults in cloud applications is challenging because of complex interdependencies among applications, middleware, and hardware infrastructure, all of which may be subject to frequent and dynamic updates. In this paper, we propose a lightweight fault localization technique that reduces human effort and dependency on domain knowledge when localizing observable operational faults. We model multivariate error-rate time series using minimal runtime logs to infer causal relationships among the golden signal errors (error rates) and microservice errors, and from these relationships derive a ranked list of possible faulty components. Our experimental results show that our system can localize operational faults with high accuracy (F1 = 88.4%), underscoring the effectiveness of using golden signal error rates in fault localization.
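The idea of ranking components by how strongly their error rates drive a golden-signal error rate can be illustrated with a small sketch. The abstract does not specify the causal inference procedure, so the code below substitutes a simple Granger-style lagged regression as a stand-in; it is not the authors' method. The service names (`checkout`, `catalog`), the synthetic data, and the helper functions are all hypothetical and chosen only for illustration.

```python
import numpy as np

def lagged_design(series, lag):
    """Columns series[t-1], ..., series[t-lag] for t = lag .. T-1."""
    T = len(series)
    return np.column_stack([series[lag - k : T - k] for k in range(1, lag + 1)])

def rss(X, y):
    """Residual sum of squares of the least-squares fit y ~ X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.sum((y - X @ coef) ** 2))

def granger_score(cause, effect, lag=2):
    """Relative drop in residual error when lagged `cause` values are
    added to an autoregression of `effect` (a Granger-style proxy)."""
    y = effect[lag:]
    restricted = np.hstack([np.ones((len(y), 1)), lagged_design(effect, lag)])
    full = np.hstack([restricted, lagged_design(cause, lag)])
    rss_r, rss_f = rss(restricted, y), rss(full, y)
    return (rss_r - rss_f) / rss_r if rss_r > 0 else 0.0

def rank_services(service_errors, golden_signal, lag=2):
    """Return (service, score) pairs, most likely culprit first."""
    g = np.asarray(golden_signal, dtype=float)
    scores = {
        name: granger_score(np.asarray(s, dtype=float), g, lag)
        for name, s in service_errors.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Synthetic example (assumption): the golden-signal error rate follows
# the hypothetical "checkout" service with a one-step delay.
rng = np.random.default_rng(0)
T = 300
checkout = rng.random(T)
catalog = rng.random(T)
golden = np.zeros(T)
golden[1:] = 0.8 * checkout[:-1] + 0.05 * rng.random(T - 1)

ranking = rank_services({"checkout": checkout, "catalog": catalog}, golden)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the service whose past error rates best predict the golden signal is ranked first. A real deployment would derive the error-rate series from runtime logs and would likely need significance testing and handling of confounders, neither of which this sketch attempts.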
Deming Chen, Alaa Youssef, et al.
arXiv
Jose Manuel Bernabé Murcia, Eduardo Canovas Martinez, et al.
MobiSec 2024
Sahil Suneja, Yufan Zhuang, et al.
ACM TOSEM