Autoscaling for hadoop clusters
Anshul Gandhi, Sidhartha Thota, et al.
IC2E 2016
Problem diagnosis is one crucial aspect in the cloud operation that is becoming increasingly challenging. On the one hand, the volume of logs generated in today's cloud is overwhelmingly large. On the other hand, cloud architecture becomes more distributed and complex, which makes it more difficult to troubleshoot failures. In order to address these challenges, we have developed a tool, called LOGAN, that enables operators to quickly identify the log entries that potentially lead to the root cause of a problem. It constructs behavioral reference models from logs that represent the normal patterns. When problem occurs, our tool enables operators to inspect the divergence of current logs from the reference model and highlight logs likely to contain the hints to the root cause. To support these capabilities we have designed and developed several mechanisms. First, we developed log correlation algorithms using various IDs embedded in logs to help identify and isolate log entries that belong to the failed request. Second, we provide efficient log comparison to help understand the differences between different executions. Finally we designed mechanisms to highlight critical log entries that are likely to contain information pertaining to the root cause of the problem. We have implemented the proposed approach in a popular cloud management system, OpenStack, and through case studies, we demonstrate this tool can help operators perform problem diagnosis quickly and effectively.
Anshul Gandhi, Sidhartha Thota, et al.
IC2E 2016
Zelun Tony Zhang, Nick Von Felten, et al.
CHI 2026
Miriam Rateike, Brian Mboya, et al.
DLI 2025
Jung koo Kang
NeurIPS 2025