Using component interaction model and network traces for root-cause analysis
Root-cause analysis after a system failure or error is an important activity for determining the exact reasons for the failure. Often, the error conditions cannot be reproduced, or it is not feasible to run the system again under exactly the same scenario. An execution trace log of the functions and components, recorded during the event, is therefore essential for root-cause analysis and debugging in a complex system. Source-code-level instrumentation for dynamic analysis provides an accurate execution trace log, but an instrumented system is difficult to use in production environments because of performance and stability concerns. In a distributed system, intercepted network messages can be analyzed to identify interactions between the components of the system. However, messages captured on the network alone do not provide complete information, because messages exchanged between components on the same host do not appear on the network. We present a new approach that constructs interaction information among the components of a distributed application from messages captured on the network together with an interaction model, which is a set of rules and heuristics about component interaction. The interaction model is built offline, in advance, from profile information and the static control flow graph of the system. Profiling is done with test data in a non-production environment, such as a test environment, using 'close-to-real' test scenarios. Messages corresponding to component interactions are captured on the network to create a partial execution trace log, which is then completed using the pre-built interaction model.
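The completion step can be illustrated with a minimal sketch. It assumes, purely for illustration, that the interaction model reduces to a lookup table: each network-visible interaction maps to the same-host interactions expected to follow it. The component names, the `MODEL` table, and the `complete_trace` function are hypothetical and are not taken from the system described here. A real model would also carry heuristics derived from profiling and the control flow graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    caller: str
    callee: str
    observed: bool  # True if captured on the network, False if inferred


# Hypothetical interaction model: for each network-visible call
# (caller, callee), the same-host calls the callee is expected to make,
# in order. In practice this would be derived offline from profiling
# data and the static control flow graph.
MODEL = {
    ("web", "auth"): [("auth", "cache")],
    ("web", "orders"): [("orders", "validator"), ("orders", "db_client")],
}


def complete_trace(network_trace, model):
    """Expand a partial trace of network messages into a fuller trace
    by appending the local interactions the model predicts."""
    completed = []
    for caller, callee in network_trace:
        # The message itself was seen on the network.
        completed.append(Interaction(caller, callee, observed=True))
        # Same-host interactions are invisible on the network; infer them.
        for local_caller, local_callee in model.get((caller, callee), []):
            completed.append(Interaction(local_caller, local_callee, observed=False))
    return completed


# Partial trace built from captured network messages only.
partial = [("web", "auth"), ("web", "orders")]
full = complete_trace(partial, MODEL)
```

Marking each entry as observed or inferred keeps the distinction between captured evidence and model-based reconstruction visible during root-cause analysis.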