Publication
COMSNET 2024
Demo paper

A Chaos Recommendation Tool for Reliability Testing in Large-Scale Cloud-Native Systems

Abstract

With the proliferation of cloud-native systems built on container technology and the widespread deployment of 5G and edge use cases, modern applications have become increasingly distributed and complex, often consisting of hundreds of components. Ensuring the reliability of these workloads has become correspondingly harder, and the continuous evolution of systems under CI/CD practices complicates it further. In this context, Chaos Engineering can play a crucial role in assessing the reliability of these large-scale systems by intentionally introducing adverse conditions and gauging resilience in interconnected environments. This controlled approach enables organizations to identify and learn from potential failure points before they escalate into service degradation and production outages. Yet the effectiveness of chaos testing hinges on the relevance of the targeted fault scenarios, and in practice faults are often chosen arbitrarily or by intuition, leading to inefficiencies and suboptimal outcomes. To address these challenges, we have developed a chaos recommendation tool. The tool assesses the real-time behavior and characteristics of workloads and suggests fault injections that can cause disruptions. In this demo, we illustrate how the chaos recommendation tool can be used to automatically identify potential failure points in a system and suggest corresponding chaos test cases. The tool, part of Red Hat's Chaos Engineering project Kraken, is open source and available at: https://github.com/redhat-chaos/krkn/blob/main/utils/chaos_recommender/README.md
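
As a rough illustration of the idea of turning observed workload behavior into fault suggestions, the following Python sketch flags services whose resource usage is close to their limits and proposes a matching fault type. It is a minimal sketch under stated assumptions, not the tool's actual method: the `ServiceMetrics` type, the `recommend_faults` function, the fault names, and the 0.7 threshold are all hypothetical and do not come from Kraken.

```python
from dataclasses import dataclass


@dataclass
class ServiceMetrics:
    """Observed usage per service, as a fraction of its configured limit (0.0 to 1.0).

    Hypothetical type for illustration; not part of Kraken.
    """
    name: str
    cpu: float
    memory: float
    network: float


# Hypothetical mapping from a stressed resource to a fault worth injecting.
FAULT_FOR_RESOURCE = {
    "cpu": "cpu-hog",
    "memory": "memory-hog",
    "network": "network-latency",
}


def recommend_faults(metrics, threshold=0.7):
    """Return (service, fault, usage) tuples for resources at or above the threshold.

    The threshold is an assumed value; results are ordered by usage, highest first.
    """
    recommendations = []
    for svc in metrics:
        for resource, fault in FAULT_FOR_RESOURCE.items():
            usage = getattr(svc, resource)
            if usage >= threshold:
                recommendations.append((svc.name, fault, usage))
    recommendations.sort(key=lambda rec: rec[2], reverse=True)
    return recommendations


if __name__ == "__main__":
    observed = [
        ServiceMetrics("cart", cpu=0.9, memory=0.2, network=0.1),
        ServiceMetrics("checkout", cpu=0.3, memory=0.8, network=0.75),
    ]
    for service, fault, usage in recommend_faults(observed):
        print(f"{service}: consider {fault} (usage {usage:.0%})")
```

The design choice here is deliberately simple: a service already running near a resource limit is assumed to be the most likely to degrade when that resource is stressed, so it is ranked first as a chaos test candidate.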