WOLFFI: A fault injection platform for learning AIOps models
Abstract
In today's IT environment with a growing number of costly outages, increasing complexity of the systems, and availability of massive operational data, there is a strengthening demand to effectively leverage Artificial Intelligence and Machine Learning (AI/ML) towards enhanced resiliency. In this paper, we present an automatic fault injection platform to enable and optimize the generation of data needed for building AI/ML models to support modern IT operations. The merits of our platform include the ease of use, the possibility to orchestrate complex fault scenarios and to optimize the data generation for the modeling task at hand. Specifically, we designed a fault injection service that (i) combines fault injection with data collection in a unified framework, (ii) supports hybrid and multi-cloud environments, and (iii) does not require programming skills for its use. Our current implementation covers the most common fault types both at the application and infrastructure levels. The platform also includes some AI capabilities. In particular, we demonstrate the interventional causal learning capability currently available in our platform. We show how our system is able to learn a model of error propagation in a micro-service application in a cloud environment (when the communication graph among micro-services is unknown and only logs are available) for use in subsequent applications such as fault localization.