Toward achieving operational excellence in a cloud
Abstract
A cloud pools resources such as compute, network, and storage, and delivers them quickly and automatically on demand through software. It also provides automatic, policy-driven management of those resources. Such a system comprises many components whose states change rapidly. To manage it effectively, cloud service providers need to clearly understand the behavior of operations across components and be able to fix errors as early as possible. Building such capabilities (referred to as operational excellence) in a cloud system is challenging because components maintain internal state and interact in non-intuitive ways to perform automated operations. In this paper, we discuss the concept of operational excellence for a cloud system, examine the challenges in achieving it, and describe our vision. Toward that vision, we present a set of techniques to determine the causal sequences of system events across distributed components. We also model configured system states using causal sequences of system events, gather observed system states, and continuously verify the configured states against the observed states across system components. We apply these techniques to study OpenStack®, an open source infrastructure-as-a-service platform. © 2014 IBM.
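The verification idea summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the event tuple layout, the request IDs, the state names, and the helper functions are all hypothetical, and the sketch assumes that every event carries a shared request identifier and a timestamp that is comparable across components. It also assumes each resource is touched by a single request, so replay order across requests does not matter.

```python
from collections import defaultdict

# Hypothetical event records, not an OpenStack log format:
# (request_id, timestamp, component, resource, new_state)
events = [
    ("req-1", 3, "compute", "vm-a", "ACTIVE"),
    ("req-1", 1, "api", "vm-a", "REQUESTED"),
    ("req-2", 2, "api", "vol-b", "REQUESTED"),
    ("req-1", 2, "scheduler", "vm-a", "SCHEDULED"),
    ("req-2", 4, "storage", "vol-b", "AVAILABLE"),
]


def causal_sequences(events):
    """Group events by request ID and order each group by timestamp.

    Assumption: timestamps are comparable across components.
    """
    by_request = defaultdict(list)
    for event in events:
        by_request[event[0]].append(event)
    return {rid: sorted(evs, key=lambda e: e[1]) for rid, evs in by_request.items()}


def configured_state(sequences):
    """Replay each sequence; the last event for a resource gives its configured state."""
    state = {}
    for sequence in sequences.values():
        for _rid, _ts, _component, resource, new_state in sequence:
            state[resource] = new_state
    return state


def verify(configured, observed):
    """Return {resource: (configured, observed)} for every resource that differs."""
    resources = configured.keys() | observed.keys()
    return {
        r: (configured.get(r), observed.get(r))
        for r in sorted(resources)
        if configured.get(r) != observed.get(r)
    }


sequences = causal_sequences(events)
configured = configured_state(sequences)

# Hypothetical observed states, e.g. as polled from each component.
observed = {"vm-a": "ACTIVE", "vol-b": "ERROR"}
mismatches = verify(configured, observed)
print(mismatches)
```

Running such a check continuously would mean repeating the `verify` step whenever new events or observations arrive; here the single mismatch reported is the volume whose observed state disagrees with the state implied by its event sequence.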