Toward achieving operational excellence in a cloud

Salman Baset; Long Wang; Byungchul Tak; Cuong M. Pham; Chunqiang Tang

doi:10.1147/JRD.2014.2298927

IBM J. Res. Dev

Paper

01 Jan 2014

Abstract

A cloud pools resources such as compute, network, and storage and delivers them quickly and automatically on demand through software. In addition, it provides automatic, policy-driven management of those resources through software. Such a system comprises many components whose states change rapidly. To manage it effectively, cloud service providers need to clearly understand the behavior of operations across components and be able to fix errors as early as possible. Building such capabilities (referred to as operational excellence) in a cloud system is challenging because components maintain internal state and interact in non-intuitive ways to perform automated operations. In this paper, we discuss the concept of operational excellence for a cloud system, discuss the challenges in achieving it, and describe our vision. Toward our vision, we present a set of techniques to determine the causal sequences of system events across distributed components. We also model configured system states using causal sequences of system events, gather observed system states, and continuously verify the configured and observed states across system components. We apply these techniques to study OpenStack®, an open source infrastructure-as-a-service platform. © 2014 IBM.
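The idea of verifying configured states against observed states can be illustrated with a minimal sketch. It is not the paper's implementation: the `Event` record, the component and resource names, and the "last event per resource wins" replay rule are all assumptions made for illustration. The sketch derives a configured state from an already ordered sequence of events and reports every resource whose observed state differs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # All fields are hypothetical; the abstract does not define an event schema.
    component: str  # component that emitted the event
    resource: str   # resource the event refers to, e.g. a VM identifier
    state: str      # state the resource is expected to be in after the event


def configured_state(events):
    """Replay an ordered event sequence; the last event per resource wins."""
    state = {}
    for ev in events:
        state[ev.resource] = ev.state
    return state


def verify(configured, observed):
    """Return {resource: (configured, observed)} for every mismatch."""
    mismatches = {}
    for resource in configured.keys() | observed.keys():
        want = configured.get(resource)
        have = observed.get(resource)
        if want != have:
            mismatches[resource] = (want, have)
    return mismatches


# Assumed example data: an ordered event sequence and a snapshot of observed states.
events = [
    Event("api", "vm-1", "scheduled"),
    Event("scheduler", "vm-1", "active"),
    Event("api", "vm-2", "scheduled"),
]
observed = {"vm-1": "active", "vm-2": "error"}

drift = verify(configured_state(events), observed)
print(drift)
```

Run periodically against fresh snapshots, a check of this kind would surface the divergence on `vm-2` soon after it occurs. A real system would also need to establish the event ordering across components, which this sketch takes as given.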
