Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures
Abstract
Motivated by the high system complexity of today's datacenters, a large body of related studies tries to understand workloads and resource utilization in datacenters. However, there is little work on exploring unsuccessful job and task executions. In this paper, we study three types of unsuccessful executions in traces of a Google datacenter, namely fail, kill, and eviction. The objective of our analysis is to identify their resource waste, impacts on application performance, and root causes. We first quantitatively show their strong negative impact on CPU, RAM, and DISK usage and on task slowdown. We analyze patterns of unsuccessful jobs and tasks, particularly focusing on their interdependency. Moreover, we uncover their root causes by inspecting key workload and system attributes such as machine locality and concurrency level. Our results help in the design of low-latency and fault-tolerant big-data systems.