Visualizing jobs with shared resources in distributed environments
Abstract
In this paper we describe a visualization system that shows the behavior of jobs in large, distributed computing clusters. The system has been in use for two years, and is sufficiently generic to be applied in two quite different domains: a Hadoop MapReduce environment and the Watson DeepQA DUCC cluster. Scalable and flexible data processing systems typically run hundreds or more of simultaneous jobs. The creation, termination, expansion and contraction of these jobs can be very dynamic and transient, and it is difficult to understand this behavior without showing its evolution over time. While traditional monitoring tools typically show either snapshots of the current load balancing or aggregate trends over time, our new visualization technique shows the behavior of each of the jobs over time in the context of the cluster, and in either a real-time or post-mortem view. Its new algorithm runs in realtime mode and can make retroactive adjustments to produce smooth layouts. Moreover, our system allows users to drill down to see details about individual jobs. The visualization has been proven useful for administrators to see the overall occupancy, trends and job allocations in the cluster, and for users to spot errors or to monitor how many resources are given to their jobs. © 2013 IEEE.