IM 2011
Conference paper
Towards efficient resource management for data-analytic platforms
Abstract
We present architectural and experimental work exploring the role of intermediate data handling in the performance of MapReduce workloads. Our findings show that (a) certain jobs are more sensitive to disk cache size than others, and (b) this sensitivity is mostly due to the local file I/O for the intermediate data. We also show that a small amount of memory is sufficient for map workers to hold their intermediate data until it is read. We introduce Hannibal, which exploits the modesty of that need in a simple and direct way: it holds the intermediate data in application-level memory for precisely the time it is needed, improving performance when the disk cache is stressed. We have implemented Hannibal and show through experimental evaluation that it makes MapReduce jobs run faster than Hadoop when little memory is available to the disk cache, providing better performance insulation between concurrent jobs. © 2011 IEEE.
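The core mechanism the abstract alludes to, keeping each map task's intermediate output in application-level memory and freeing it exactly when its reducer has fetched it rather than spilling it to local files, can be illustrated with a minimal sketch. Java is used here only because Hadoop is a Java system; the class and method names below are hypothetical, and this is not the Hannibal implementation from the paper.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch (requires Java 16+ for records): an in-memory store for
 * map-side intermediate data. Instead of writing map output to local files
 * (which competes for the OS disk cache), each partition is held in
 * application memory and released the moment its single consumer reads it,
 * so the store's footprint is bounded by the data not yet fetched.
 */
public class InMemoryIntermediateStore {

    /** Identifies one map task's output for one reduce partition. */
    private record PartitionKey(int mapTaskId, int reducePartition) {}

    private final Map<PartitionKey, byte[]> buffers = new ConcurrentHashMap<>();

    /** Called by a map worker when it finishes one partition of its output. */
    public void put(int mapTaskId, int reducePartition, byte[] data) {
        buffers.put(new PartitionKey(mapTaskId, reducePartition), data);
    }

    /**
     * Called by a reduce worker fetching its partition. Removing the entry
     * on read frees the memory as soon as it has been consumed; returns
     * null if the map task has not produced this partition yet.
     */
    public byte[] fetchAndRelease(int mapTaskId, int reducePartition) {
        return buffers.remove(new PartitionKey(mapTaskId, reducePartition));
    }
}
```

The remove-on-read design is the point of the sketch: because each piece of intermediate data in MapReduce has exactly one consumer, holding it any longer than the fetch wastes memory, and releasing it immediately keeps the job's memory demand modest even when the disk cache is under pressure from concurrent jobs.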