SPARKBENCH: A comprehensive benchmarking suite for in-memory data analytic platform Spark
Abstract
Spark has been increasingly adopted by industry in recent years for big data analysis because it provides a fault-tolerant, scalable and easy-to-use in-memory abstraction. Moreover, the community has been actively developing a rich ecosystem around Spark, making it even more attractive. However, the literature does not yet offer a Spark-specific benchmark to guide the development and cluster deployment of Spark so that it better fits the resource demands of user applications. In this paper, we present SPARKBENCH, a Spark-specific benchmarking suite that includes a comprehensive set of applications. SPARKBENCH covers four main categories of applications: machine learning, graph computation, SQL query and streaming applications. We also characterize the resource consumption, data flow and timing information of each application, and evaluate the performance impact of a key configuration parameter to guide the design and optimization of the Spark data analytic platform.