Provenance in Context of Hadoop as a Service (HaaS)-State of the Art and Research Directions

Himanshu Gupta; Sameep Mehta; Sandeep Hans; Bapi Chatterjee; Pranay Lohia; C. Rajmohan

doi:10.1109/CLOUD.2017.91

CLOUD 2017

Conference paper

08 Sep 2017

Abstract

Hadoop as a service (HaaS), also known as Hadoop in the cloud, is a big data analytics framework that stores and analyzes data in the cloud using Hadoop/Spark. In this paper, we discuss the importance of providing provenance capabilities in the context of the HaaS framework. We first review the state of the art in provenance tracking in three settings: databases and workflow processing, the cloud, and big data analytics frameworks such as Hadoop and Spark. We next identify a number of provenance capabilities that have been developed for databases and workflow processing but for which no corresponding solutions exist for Hadoop or Spark. We argue that developing these solutions is important so that a comprehensive, provenance-aware HaaS can be provided in the cloud. The paper ends by identifying some research challenges in developing these provenance capabilities.
