Improving reproducibility of data science pipelines through transparent provenance capture
Data science has become prevalent in a large variety of domains. Inherent in its practice is an exploratory, probing, and fact finding journey, which consists of the assembly, adaptation, and execution of complex data science pipelines. The trustworthiness of the results of such pipelines rests entirely on their ability to be reproduced with fidelity, which is difficult if pipelines are not documented or recorded minutely and consistently. This difficulty has led to a reproducibility crisis and presents a major obstacle to the safe adoption of the pipeline results in production environments. The crisis can be resolved if the provenance for each data science pipeline is captured transparently as pipelines are executed. However, due to the complexity of modern data science pipelines, transparently capturing sufficient provenance to allow for reproducibility is challenging. As a result, most existing systems require users to augment their code or use specific tools to capture provenance, which hinders productivity and results in a lack of adoption. In this paper, we present Ursprung,1 a transparent provenance collection system designed for data science environments.2 The Ursprung philosophy is to capture provenance and build lineage by integrating with the execution environment to automatically track static and runtime configuration parameters of data science pipelines. Rather than requiring data scientists to make changes to their code, Ursprung records basic provenance information from system-level sources and combines it with provenance from application-level sources (e.g., log files, stdout), which can be accessed and recorded through a domain-specific language. In our evaluation, we show that Ursprung is able to capture sufficient provenance for a variety of use cases and only adds an overhead of up to 4%.