Investigating Genome Analysis Pipeline Performance on GATK with Cloud Object Storage

Tatsuhiro Chiba; Takeshi Yoshimura

doi:10.1109/MASCOTS50786.2020.9285945

MASCOTS 2020

Conference paper

17 Nov 2020

Abstract

Achieving fast, scalable, and cost-effective genome analytics is essential to opening up new frontiers in biomedical and life sciences. The Genome Analysis Toolkit (GATK), an industry-standard genome analysis tool, improves its scalability and performance by leveraging Spark and HDFS. Spark with HDFS has been a leading analytics platform for the past few years; however, this combination cannot take full advantage of the elasticity that modern clouds offer. In this paper, we investigate the performance characteristics of GATK using Spark with HDFS and identify scalability issues. Based on a quantitative analysis, we introduce a new approach that uses Cloud Object Storage (COS) in GATK instead of HDFS, which helps decouple compute from storage. We demonstrate how this approach improves the performance of the entire pipeline and reduces cost. Our results show that GATK with IBM COS runs up to 28% faster than GATK with HDFS. We also show that this approach achieves up to 67% cost savings in total, counting both the time for data loading and the time for the whole pipeline analysis.