Investigating Genome Analysis Pipeline Performance on GATK with Cloud Object Storage
Abstract
Achieving fast, scalable, and cost-effective genome analytics is always important to open up a new frontier in biomedical and life science. Genome Analysis Toolkit (GATK), an industry-standard genome analysis tool, improves its scalability and performance by leveraging Spark and HDFS. Spark with HDFS has been a leading analytics platform in a past few years, however, the system cannot exploit full advantage of cloud elasticity in a recent modern cloud. In this paper we investigate performance characteristics of GATK using Spark with HDFS and identify scalability issues. Based on a quantitative analysis, we introduce a new approach to utilize Cloud Object Storage (COS) in GATK instead of HDFS, which can help decoupling compute and storage. We demonstrate how this approach can contribute to the improvement of the entire pipeline performance and cost saving. As a result, we demonstrate GATK with IBM COS can achieve up to 28% faster than GATK with HDFS. We also show that this approach can achieve up to 67 % cost saving in total, which includes the time for data loading and whole pipeline analysis.