Optimization of Genomics Analysis Pipeline for Scalable Performance in a Cloud Environment
Abstract
Cost-effective and scalable analysis of the human genome is crucial for the democratization of precision medicine. The new version of the Genome Analysis Toolkit (GATK4), an industry-standard end-to-end tool for variant discovery in next-generation sequencing (NGS) data, introduces Apache Spark support to improve scaling for both local multithreading and cluster-wide parallelization, and to facilitate deployment on cloud infrastructure. In this paper, we evaluate the performance and scalability of GATK4-Spark running on a next-generation cloud platform. After identifying bottlenecks and scaling challenges, we improve the software stack through an optimized JVM, enhancements to Spark, and targeted configuration tuning, which together enable more effective use of the underlying computing resources. We demonstrate the effectiveness of these optimizations on a reference single nucleotide polymorphism (SNP) pipeline, achieving a computation time of ≤1 hr for whole human genome analysis.