Semi-supervised identification of SARS-CoV-2 molecular targets
SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. In this work, we analyzed a corpus of 66,000 SARS-CoV-2 genome sequences. We developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on use of a single reference genome and by overcoming atypical genome traits. Using this method, we identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction compared to proteome references including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to other published tools such as Prokka (base) and VAPiD, we yielded an 6.4- and 1.8-fold increase in protein annotations. Our method generated 13,000,000 molecular target sequences— some conserved across time and geography while others represent emerging variants. We observed 3,362 non-redundant sequences per protein on average within this corpus and describe key D614G and N501Y variants spatiotemporally. For spike glycoprotein domains, we achieved greater than 97.9% sequence identity to references and characterized Receptor Binding Domain variants. Here, we comprehensively present the molecular targets to refine biomedical interventions for SARS-CoV-2 with a scalable high-accuracy method to analyze newly sequenced infections.