Discovering Proteins: Function to Name
Approximately half of all microbial proteins are currently tagged as putative or hypothetical and lack functional annotation, which reduces our understanding of biological function at the genome level and limits the classification of microorganisms, especially pathogens. Here, we develop an approach to functionally annotate hypothetical proteins using over 50 million named proteins and 27K functional codes (InterProScan domain codes). We train three separate models, using Kraken, to perform functional annotation at the domain, family, and superfamily levels. Furthermore, we construct a functional space to visualize these proteins and biologically validate the results, while also enabling the discovery of potentially new proteins and their functions. Most interestingly, this high-dimensional functional space will facilitate the shift from genotype to phenotype for named proteins. Leveraging this space, we identify function-based clusters; if new clusters form as a result of improved annotation of hypothetical proteins, we may uncover and understand evolutionary paths shared with known proteins. Our work uses data from our Functional Genomics Platform, which holds over 300K prokaryotic genomes, 75 million gene sequences, 55 million protein sequences, and over 260 million functional domains.
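To make the idea of function-based clusters concrete, the following is a minimal sketch, not the method used in this work. It assumes each protein is represented as a set of functional codes, measures similarity between proteins with the Jaccard distance, and groups them by single-linkage clustering. The protein names, the placeholder codes (`D1`, `D2`, ...), the `max_distance` threshold, and both function names are hypothetical and chosen only for illustration; the Kraken-based annotation models are not reproduced here.

```python
from itertools import combinations


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_by_function(proteins: dict, max_distance: float = 0.5) -> list:
    """Single-linkage clustering over functional-code sets.

    Two proteins are linked when their Jaccard distance is <= max_distance;
    clusters are the connected components of the resulting graph.
    """
    parent = {name: name for name in proteins}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, q in combinations(proteins, 2):
        if jaccard_distance(proteins[p], proteins[q]) <= max_distance:
            parent[find(p)] = find(q)

    groups = {}
    for name in proteins:
        groups.setdefault(find(name), set()).add(name)
    # Sort members and clusters so the output is deterministic.
    return sorted(sorted(members) for members in groups.values())


# Toy data (assumption): placeholder codes, not real InterPro identifiers.
proteins = {
    "named_A": frozenset({"D1", "D2", "D3"}),
    "named_B": frozenset({"D1", "D2"}),
    "hypothetical_X": frozenset({"D2", "D3"}),
    "named_C": frozenset({"D7", "D8"}),
    "hypothetical_Y": frozenset({"D9"}),
}

clusters = cluster_by_function(proteins)
for members in clusters:
    print(members)
```

In this toy example, `hypothetical_X` joins the cluster containing `named_A` and `named_B` because it shares two of three codes with `named_A`, while `hypothetical_Y` forms a cluster of its own, which is the kind of new cluster the abstract describes as a candidate for further study.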