IBM Functional Genomics Platform, A Cloud-Based Platform for Studying Microbial Life at Scale

The rapid growth in biological sequence data is revolutionizing our understanding of genotypic diversity and challenging conventional approaches to informatics. As the volume of available genomic data grows, traditional bioinformatic tools require substantial computational time and ever-larger indices each time a researcher seeks to gain insight from the data. To address this, we pre-computed important relationships between biological entities spanning the Central Dogma of Molecular Biology and captured this information in a relational database. The database can be queried across hundreds of millions of entities and returns results in a fraction of the time required by traditional methods. We describe IBM Functional Genomics Platform, a comprehensive database relating genotype to phenotype for bacterial life. Continually updated, the platform contains data derived from 200,000 curated, self-consistently assembled genomes. The database stores functional data for over 68 million genes, 52 million proteins, and 239 million domains, with associated biological activity annotations from Gene Ontology, KEGG, MetaCyc, and Reactome. It maps the connections among these biological entities: the originating genome, gene, protein, and protein domain. We describe the data selection, the pipeline used to create and update the database, and the developer tools.
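To make the idea of pre-computed relationships concrete, the following is a minimal sketch of how genome, gene, protein, and domain entities could be linked in a relational store and queried with joins rather than by re-running sequence analysis. It is an illustration only: the table names, column names, helper functions (`build_toy_db`, `genomes_with_domain`), and the toy identifiers are all hypothetical and are not the platform's actual schema or API.

```python
import sqlite3

# Hypothetical, simplified schema. Table and column names are illustrative
# assumptions, not the platform's real schema.
SCHEMA = """
CREATE TABLE genome  (genome_id  TEXT PRIMARY KEY);
CREATE TABLE gene    (gene_id    TEXT PRIMARY KEY);
CREATE TABLE protein (protein_id TEXT PRIMARY KEY);
CREATE TABLE domain  (domain_id  TEXT PRIMARY KEY);
-- Link tables hold the pre-computed relationships between entities.
CREATE TABLE genome_gene    (genome_id  TEXT, gene_id    TEXT);
CREATE TABLE gene_protein   (gene_id    TEXT, protein_id TEXT);
CREATE TABLE protein_domain (protein_id TEXT, domain_id  TEXT);
"""


def build_toy_db():
    """Create an in-memory database populated with made-up identifiers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO genome VALUES (?)", [("G1",), ("G2",)])
    conn.executemany("INSERT INTO gene VALUES (?)", [("g1",), ("g2",), ("g3",)])
    conn.executemany("INSERT INTO protein VALUES (?)", [("p1",), ("p2",)])
    conn.executemany("INSERT INTO domain VALUES (?)", [("d1",), ("d2",)])
    conn.executemany(
        "INSERT INTO genome_gene VALUES (?, ?)",
        [("G1", "g1"), ("G1", "g2"), ("G2", "g3")],
    )
    # The same protein sequence (p1) is encoded by genes in two genomes.
    conn.executemany(
        "INSERT INTO gene_protein VALUES (?, ?)",
        [("g1", "p1"), ("g2", "p2"), ("g3", "p1")],
    )
    conn.executemany(
        "INSERT INTO protein_domain VALUES (?, ?)",
        [("p1", "d1"), ("p2", "d2")],
    )
    conn.commit()
    return conn


def genomes_with_domain(conn, domain_id):
    """Walk domain -> protein -> gene -> genome using the stored links."""
    rows = conn.execute(
        """
        SELECT DISTINCT gg.genome_id
        FROM protein_domain AS pd
        JOIN gene_protein AS gp ON gp.protein_id = pd.protein_id
        JOIN genome_gene  AS gg ON gg.gene_id = gp.gene_id
        WHERE pd.domain_id = ?
        ORDER BY gg.genome_id
        """,
        (domain_id,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_toy_db()
    print(genomes_with_domain(conn, "d1"))
```

In this toy layout, a question such as "which genomes carry a given domain" reduces to a lookup across link tables. A production system at the scale described would also need indexes on the link columns, which this sketch omits.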