Rapidly falling costs of genome sequencing and the availability of commercial sequencing services have been major enablers of the ever-increasing adoption of multi-omics in the life sciences. Modern life sciences are quickly becoming an intensely data-driven discipline whose success depends on tight interdisciplinary collaboration with the computational sciences. At IBM Research – UK, our focus is on developing computational tools and methods that fall broadly under three main themes:
- Availability of scalable data-centric compute infrastructure,
- Distributed specialized bioinformatics workflows, and
- Algorithm development using AI and machine learning to derive novel insights from genomics and related heterogeneous datasets.
We work in close collaboration with industrial and academic partners over a range of biological areas of interest, including bioinformatics, metagenomics, drug toxicity and plant genomics.
Tackling the data deluge in genomics through powerful, scalable infrastructure
Biology is a Big Data discipline, driven largely by the advancements in instrumentation that produce vast quantities of data. The key technologies are collectively known as “omics,” consisting of genomics, transcriptomics, proteomics, metabolomics and several imaging techniques. Researchers often use a combination — or variations — of omics techniques (“multi-omics”) to understand biological systems in a comprehensive manner. Each of these omics techniques can generate data of the order of hundreds of gigabytes per experiment. The data magnitude scales vastly in multi-omics studies.
Although the omics revolution gives us a magnifying glass to examine biology at ever-finer resolution, its success depends largely on the underlying data-management techniques. By identifying the exact compute requirements for the scientific goals of a genomics analysis, and by building the missing pieces, we at IBM Research are creating a framework for bioinformatics as a service (BaaS) that uses container and pipelining technology deployed on elastic clouds, together with relevant IBM proprietary and open-source technologies. We expect this approach to enable highly scalable data-processing pipelines that can handle extremely large datasets in parallel and in significantly less time.
Example of distributed containerized bioinformatics workflow
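As an illustration of the kind of parallelism such a framework exploits, the sketch below fans independent samples out to concurrent workers, much as a cloud deployment would fan them out to containers. The pipeline step and sample names are hypothetical; a real deployment would launch a containerized tool rather than a local function.

```python
import concurrent.futures

# Hypothetical per-sample pipeline step. In a real BaaS deployment this would
# launch a containerized tool, e.g. via
#   subprocess.run(["docker", "run", "aligner:latest", sample], check=True)
# Here it is simulated so the sketch is self-contained.
def process_sample(sample: str) -> str:
    return f"{sample}:aligned"

def run_pipeline(samples):
    # Samples are independent, so they can be dispatched in parallel;
    # on a cloud back end each worker would be a separate container.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(process_sample, samples))

print(run_pipeline(["s1", "s2", "s3"]))
```

Because `pool.map` preserves input order, downstream steps can rely on results lining up with the submitted samples.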
Human health and food safety
- Understanding the role of microbiota in human health and disease development.
- Comprehensively analysing, comparing and clustering all genes in all organisms in a given complex microbiome sample.
Soil is the most biodiverse environment on Earth: up to 10 billion bacterial cells are estimated to reside in a single gram of soil. Yet we have very little understanding of the microbial populations that are essential to maintaining soil health. Through large-scale metagenomics analysis, we are trying to understand which microbial populations are present in different soil samples and how their concentrations affect soil quality. Furthermore, we are trying to understand soil as a living system, in which the characteristics of soil are defined by the interplay between its inhabitant species.
For this study, our effort is divided into three categories:
- metagenomic data processing,
- network biology of soil, and
- machine learning models for soil research.
Distributed data processing is of particular significance for metagenomics analysis because the volume of microbial genomic data is several times greater than that of whole-genome sequencing data. Through this project, we have ported many bioinformatics tools that previously took days to process a single experimental sample; they can now run in parallel and finish within a few hours. We have also applied several network-based methods to understand the differential functional patterns observed across diverse kinds of soil samples, as well as events of horizontal gene transfer in soil-dwelling bacterial populations.
Example of functionally connected gene orthologs from soil metagenomics dataset
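One such network-based step can be sketched minimally: connect gene orthologs that co-occur across soil samples, so that densely connected orthologs suggest shared function. The sample and ortholog identifiers below are illustrative, not real data.

```python
from itertools import combinations

# Toy presence table: which gene orthologs were detected in which soil samples.
# Real input would come from a metagenomic annotation pipeline.
samples = {
    "grassland": {"K00001", "K00002", "K00003"},
    "forest":    {"K00001", "K00002"},
    "arable":    {"K00002", "K00003"},
}

def cooccurrence_network(samples, min_shared=2):
    """Connect two orthologs if they co-occur in at least `min_shared` samples."""
    counts = {}
    for present in samples.values():
        for a, b in combinations(sorted(present), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return {edge for edge, n in counts.items() if n >= min_shared}

print(cooccurrence_network(samples))
```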
Metagenomics, the study of the genomic diversity of microbes, is increasingly used in food safety, environmental studies, human and animal health. Recent advances in high-throughput sequencing technologies have enabled the characterization and comparison of microbial communities in very diverse environments. One of the major research challenges is gaining insight into the function, structure and organisation of microbial communities. For example, characterising the composition and activity of metagenomes across different individuals (healthy and diseased subjects) is important to understand the role of microbiota in disease development. Can we predict from a gut or skin microbiome whether someone has a disease, a predisposition, or even how progressed the disease is?
To unlock this opportunity, we are creating an AI analytical framework linking microbiome composition to health, in order to better understand how the microbiome might be manipulated. The approach uses data-driven machine learning to build predictive models from high-dimensional, sparse matrices summarising the relative abundances of microbial taxa in samples. By analysing microbiome data across human, mouse and environmental samples and applying RoDEO (Robust Differential Gene Expression) combined with machine-learning methods, we have been able to accurately predict phenotypes or traits of host organisms. Our approach has the potential to facilitate disease diagnosis and improve the future personalisation of medicine.
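A minimal sketch of the data preparation behind such models, on illustrative counts: convert raw taxon counts to relative abundances, then replace magnitudes with ranks, in the spirit of (though far simpler than) the RoDEO projection, which makes samples from different platforms more comparable.

```python
def relative_abundance(counts):
    """Normalise raw taxon counts to per-sample relative abundances."""
    total = sum(counts)
    return [c / total for c in counts]

def rank_transform(values):
    # Replace noisy abundance magnitudes with their ranks (1 = smallest),
    # a robust representation for downstream machine learning.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

print(rank_transform(relative_abundance([2, 2, 6])))
```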
Current taxonomic classification methods focus on sequencing specific marker genes, such as 16S rRNA, and rely on existing microbial reference databases, which are often incomplete. A more informative method is whole-metagenome shotgun sequencing, which generates huge collections of short reads. The need to analyse, assemble or align metagenomics reads makes whole-metagenome analysis both data and computation-intensive. We propose a new method for rapidly creating a compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time.
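To make the idea of a similarity-preserving sketch concrete, the snippet below builds a bottom-k MinHash of a k-mer set, whose overlap with another sketch estimates the Jaccard similarity of the full spectra. This is a deliberately simplified stand-in for the streaming histogram sketches described above, which additionally account for k-mer frequencies.

```python
import hashlib

def kmers(seq, k=4):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_sketch(kmer_set, size=64):
    # Keep the `size` smallest hash values: a fixed-size, similarity-preserving
    # summary of an arbitrarily large k-mer set.
    hashes = sorted(int(hashlib.sha1(km.encode()).hexdigest(), 16) for km in kmer_set)
    return set(hashes[:size])

def jaccard_estimate(sketch_a, sketch_b, size=64):
    # The fraction of the smallest union hashes present in both sketches
    # estimates the Jaccard similarity of the underlying k-mer sets.
    merged = sorted(sketch_a | sketch_b)[:size]
    return sum(1 for h in merged if h in sketch_a and h in sketch_b) / len(merged)
```

Because sketches are small and fixed-size, comparing a new sample against a large catalogue of pre-computed sketches is fast enough for near-real-time classification.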
We are also applying artificial intelligence to build predictive models that can provide insights for cutting-edge applications such as guiding diagnostics and developing personalised treatments. Current machine-learning workflows that predict traits of host organisms from their commensal microbiome do not take into account the entire genetic material constituting the microbiome, instead basing the analysis on specific marker genes. We are developing a machine-learning workflow that efficiently performs host phenotype prediction from entire shotgun metagenomes by computing similarity-preserving compact representations of the genetic material using histosketching. Our workflow enables prediction tasks such as classification and regression from terabytes of raw sequencing data, without pre-processing through expensive bioinformatics pipelines.
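The starting point of such a representation can be sketched in a few lines: stream k-mers directly from raw reads and hash them into a fixed number of histogram buckets. This is a much-simplified stand-in for histosketching (which further compresses the histogram while preserving similarity); the hash function `zlib.crc32` is chosen here only for illustration.

```python
import zlib

def hashed_kmer_histogram(reads, k=4, buckets=16):
    # Stream k-mers from raw reads into a fixed-size hashed histogram.
    # The output size is independent of the input volume, so terabytes of
    # reads never need to pass through assembly or alignment first.
    hist = [0] * buckets
    for read in reads:
        for i in range(len(read) - k + 1):
            hist[zlib.crc32(read[i:i + k].encode()) % buckets] += 1
    return hist
```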
Assessing the toxic potential of a molecule is an important aspect of molecule prioritisation for drug discovery. Gene-expression datasets, usually obtained from microarray and RNA-seq platforms, play an increasingly important role in understanding the toxicity caused by a molecule. This interdisciplinary endeavour, which applies traditional genomics techniques to understand toxicity, has spawned a subdiscipline of its own, known as toxicogenomics. With the arrival of new and inexpensive gene-expression profiling methods, toxicogenomics is set to grow rapidly, with a focus on discovering small-molecule mechanisms of action and annotating genetic variants.
In this project, we plan to understand the toxicity caused by a set of chemical compounds in the context of affected biological pathways and associated phenotypes. Our collaborators have produced a transcriptomics dataset under different perturbation conditions for a fixed number of chemical compounds, which needs to be analysed to answer two central questions:
- What is the mode of action for a chemical to be associated with a phenotype?
- Is there a way to predict the toxicity potential of a chemical by studying the experimental dataset in conjunction with existing large knowledge bases available in the public domain?
We are taking a highly interdisciplinary approach in which knowledge from genomics (transcriptomics in particular) is combined with information from the scientific literature using a variety of computational techniques, ranging from traditional bioinformatics and network biology to machine learning and natural language processing. Owing to the large and diverse volumes of data to be processed and searched, we aim to leverage Big Data platforms for data storage and integration. We intend to bring together two approaches to AI, knowledge engineering and statistical learning, to work in conjunction with traditional bioinformatics and create a unified computational solution for understanding toxicity as captured by the gene-expression dataset. This project aims to demonstrate how Big Data, AI and traditional bioinformatics can be combined to tackle an important question in the life sciences.
Differentially expressed gene network derived from a gene expression dataset
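As a minimal illustration of the first analysis step behind such a network, the sketch below flags differentially expressed genes by log2 fold change on toy values; a real analysis would add replicate handling and statistical testing.

```python
import math

def log2_fold_changes(treated, control):
    """Per-gene log2 fold change between treated and control expression."""
    return {g: math.log2(treated[g] / control[g]) for g in control}

def differentially_expressed(treated, control, threshold=1.0):
    # Keep genes whose expression at least doubles or halves under
    # perturbation (|log2 fold change| >= 1 by default).
    fc = log2_fold_changes(treated, control)
    return {g for g, v in fc.items() if abs(v) >= threshold}
```

The resulting gene set would then seed network construction, e.g. by linking differentially expressed genes through known pathway membership.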
Building core regulatory gene networks in hexaploid bread wheat
Wheat is the most widely cultivated grain for human consumption. To keep pace with population growth and environmental change, it is vital to increase and sustain crop yield. There have been sustained efforts over many years to sequence the wheat genome in order to better understand the molecular mechanisms responsible for key agricultural traits. Although the recently completed bread wheat genome gives us insight into its structure and composition in terms of genes and transcripts, the biological functions arising from the expression of transcripts and the interaction of transcriptome networks remain largely unknown. Transcriptional networks drive the biology of an organism, so understanding their architecture is key to the genetic design of future crops. The wheat genome is about five times the size of the human genome and contains more than 100,000 genes. To understand wheat as a system, one needs to understand the interplay between genes through a combination of experimental data, prior knowledge, and sophisticated data management and computational techniques.
In this project, we aim to create a computational framework around the large-scale transcriptomics datasets to understand key regulatory mechanisms in the wheat genome. In particular, we are focused on the biological process of circadian regulation that has been found to underpin many key agronomic traits, including flowering time, dormancy, water use efficiency, pathogen interaction, nitrogen metabolism and carbon partitioning. Given the complexity of the wheat genome and the lack of community-curated information, the methods developed and scientific insights gained from this project can lead to a better understanding of important biomarkers and molecular traits that can be leveraged by commercial plant breeders to develop improved plant varieties.
Hierarchical clustering derived from a temporal gene expression dataset
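A minimal version of such a clustering, on illustrative profiles: single-linkage agglomerative clustering over Euclidean distances between temporal expression vectors, grouping genes with similar time-course behaviour.

```python
def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def agglomerative(profiles, n_clusters=2):
    # Single-linkage agglomerative clustering: repeatedly merge the two
    # closest clusters until only `n_clusters` remain.
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

Genes clustering together across a circadian time course are candidate members of the same regulatory module.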
Graph-based representation of antimicrobial peptides
Antimicrobial peptides are a unique and diverse group of molecules, divided into subgroups on the basis of their amino-acid composition and structure, that have been demonstrated to kill bacteria, viruses and fungi, and even to transform cancerous cells. With antimicrobial resistance a growing threat to global health, there is a pressing need to discover new antimicrobial peptides. High-throughput simulation and machine learning, as well as data analysis and representation, can help accelerate the discovery process. Because a large number of protein and peptide sequences annotated with a range of information and properties are available for analysis in public databases (such as UniProt, InterPro and CAMPR3), we want to explore these datasets from a genomic perspective and cluster sequences that share functionality.
In support of this activity, we are developing a k-mer-based framework for the clustering, graph representation and visualisation of amino-acid sequences, more precisely antimicrobial peptides, based on their functionalities, properties and structural features. The tool can provide insights into the data by extracting antimicrobial signals from sequences, and can offer inspiration in the process of discovering novel antimicrobial peptides.
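The core idea can be sketched as follows: represent each peptide by its set of amino-acid k-mers and draw a graph edge between peptides sharing at least a given number of k-mers, so that densely connected groups suggest shared motifs. The peptide names and sequences below are illustrative only.

```python
def kmer_profile(seq, k=3):
    """Set of overlapping amino-acid k-mers in a peptide sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_graph(peptides, k=3, min_shared=1):
    # Edge between two peptides when they share at least `min_shared` k-mers;
    # the resulting graph can be clustered or visualised downstream.
    profiles = {name: kmer_profile(s, k) for name, s in peptides.items()}
    names = sorted(profiles)
    return {(a, b)
            for i, a in enumerate(names) for b in names[i + 1:]
            if len(profiles[a] & profiles[b]) >= min_shared}

print(shared_kmer_graph({"p1": "KKLLK", "p2": "ALLKG", "p3": "GGGGG"}))
```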
Ask the experts
Computational Genomics Group, IBM T.J. Watson Research Center
Cognitive and Cloud, Data-Centric Systems Solutions, IBM T.J. Watson Research Center
Healthcare and Life Sciences, IBM Research – Almaden
Science and Technology Research Council (STFC), United Kingdom
- L.-J. Gardiner et al., “Integrating genomic resources to present full gene and putative promoter capture probe sets for bread wheat,” GigaScience, giz018, 2019.
- W.P. Rowe et al., “Streaming histogram sketching for rapid microbiome analytics,” Microbiome 7(40), 2019.
- L.-J. Gardiner et al., “Hidden variation in polyploid wheat drives local adaptation,” Genome Research 28(9), 1319–1332, 2018.
- L. Olohan et al., “A modified sequence capture approach allowing standard and methylation analyses of the same enriched genomic DNA sample,” BMC Genomics 19(1), 250, 2018.
- R. Krishna et al., “BaaS - Bioinformatics as a Service,” Euro-Par 2018 Parallel Processing Workshops, Lecture Notes in Computer Science, vol. 11339, pp. 601–612, 2018.
- J. Turner et al., “The sequence of a male-specific genome region containing the sex determination switch in Aedes aegypti,” Parasites & Vectors 11, 549, 2018.
- F. Cipcigan et al., “Accelerating molecular discovery through data and physical sciences: Applications to peptide-membrane interactions,” The Journal of Chemical Physics 148(24), 2018.
- S. Grewal et al., “Comparative Mapping and Targeted-Capture Sequencing of the Gametocidal Loci in Aegilops sharonensis,” The Plant Genome 10(2), 2017.
- I. Goodhead et al., “Large scale and significant expression from pseudogenes in Sodalis glossinidius — a facultative bacterial endosymbiont,” bioRxiv, 124388, 2017.
- A.P. Carrieri et al., “Host Phenotype Prediction from Differentially Abundant Microbes Using RoDEO,” in Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2016), Lecture Notes in Computer Science, vol. 10477, Springer, 2017.
- S.D. Armstrong et al., “Stage-specific proteomes from Onchocerca ochengi, sister species of the human river blindness parasite, uncover adaptations to a nodular lifestyle,” Molecular & Cellular Proteomics 15(8), 2554–2575, 2016.
- L. Gardiner et al., “Mapping-by-sequencing in complex polyploid genomes using genic sequence capture: A case study to map yellow rust resistance in hexaploid wheat,” The Plant Journal 87(4), 403–419, 2016.
- A.P. Carrieri et al., “Sampling ARG of multiple populations under complex configurations of subdivision and admixture,” Bioinformatics 32(7), 1048–1056, 2016.
- R. Krishna et al., “A large-scale proteogenomics study of apicomplexan pathogens — Toxoplasma gondii and Neospora caninum,” Proteomics 15(15), 2618–2628, 2015.
- N. Haiminen et al., “Comparative exomics of Phalaris cultivars under salt stress,” BMC Genomics 15(6), 1–12, 2014.