What is the minimal description that captures a space? Asking a mathematician’s basic question of a biological dataset reveals interesting answers about biology itself. This summarizes our underlying approach to subtyping hematological cancer.
Disease subtyping is a central task in precision medicine: the challenge of identifying and classifying patients with similar presentations of a complex and intricate disease. Doing it well can help guide better-informed treatment options for a given individual.
Today, a patient’s data can be collected from a multitude of perspectives (modes): genomic/DNA, transcriptomic/RNA, proteomic, histopathologic images, radiographic and other images, electronic medical records that include a plethora of readouts over time, and much more. Given the general state of our understanding of human diseases, more is indeed more, in terms of data modalities.
However, understanding how a given type of data can help answer a specific question is an intriguing problem. Because most human diseases are complicated and heterogeneous, using data to accurately subtype a disease can open up a range of treatment options in a clinical setting. For example, a therapy with strong side effects could be justified if data could be used to predict the likelihood of a patient’s rapid decline without treatment.
Today, IBM Research and the Munich Leukemia Laboratory are publishing new research in PLOS Computational Biology that aims to subtype different hematological (blood) cancers based on omic data, that is, information about the roles, relationships and actions of the various types of molecules that make up the cells of an organism. In this case, we looked specifically at elements of the human genome, including DNA and dark matter DNA. Very conservatively speaking, we currently know nothing about 50 percent of the human genome, a portion called the “dark matter” by analogy with our very limited understanding of the dark matter of our universe.[1]
Because the tumor cells of origin for a given type of cancer are the same, molecular subtyping becomes harder. We took our analysis further by asking whether DNA alone (not RNA or proteins) gives adequate information to subtype these closely related cancers. Our work produced two breakthroughs in this space:
- DNA alone contains enough signal to subtype blood cancers: DNA is considered the blueprint of the organism. It encodes genes, and regions outside of genes play direct or indirect roles in turning genes on and off.
- “Dark matter” DNA plays a much larger role than previously thought in influencing the phenotype of cells/tissues: Our research found that dark matter DNA alone is adequate for subtyping the cancer. This turns on its head the general belief that dark matter lies largely outside any functional or consequential realm, and indicates that it deserves more study.
The off-the-shelf AI algorithms that we used for this problem were inadequate, underscoring the importance of domain-specific nuances in the statistical learning process. We designed a stochastic regularization AI model, specifically for DNA data, to address the confounding heterogeneity that exists in these datasets. In fact, this works well even for other phenotypes, including treatment responses (suggesting a molecular basis for those phenotypes).
Using the AI models we designed, which we named ReVeal, we achieved 75 percent accuracy in identifying blood cancers using either non-dark DNA or dark matter DNA, compared with just 35 percent accuracy using standard AI methods.[1]
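To make the idea of stochastic regularization mentioned above a little more concrete, here is a minimal sketch of one generic form of it: randomly masking input features at every training step (often called input dropout). This is not the ReVeal model, whose details are not given here. The synthetic binary "mutation" matrix, the function names, and all hyperparameters are assumptions made purely for illustration.

```python
import numpy as np


def make_toy_data(n_samples=300, n_features=50, n_classes=3, seed=0):
    """Synthetic binary feature matrix (hypothetical stand-in for DNA features).

    Each class has 5 'signal' features that are switched on more often;
    all other features are background noise.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n_samples)
    probs = np.full((n_samples, n_features), 0.1)
    for c in range(n_classes):
        probs[y == c, 5 * c:5 * (c + 1)] = 0.8
    X = (rng.random((n_samples, n_features)) < probs).astype(float)
    return X, y


def train_softmax_with_feature_dropout(X, y, n_classes, drop_prob=0.5,
                                       lr=0.5, n_steps=300, seed=0):
    """Softmax regression trained with random feature masking.

    A generic illustration of stochastic regularization, NOT ReVeal.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot labels
    keep = 1.0 - drop_prob
    for _ in range(n_steps):
        # Zero out a random subset of features and rescale the survivors,
        # so the model cannot rely on any single feature being present.
        mask = rng.random(X.shape) < keep
        Xd = X * mask / keep
        logits = Xd @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n_samples  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (Xd.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Predict class labels using the full (unmasked) features."""
    return np.argmax(X @ W + b, axis=1)
```

In this toy setup, one would train on part of the data, for example `train_softmax_with_feature_dropout(X[:200], y[:200], n_classes=3)`, and check `predict` on the remaining samples. The random masking is applied only during training, and prediction uses every feature.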
These results and the models we created lay the groundwork for further exploring the significance of dark matter DNA in blood cancers, and potentially in other types of cancer.