Topology and redescriptions detect multiple alternative biological pathways from clinical phenotypes
Biological pathways play a crucial role in the characteristics of diseases and are important in drug discovery. Identifying the logical relationships among distinct phenotypic clusters could reveal connections to the underlying pathways. However, this process is challenging because clinical phenotypes are often available only through unstructured electronic health records. Moreover, in the absence of a standardized questionnaire, physicians may be biased toward selecting certain medical terms. In this article, we develop an efficient pipeline that addresses these challenges and helps practitioners reveal the pathways associated with a disease. The pipeline combines topological data analysis with redescriptions and consists of four phases: (1) pre-processing the clinical notes to extract the salient concepts, (2) constructing a feature space of the patients to characterize the extracted concepts, (3) leveraging topological properties to distill the available knowledge and visualize the extracted features, and finally, (4) investigating the bias in the clinical notes with respect to the selected features and identifying possible pathways. Our experiments on a publicly available dataset of COVID-19 clinical notes show that the pipeline can indeed extract meaningful pathways.
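To make the four phases concrete, the following is a minimal sketch on toy data, not the implementation used in this article. The notes, the vocabulary, and all function names are hypothetical. Phase 3 is approximated by connected components of a patient similarity graph at a single distance threshold, a much simpler stand-in for topological data analysis, and phase 4 is approximated by pairs of concepts whose patient supports have a high Jaccard similarity, the usual accuracy measure for a redescription.

```python
from itertools import combinations

# Hypothetical toy notes and concept vocabulary (assumptions, not the paper's data).
NOTES = {
    "p1": "fever and dry cough, loss of smell",
    "p2": "fever, cough, shortness of breath",
    "p3": "diarrhea and nausea",
    "p4": "nausea, diarrhea",
}
VOCAB = ["fever", "cough", "loss of smell",
         "shortness of breath", "diarrhea", "nausea"]


def extract_concepts(note, vocab=VOCAB):
    """Phase 1: naive substring matching; it does not handle negation."""
    text = note.lower()
    return {term for term in vocab if term in text}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def components(concepts, eps):
    """Phase 3 stand-in: connected components of the graph that links
    two patients when their Jaccard distance is at most eps."""
    parent = {p: p for p in concepts}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for p, q in combinations(concepts, 2):
        if 1 - jaccard(concepts[p], concepts[q]) <= eps:
            parent[find(p)] = find(q)

    groups = {}
    for p in concepts:
        groups.setdefault(find(p), set()).add(p)
    return sorted(groups.values(), key=sorted)


def redescriptions(concepts, min_jaccard=0.8):
    """Phase 4 stand-in: concept pairs that describe nearly the same patients."""
    support = {}
    for patient, terms in concepts.items():
        for term in terms:
            support.setdefault(term, set()).add(patient)
    return sorted(
        (a, b)
        for a, b in combinations(sorted(support), 2)
        if jaccard(support[a], support[b]) >= min_jaccard
    )


# Phase 2: the feature space here is simply each patient's set of concepts.
concepts = {patient: extract_concepts(note) for patient, note in NOTES.items()}
print(components(concepts, eps=0.6))
print(redescriptions(concepts))
```

On this toy input the respiratory and gastrointestinal patients fall into separate components, and the concept pairs returned by `redescriptions` are the kind of interchangeable terms that could reflect either a shared pathway or a physician's wording preference.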