ISMB 2024

Improving Health-Status Classification for Epithelial Cells of Inflammatory Bowel Disease using a Transcriptomic Biomedical Foundation Model


Recent advances in high-throughput sequencing technology have contributed greatly to genetic pathology, including the development of single-cell RNA sequencing (scRNA-seq) analysis. One recent translational application of scRNA-seq is to inflammatory bowel disease (IBD), a chronic inflammatory disorder of the gastrointestinal tract. While the etiology of IBD remains poorly understood, there is general agreement that multiple factors, including genetic susceptibility and environmental and microbial triggers, contribute to its manifestation. Evaluating the inflammatory state of tissue samples from IBD patients at single-cell resolution from gene expression is challenging, yet holds promise for elucidating pathogenesis. Machine learning algorithms are effective at extracting essential information about inflammation status from large, high-dimensional data. However, existing machine learning algorithms may fail to identify the inflammation status when only limited samples are available. Biomedical foundation models (BMFMs) pre-trained on large-scale datasets have outperformed models trained from scratch on task-specific data alone. In this study, we sought to classify healthy, lesional, and non-lesional intestinal cells from IBD patients and controls. We developed a BMFM that was pre-trained on over 1 million cells from PanglaoDB and then re-trained on the SCP259 dataset, which contains 365 thousand cells from the colon mucosa of 18 IBD patients and 12 healthy individuals. Our model outperformed XGBoost, with significant improvement for cell types with limited samples. Our results suggest that pre-training on large data can aid in evaluating the status of cells even when few samples are available.