Identifying Putative Gene Markers: A Biomedical Foundation Model-based Approach for Cell Type Annotation
Abstract
Single-cell RNA sequencing (scRNA-seq) offers novel insights into gene expression patterns, but its analysis is computationally intensive and scales poorly. Accurate cell type annotation remains a barrier to scRNA-seq analysis of complex tissues and disease states. Here, we describe a novel in silico approach to cell type annotation that uses Biomedical Foundation Models (BMFM) pre-trained on millions of single-cell transcriptomes. We developed a pre-trained BMFM based on scBERT (a variant of BERT) and fine-tuned it on a dataset from the human colon mucosa of 18 subjects with ulcerative colitis and 12 healthy subjects (SCP259; 6,000 genes across 51 cell types). We applied the layer integrated gradients interpretation method (https://captum.ai/docs/introduction.html) to calculate the attribution score of each gene in predicting a cell type, using both the pre-trained and fine-tuned models. We then derived a “learning” ratio of the fine-tuned versus the pre-trained model to quantify transfer learning and to highlight gene-cell type pairs as potential gene markers. Overall, the fine-tuned model yielded higher attribution scores than the pre-trained model. We observed high learning ratios for known gene markers; in goblet cells, for example, these included FCGBP (8.8), MUC2 (8.4), and ZG16 (15.4). In contrast, although the attribution score for REP15 was relatively low in the fine-tuned model, it had the largest learning ratio (19.8), suggesting that it is a putative marker of cellular dysfunction in the colon. Our transfer learning approach has the potential to identify putative marker genes across various cell types and is applicable to numerous diseases.
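The learning ratio described above can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the sketch assumes the ratio is the magnitude of a gene's fine-tuned attribution score divided by the magnitude of its pre-trained attribution score for the same cell type, with a small constant added to the denominator to avoid division by zero. The function names, the `eps` parameter, the placeholder gene names, and the scores are all hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple


def learning_ratios(
    pretrained: Dict[str, float],
    finetuned: Dict[str, float],
    eps: float = 1e-8,
) -> Dict[str, float]:
    """Per-gene ratio of fine-tuned to pre-trained attribution magnitude.

    Assumed definition (not confirmed by the source):
        ratio = |fine-tuned score| / (|pre-trained score| + eps)
    Genes missing from either model are skipped.
    """
    ratios = {}
    for gene, ft_score in finetuned.items():
        if gene not in pretrained:
            continue
        ratios[gene] = abs(ft_score) / (abs(pretrained[gene]) + eps)
    return ratios


def rank_genes(ratios: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort genes from the largest to the smallest learning ratio."""
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Hypothetical attribution scores for one cell type (placeholder values).
pretrained_scores = {"GENE_A": 0.10, "GENE_B": 0.02, "GENE_C": 0.50}
finetuned_scores = {"GENE_A": 0.80, "GENE_B": 0.40, "GENE_C": 0.60}

ranked = rank_genes(learning_ratios(pretrained_scores, finetuned_scores))
for gene, ratio in ranked:
    print(f"{gene}: learning ratio = {ratio:.1f}")
```

In this toy example, GENE_B has a lower fine-tuned attribution score than the other two genes but the largest ratio, because its pre-trained score is close to zero. This is the same pattern the abstract reports for REP15, and it shows why the ratio can highlight genes that a ranking by raw attribution score would miss.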