Poster

Improving Transcriptomic Foundation Models Through Gene-Level Error Analysis and Adaptive Masking

Abstract

High-throughput transcriptomics enables large-scale measurement of gene expression patterns, and transcriptomic foundation models (TFMs) such as scBERT, scGPT, and scFoundation now use masked-prediction pretraining to learn cell representations that improve performance on downstream tasks such as cell-type annotation and perturbation prediction. Current pretraining strategies, however, treat the prediction of a gene's expression level as independent of the gene's identity. Here we introduce gene-level error metrics that evaluate model performance separately for each gene, revealing significant and reproducible variation: some genes are consistently harder or easier to predict than others. Interestingly, a gene's predictability can differ markedly between binary (presence/absence) and continuous (expression level) prediction tasks, revealing distinct aspects of gene regulation that models must capture. Biological interpretation of these patterns shows that highly co-regulated families such as the RP* (ribosomal protein) genes are easier to predict, whereas transcription factors such as JUN and other regulators are significantly harder. We then propose a difficulty-adaptive masking strategy that dynamically increases the masking probability of harder-to-predict genes based on validation error. We implemented this strategy in the bmfm-rna framework (available at https://github.com/BiomedSciAI/biomed-multi-omic) and found that it improves predictions for high-uncertainty genes without compromising overall validation loss, compared to uniform masking. Our results emphasize the value of domain-specific training metrics, which can inform both modeling decisions and biological discovery by better capturing the complex regulatory relationships between genes.
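The difficulty-adaptive masking idea described above can be sketched in a few lines. This is an illustrative sketch, not the bmfm-rna implementation: the per-gene error metric, the error-proportional scaling rule, and the cap parameters (`base_prob`, `max_prob`) are hypothetical choices, assuming only that each gene's masking probability grows with its relative validation error.

```python
import numpy as np

rng = np.random.default_rng(0)

def gene_level_error(preds, targets):
    """Mean absolute prediction error per gene.

    Rows are cells, columns are genes. A hypothetical per-gene metric;
    the abstract does not specify the exact error definition used.
    """
    return np.abs(preds - targets).mean(axis=0)

def adaptive_mask_probs(gene_errors, base_prob=0.15, max_prob=0.5):
    """Scale each gene's masking probability by its relative validation
    error, so harder-to-predict genes are masked more often.

    Illustrative rule: probability proportional to error / mean error,
    clipped to [0, max_prob].
    """
    rel_error = gene_errors / gene_errors.mean()
    return np.clip(base_prob * rel_error, 0.0, max_prob)

def sample_mask(probs, n_cells):
    """Draw a Boolean (cells x genes) mask with per-gene probabilities."""
    return rng.random((n_cells, probs.size)) < probs
```

In a training loop, `gene_level_error` would be recomputed on the validation set each epoch and the resulting probabilities fed back into `sample_mask` for the next round of masked pretraining, so the masking distribution tracks the model's current weaknesses.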

Related