Alice Driessen, Susane Unger, et al.
ISMB 2023
Transcriptomic foundation models (TFMs) aim to generalize across diverse single-cell datasets and have demonstrated utility in downstream tasks such as cell type annotation and perturbation prediction. These successes are often attributed to effective pretraining on gene expression prediction. However, emerging evidence suggests that mastery of the pretraining objective is not a strong predictor of downstream performance. Mastering the pretraining objective would imply that the model can consistently distinguish genuine gene expression profiles from random samples with the same marginal statistics. Using scGPT as a test case, we evaluate this both at the regional level, by measuring reconstruction loss on interpolated samples within and between clusters, and at the sample level, by comparing loss and representation quality between real and shuffled data. We find that TFMs in their zero-shot state often fail to distinguish real from shuffled samples, despite the latter being synthetic, though continued pretraining can improve this discrimination. Notably, cell type clusters do not systematically exhibit lower pretraining error than transcriptomic states between clusters. Surprisingly, the quality of the learned representations appears largely independent of the model's ability to identify plausible expression profiles, suggesting that gene identity, rather than precise expression levels, drives representation learning. These findings highlight a disconnect between the pretraining objective and the representations learned, and point to new unsupervised evaluation strategies that better capture TFM quality and guide model development.
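The shuffled negatives described above can be sketched as follows: permuting each gene's expression values independently across cells destroys gene-gene structure while leaving every gene's marginal distribution intact. This is a minimal illustrative sketch, not the authors' pipeline; the toy matrix and the function name `shuffle_within_genes` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cells x genes expression matrix (illustrative stand-in for real data).
X = rng.poisson(lam=2.0, size=(100, 50)).astype(float)

def shuffle_within_genes(X, rng):
    """Permute each gene's values independently across cells.

    This breaks gene-gene correlations while preserving every gene's
    marginal distribution -- the kind of synthetic negative sample a
    well-trained model should assign higher reconstruction loss to.
    """
    X_shuf = X.copy()
    for g in range(X.shape[1]):
        X_shuf[:, g] = rng.permutation(X[:, g])
    return X_shuf

X_shuf = shuffle_within_genes(X, rng)

# Marginals match: the sorted values of each gene column are identical,
# even though the joint (per-cell) profiles are now scrambled.
assert np.allclose(np.sort(X, axis=0), np.sort(X_shuf, axis=0))
```

In an evaluation like the one described, both `X` and `X_shuf` would be fed to the pretrained model and their reconstruction losses compared; a model that has truly mastered the objective should score the real profiles as more plausible.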