Aligning foundation models on encoded synthetic omic data for patient stratification

Nikita Janakarajan; Antonio Foncubierta; Matteo Manica

ICDH 2025

Short paper

07 Jul 2025

Abstract

The use of real-world health data for foundation model training often raises concerns about the potential sharing of sensitive information. Synthetic data may prove to be one of the best assets for limiting such concerns. In this manuscript, we introduce a new paradigm for training foundation models: generate synthetic data, encode it with a compression method and a frequency-based mapping, and use the encoded data to align a foundation model. We demonstrate our pipeline on the task of stratifying colorectal cancer patients into consensus molecular subtypes (CMS) using a decoder-only model. Evaluating the aligned model on real data yields a balanced accuracy and F1 score of approximately 91%, competitive with baselines established by prior work that uses real data, as well as with models trained directly on synthetic data. Code to reproduce the results is available at https://github.com/IBM/unified-lookup-tables.
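For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how balanced accuracy and an F1 score can be computed for a multi-class task such as CMS prediction. It is illustrative only: the abstract does not state which F1 averaging is used, so macro averaging is an assumption here, and the labels below are toy values, not data or results from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only (hypothetical, not from the paper).
y_true = ["CMS1", "CMS1", "CMS2", "CMS2", "CMS3", "CMS4"]
y_pred = ["CMS1", "CMS2", "CMS2", "CMS2", "CMS3", "CMS4"]

print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.3f}")
print(f"macro F1:          {macro_f1(y_true, y_pred):.3f}")
```

Balanced accuracy averages recall across classes, so a small subtype counts as much as a large one. This is why it is commonly reported alongside F1 when class sizes differ.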

Conference paper