C.A. Micchelli, W.L. Miranker
Journal of the ACM
Training Foundation Models on real-world health data raises concerns about the potential sharing of sensitive information. Synthetic data may prove to be one of the best assets for limiting such concerns. In this manuscript, we introduce a new paradigm for training Foundation Models: generate synthetic data, encode it with a compression method and a frequency-based mapping, and use the encoded data to align a Foundation Model. We demonstrate our pipeline on the task of stratifying colorectal cancer patients into consensus molecular subtypes (CMS) using a decoder-only model. Evaluated on real data, the aligned model achieves a balanced accuracy and F1 score of approximately 91%, competitive with baselines established by prior work that leverages real data as well as with models trained directly on synthetic data. Code to reproduce the results is available at https://github.com/IBM/unified-lookup-tables.
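For readers unfamiliar with the two reported metrics, the following minimal sketch shows how balanced accuracy and an F1 score can be computed for a multi-class task such as CMS stratification. It is not taken from the linked repository. The function names, the toy labels, and the choice of macro averaging for F1 are assumptions made for illustration; the averaging scheme is not specified here.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that occur in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from the study.
y_true = ["CMS1", "CMS1", "CMS2", "CMS2", "CMS3", "CMS4"]
y_pred = ["CMS1", "CMS2", "CMS2", "CMS2", "CMS3", "CMS4"]

print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.3f}")
print(f"macro F1:          {macro_f1(y_true, y_pred):.3f}")
```

Balanced accuracy averages recall across classes, so a subtype with few patients counts as much as a common one, which matters when the subtypes are unevenly represented.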