Open-source large-scale foundation model for chemistry

Abstract

Large-scale pre-training methodologies for chemical language models have revolutionized the field of cheminformatics, offering significant advancements in tasks such as molecular property prediction and molecule generation. These models leverage self-supervised learning to derive contextualized representations of input tokens by training on large, unlabeled molecular datasets. Typically, the training process consists of two stages: pre-training on large, unannotated chemical corpora, followed by fine-tuning on domain-specific tasks. This approach reduces the reliance on costly annotated datasets and enhances the model's capacity to generalize across a broader spectrum of chemical representations.
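
To make the two-stage paradigm concrete, the sketch below illustrates self-supervised pre-training via masked-token prediction on unlabeled SMILES strings, followed by supervised fine-tuning of a small property head on the same encoder. The toy character vocabulary, layer sizes, and masking scheme are illustrative assumptions and do not reflect the actual architecture or tokenizer used in this work.

```python
import torch
import torch.nn as nn

# Toy character-level SMILES vocabulary; 0 = padding, 1 = mask token.
VOCAB = list("()=#CONcon123[]+-@Hl")
stoi = {ch: i + 2 for i, ch in enumerate(VOCAB)}
MASK_ID, VOCAB_SIZE, MAX_LEN = 1, len(VOCAB) + 2, 64

def encode(smiles: str) -> torch.Tensor:
    ids = [stoi.get(ch, 0) for ch in smiles[:MAX_LEN]]
    return torch.tensor(ids + [0] * (MAX_LEN - len(ids)))

class ToyChemEncoder(nn.Module):
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.mlm_head = nn.Linear(d_model, VOCAB_SIZE)  # used during pre-training
        self.reg_head = nn.Linear(d_model, 1)           # used during fine-tuning

    def forward(self, ids):
        return self.encoder(self.emb(ids))

model = ToyChemEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1: self-supervised pre-training -- mask random tokens and predict them.
batch = torch.stack([encode(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]])
mask = (torch.rand(batch.shape) < 0.15) & (batch != 0)
mask[:, 0] = True                                # guarantee at least one masked token
corrupted = batch.masked_fill(mask, MASK_ID)
logits = model.mlm_head(model(corrupted))
loss = nn.functional.cross_entropy(logits[mask], batch[mask])
loss.backward(); opt.step(); opt.zero_grad()

# Stage 2: supervised fine-tuning -- pool the encoder states and fit a small
# regression head on a labeled property, reusing the pre-trained weights.
labels = torch.tensor([[0.1], [0.7], [0.3]])     # toy property values
pooled = model(batch).mean(dim=1)
loss = nn.functional.mse_loss(model.reg_head(pooled), labels)
loss.backward(); opt.step()
```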

Here, we introduce a novel family of large-scale encoder-decoder chemical foundation models, pre-trained on a curated dataset of 91 million SMILES samples extracted from PubChem. This dataset encompasses approximately 4 billion molecular tokens, allowing the model to capture an extensive range of chemical diversity. Our pre-training strategy focuses on maximizing the model's ability to encode structural and functional aspects of molecules, ensuring that it generalizes effectively to a wide range of downstream tasks. We present two main variants of the model: a base version with 289 million parameters and a Mixture-of-Experts version with 8×289M parameters, providing flexibility for different use cases. These models were evaluated across multiple benchmark datasets and demonstrated state-of-the-art performance on a range of tasks, including quantum property and reaction yield prediction.
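
For readers unfamiliar with the Mixture-of-Experts design, the sketch below shows the general idea behind such a layer: a routing network dispatches each token to one of several expert feed-forward blocks, so only a fraction of the total parameters is active per input. The layer sizes, top-1 routing, and class names are illustrative assumptions rather than the released model's actual configuration.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Illustrative Mixture-of-Experts feed-forward layer with top-1 routing."""
    def __init__(self, d_model=64, d_hidden=128, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)      # per-token gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: (batch, seq, d_model)
        gates = self.router(x).softmax(dim=-1)             # (batch, seq, num_experts)
        top_gate, top_idx = gates.max(dim=-1)              # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_idx == e                             # tokens routed to expert e
            if sel.any():
                out[sel] = top_gate[sel].unsqueeze(-1) * expert(x[sel])
        return out

tokens = torch.randn(2, 16, 64)                            # toy token states
print(MoEFeedForward()(tokens).shape)                      # torch.Size([2, 16, 64])
```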

A key aspect of this work is the exploration of the model's latent embedding space. We present a preliminary assessment of its compositionality, a critical feature for reasoning-based tasks. The latent space demonstrates improved separability compared to state-of-the-art models, facilitating few-shot learning scenarios where minimal training data is available. This capability is especially valuable in chemical research, where adaptability and rapid learning from small datasets are essential. To support the wider research community, we are releasing the model weights on Hugging Face: https://huggingface.co/ibm/materials.smi-ted. Additionally, the codebase is available on GitHub: https://github.com/IBM/material.
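
As a starting point, the sketch below downloads the released checkpoint with huggingface_hub and fits a lightweight classifier on a handful of labeled molecules, mirroring the few-shot scenario described above. snapshot_download and LogisticRegression are real APIs; get_embeddings is a hypothetical placeholder for the loading and encoding utilities provided in the GitHub repository, and the SMILES strings and labels are made up for illustration.

```python
import numpy as np
from huggingface_hub import snapshot_download
from sklearn.linear_model import LogisticRegression

# Fetch the released weights (repo id taken from the announcement above).
checkpoint_dir = snapshot_download("ibm/materials.smi-ted")

def get_embeddings(smiles_list, checkpoint_dir):
    """Placeholder: in practice, load the encoder from checkpoint_dir with the
    utilities in https://github.com/IBM/material and return one latent vector
    per SMILES string. Random vectors are used here only to keep the sketch
    self-contained and runnable."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(smiles_list), 768))

# Few-shot setting: only a handful of labeled molecules are available.
train_smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN(CC)CC"]
train_labels = [0, 0, 1, 1]

X_train = get_embeddings(train_smiles, checkpoint_dir)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

X_test = get_embeddings(["c1ccc(O)cc1"], checkpoint_dir)
print(clf.predict(X_test))
```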
