MRS Fall Meeting 2023

A transformer-based large-scale molecular representation model


Large-scale molecular representation methods have been shown to be useful in several applications and areas of materials science, including virtual screening, drug discovery, chemical modeling, material design, and molecular dynamics simulation. These representations enable both effective and efficient analysis of molecular data. With advances in deep learning, several models have been developed to learn representations directly from molecular structures. Recently, transformer-based molecular representations have gained significant importance in materials informatics, and that importance continues to grow as researchers explore their potential in advancing drug discovery, materials science, and other areas of molecular research. In this study, we develop one such transformer-based model that is capable of capturing complex relationships and interactions within molecules. While most existing work captures representations only through encoder-only models, we present an encoder-decoder model based on BART (Bidirectional and Auto-Regressive Transformers) that not only learns molecular representations efficiently but also generates molecules auto-regressively from those representations. This capability can be highly impactful for new molecule design and generation, enabling efficient and effective analysis and manipulation of molecular data. The model is trained on a dataset of 10 billion molecules from the publicly available ZINC-22 database, making this the most extensive training dataset employed to date. The dataset is encoded in the SELFIES (SELF-referencing Embedded Strings) representation because SELFIES provides a concise and interpretable representation, making it suitable for machine learning applications where compactness and generalization are important. The encoded SELFIES are then tokenized using an efficient tokenization scheme with masking to improve generalizability. We show that the learned molecular representation outperforms existing baselines on downstream tasks, thus validating the efficacy of the large pre-trained model.
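
The tokenization-with-masking step can be illustrated with a minimal sketch. The abstract does not specify the tokenization scheme, the mask symbol, or the masking rate, so everything below is an assumption made for illustration: each bracketed SELFIES symbol is treated as one token, the `<mask>` symbol and the default rate of 0.15 are hypothetical, and tokens are masked independently at random, which is simpler than the span-based corruption commonly used when pre-training BART. The helper names `tokenize_selfies` and `mask_tokens` are likewise hypothetical and are not taken from the authors' code.

```python
import random
import re

# Matches one bracketed SELFIES symbol such as [C], [=O], or [Branch1].
SELFIES_TOKEN = re.compile(r"\[[^\[\]]+\]")

MASK = "<mask>"  # hypothetical mask symbol, not taken from the abstract


def tokenize_selfies(selfies):
    """Split a SELFIES string into its bracketed symbols."""
    tokens = SELFIES_TOKEN.findall(selfies)
    # Reject input that contains characters outside the bracketed symbols.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace each token with MASK independently with probability mask_prob.

    The 0.15 default is an assumed value, not one reported in the abstract.
    """
    rng = random.Random(seed)
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


if __name__ == "__main__":
    # Ethanol written as SELFIES.
    tokens = tokenize_selfies("[C][C][O]")
    print(tokens)
    print(mask_tokens(tokens, mask_prob=0.5, seed=0))
```

In a denoising setup of this kind, the masked sequence would be fed to the encoder and the decoder would be trained to reconstruct the original token sequence; the sketch covers only the input corruption, not the model.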