ACS Spring 2023

Unifying Molecular and Textual Representations via Multi-task Language Modelling


Neural language models have achieved impressive results in various natural language understanding and generation tasks. Recently, advances in language models have been successfully transferred to the chemical domain, yielding generative modeling solutions to classical problems ranging from molecular design to synthesis planning. These new methods have shown potential for optimizing chemical laboratory operations, initiating a new era of data-driven automation in scientific discovery. However, despite these recent successes, a specialized model is typically needed for each chemical task, which requires problem-specific fine-tuning and neglects dependencies between tasks. Moreover, the lack of a unified representation between information expressed in natural language and chemical representations is the main limiting factor in the interaction between humans and the models. Inspired by recent advances in generative transfer learning, we explore a multi-task language model that can tackle a large variety of tasks in the chemical and natural language domains. We rely on mono-domain, frozen encoder models and jointly fine-tune a decoder on multiple domains. In doing so, we relieve cross-domain training of computationally expensive, data-hungry pretraining while leveraging the power of language models trained on unstructured data. Furthermore, we apply multi-task learning to increase model expressivity and information sharing between modalities. In this way, our model handles chemical and natural language concurrently and can solve numerous chemical and natural language-based tasks using a single set of weights. We quantitatively evaluate our method against state-of-the-art baselines, exploring different strategies to adapt and fine-tune cross-domain language models. Our work paves the way for robust and efficient language models that accelerate discovery in the physical sciences.
Our model, leveraging large, pre-trained single-domain models, can effectively solve chemical tasks (forward reaction prediction, retrosynthesis), cross-domain tasks (paragraph to molecule: text-conditioned de novo generation; molecule to paragraph: molecular captioning), and language tasks (conditional text generation/completion).
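
One way to picture how a single set of weights could serve all of these tasks is to cast each one as a text-to-text pair and mix the tasks within every training batch. The sketch below illustrates that idea only. The prompt wording, the task names, the `Example` type, the `to_text_pair` and `mixed_batches` helpers, and the toy SMILES data are all assumptions made for illustration; the abstract does not describe the actual input format or sampling scheme.

```python
import random
from dataclasses import dataclass

# Hypothetical task prompts; the abstract does not specify any prompt wording.
TASK_PROMPTS = {
    "forward": "Predict the product of the following reaction: ",
    "retrosynthesis": "Predict the reactants needed to synthesize: ",
    "text2mol": "Write a molecule matching this description: ",
    "mol2text": "Describe the following molecule: ",
    "paragraph": "Complete the following text: ",
}


@dataclass(frozen=True)
class Example:
    task: str    # key into TASK_PROMPTS
    source: str  # SMILES string or natural-language text
    target: str  # SMILES string or natural-language text


def to_text_pair(ex):
    """Cast any task into a (prompted input, target) pair of plain strings."""
    return TASK_PROMPTS[ex.task] + ex.source, ex.target


def mixed_batches(datasets, batch_size, num_batches, seed=0):
    """Yield batches that mix tasks by sampling a task uniformly per item."""
    rng = random.Random(seed)
    tasks = sorted(datasets)  # fixed order keeps sampling reproducible
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            task = rng.choice(tasks)
            batch.append(to_text_pair(rng.choice(datasets[task])))
        yield batch


# Toy data for illustration only (not taken from the authors' datasets).
TOY_DATA = {
    "forward": [Example("forward", "CCO.CC(=O)O", "CC(=O)OCC")],
    "retrosynthesis": [Example("retrosynthesis", "CC(=O)OCC", "CCO.CC(=O)O")],
    "text2mol": [Example("text2mol", "A primary alcohol with two carbon atoms.", "CCO")],
    "mol2text": [Example("mol2text", "CCO", "Ethanol is a primary alcohol.")],
    "paragraph": [Example("paragraph", "The reaction mixture was stirred", " at room temperature.")],
}

if __name__ == "__main__":
    for batch in mixed_batches(TOY_DATA, batch_size=4, num_batches=2):
        for source, target in batch:
            print(repr(source), "->", repr(target))
```

Because every task reduces to the same string-in, string-out interface, one decoder can be trained on all of them at once. Uniform task sampling is only the simplest choice here; weighting tasks by dataset size is an equally plausible alternative.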