ICML 2023
Conference paper

Unifying Molecular and Textual Representations via Multi-task Language Modelling


The recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to optimize laboratory operations and fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each chemical task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose a multi-domain, multi-task language model to solve a wide range of tasks in both the chemical and natural language domains. By leveraging multi-task learning, our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increases with scale, as measured by more than a dozen relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.