Predicting polymerization reactions via transfer learning using chemical language models
Abstract
Polymers have versatile properties and a wide range of applications. The optimization of polymeric materials and the development of new polymers are, however, time-consuming processes. Machine learning (ML) techniques have been shown to significantly accelerate discovery by predicting polymer properties or, more recently, by enabling the automated design and generation of new polymers with predefined target properties. Despite these advances, computational polymer discovery lacks automated analysis of reaction pathways and stability assessment through retro-synthesis. No ML models currently exist for retro-synthesis analysis across copolymers, polymer blends, and ladder, cross-linked, and metal-containing polymers; previous research has predominantly focused on homopolymers. Another critical issue is that existing ML models do not consider the influence of solvents, catalysts, and experimental conditions. In this work, we report the first extension of a transformer-based language model to polymerization reactions, trained on a curated reaction dataset for vinyl polymers. We train models for both forward and backward polymerization reaction prediction tasks, covering homopolymers and copolymers consisting of up to two monomers. Polymers are macromolecules formed by linking smaller molecular units. Their synthesis typically involves several polymerization steps, with a multitude of possible links between monomer units. To handle this ambiguity during ML training, we developed two distinct methodologies for assigning the head and tail positions of the repeat units, and we discuss the ML results obtained with each. Overall, we obtain a Top-4 accuracy of 80% for the forward model and 60% for the backward model.
We further analyze the model performance with representative polymerization examples and evaluate its prediction quality from a materials science perspective.
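For readers unfamiliar with the Top-4 metric quoted above, the following is a minimal sketch of how a Top-k accuracy is commonly computed for sequence-prediction models. It assumes that each model output is a ranked list of candidate strings and that predictions and targets have already been canonicalized so that exact string comparison is meaningful. The function name `top_k_accuracy` and the toy strings are hypothetical illustrations, not the evaluation code or data used in this work.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 4,
) -> float:
    """Return the fraction of examples whose target is among the first k predictions.

    Assumption: predictions and targets are already canonicalized strings,
    so a plain equality check is a valid match criterion.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("ranked_predictions and targets must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not targets:
        return 0.0

    hits = 0
    for ranked, target in zip(ranked_predictions, targets):
        # Only the k highest-ranked candidates count toward a hit.
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy, made-up example: three reactions, each with a ranked candidate list.
toy_predictions = [
    ["P1", "P2", "P3", "P4"],        # target is ranked 2nd: hit
    ["Q1", "Q2", "Q3", "Q4", "Q5"],  # target is ranked 5th: miss for k=4
    ["R1", "R2"],                    # target is absent: miss
]
toy_targets = ["P2", "Q5", "R9"]

toy_score = top_k_accuracy(toy_predictions, toy_targets, k=4)
print(f"Top-4 accuracy on toy data: {toy_score:.3f}")
```

Under this definition, a Top-4 accuracy of 80% means that for 80% of the test reactions the reference answer appears among the model's four highest-ranked predictions.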