MRS Fall Meeting 2023

In-Silico Polymer Generation Using Fine-Tuned Regression Transformers with Open Reference Data


In-silico design of homopolymers, accelerated by machine learning and high-throughput screening, has attracted widespread attention in recent years. However, applications to specific domains remain challenging because the scarcity of data (typically structure-property pairs) not only limits model generalization but also increases uncertainty in model inference. QSPR (Quantitative Structure-Property Relationship), a method widely used to aid data augmentation, is suitable for exploring unaddressed regions of feature space. The method performs satisfactorily for intrinsic material properties such as density, molar weight, and cohesive energy. However, it is less suited to properties that reflect a material's interaction with its chemical environment, such as refractive index or gas permeability. In this contribution, we start materials discovery from tabulated information available in the open-access literature. Furthermore, instead of training models from scratch, we rely on publicly available regression transformer models pretrained on large chemical datasets, which we fine-tune on a small number of data points extracted from the literature. The models are tailored to finding new polymer membrane candidates with target $CO_2$ and $N_2$ permeability. We have evaluated published reference data beyond their initial scope of application. To that end, we have focused on tabulated data with clearly stated experimental conditions. The semi-automated extraction process includes an OCR (Optical Character Recognition) step, validation against ground truth, and merging into a single data table. Some of the documents include either rich tables [1] or raw data in XML format [2], which greatly accelerates the process. To obtain pSMILES (SMILES for specified homopolymers), we have fetched information from online databases and documents. In total, we have collected 160 pSMILES with $CO_2$ and $N_2$ permeability values measured under comparable experimental conditions.
The molecular weights of the collected species range from 40 to 1000, and the pSMILES string lengths range from 5 to 148. A regression transformer (RT) handles regression as a generative task by modeling a continuous property as a group of tokens that represent digits and their orders of magnitude [3]. The RT algorithm used here exploits a custom pSMILES tokenizer and is available through GT4SD (Generative Toolkit 4 Scientific Discovery) [4]. Before fine-tuning the model, both permeability values are converted to log scale to stabilize the variance and reduce the skew of the data distribution. We hold out 25% of the dataset for model testing. Training the RT yields a dichotomous model that switches seamlessly between property prediction and conditional molecule design. As an example, a structure-property pair has the input format "<pco2log>2.255|*C(Cl)=C(*)CCCC". We have evaluated the pretrained models with both single and multiple properties. Some generated polymer candidates, such as *C(F)SC(*)[Si](C)(C)C and *c1c(Cl)cc(c2cc(C)c(N3C(=O)c4cc5c(cc4C3=O)C3(C)CCC5(C)c4cc5c(cc43)C(=O)N(*)C5=O)c3ccccc23)c2ccccc12, show high permeability values for $CO_2$ and are currently under investigation in automated molecular dynamics simulations. In summary, we have reported the generation of new homopolymer candidates for gas separation applications, which are currently being validated with gas filtration simulations. Advances in natural language processing and machine learning have accelerated automated scientific discovery. The publicly available dataset and open-source algorithms will help researchers reproduce our discovery results with minimum effort and expand the scope to other polymer applications.

[1] M. Songolzadeh et al.
[2] B. Dhuiège et al.
[3] J. Born and M. Manica
[4] M. Manica, J. Born, J. Cadow et al.
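To make the data preparation described in the abstract concrete, the following is a minimal Python sketch, not the authors' pipeline and not the GT4SD API. It formats one structure-property pair in the "<tag>value|pSMILES" layout quoted above, holds out 25% of the samples, and shows how a number can be broken into digits paired with their orders of magnitude. The function names, the base-10 logarithm, the three-decimal rounding, and the random seed are all assumptions made for illustration.

```python
import math
import random


def to_rt_sample(psmiles: str, permeability: float, tag: str = "pco2log") -> str:
    """Format one structure-property pair as '<tag>value|pSMILES'.

    Assumption: 'log scale' means log10, and values keep three decimals,
    which matches the example string quoted in the abstract.
    """
    log_value = math.log10(permeability)
    return f"<{tag}>{log_value:.3f}|{psmiles}"


def split_dataset(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and hold out a fraction for testing.

    Assumption: a simple random split; the abstract only states that
    25% of the data is used for testing.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def digit_tokens(value_text: str):
    """Pair each digit of a non-negative decimal string with its power of ten.

    Hypothetical helper: it only illustrates the idea of tokens that encode
    a digit and its order of magnitude, not the tokenizer's actual vocabulary.
    """
    integer_part, _, fraction_part = value_text.partition(".")
    tokens = []
    for position, digit in enumerate(integer_part):
        tokens.append((digit, len(integer_part) - 1 - position))
    for position, digit in enumerate(fraction_part, start=1):
        tokens.append((digit, -position))
    return tokens
```

Under these assumptions, `to_rt_sample("*C(Cl)=C(*)CCCC", 10 ** 2.255)` reproduces the example string from the abstract, `split_dataset` applied to 160 samples yields 120 training and 40 test samples, and `digit_tokens("2.255")` pairs the digits 2, 2, 5, 5 with the exponents 0, -1, -2, -3.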