MRS Fall Meeting 2022
Conference paper

A Data-driven Discovery Pipeline for Flue Gas Separation Homopolymer Membranes


• Introduction: Flue gas generated by human activities is one of the most significant causes of global warming, because it releases a large amount of CO2 into the Earth’s atmosphere every year. To reduce these emissions, CO2 separation is a near-term goal, and polymer membranes offer a cheap and environmentally friendly approach, from fabrication to in-situ application. Furthermore, recent advances in machine learning (ML) enable in-silico membrane design aimed at finding candidates with more energy-efficient and robust in-situ performance. Here, we present a discovery pipeline for data-driven polymer membrane design covering several phases: data preparation, ML model training, ML model sampling, and filtering and analysis of the generated candidates.

• Data preparation: Although an ever-growing collection of datasets for small molecules is easing the process of building ML models for material design, data coverage for polymers remains low, essentially because it is challenging to collect high-quality data. In this work, we consider a benchmark polymer database (PI1M) with ~1M p-SMILES (polymer-SMILES) [1] generated in silico starting from the PolyInfo [2] database. To generate properties associated with the p-SMILES, we adopted a Quantitative Structure-Property Relationship (QSPR) approach implemented in our Polymer Property Prediction engine [3]. By capturing structural information such as topological variables, connectivity indices, and group contributions, we have been able to associate physical properties with the structures accurately and in a high-throughput fashion.

• Generative modeling: To generate alternative polymers, following the lead of Gómez-Bombarelli et al. [4], we jointly trained a variational autoencoder (VAE, based on RNN cells) and a multilayer perceptron (MLP) on its latent space, using p-SMILES representations of polymers and focusing on the following target properties: glass transition temperature (Tg), half-decomposition temperature (Tdh), and solubility.
The learning process converged after 100 epochs (learning rate = 1e-3, batch size = 64), and using Gaussian processes it was possible to sample polymers with a solubility distribution matching the one observed for PI1M samples. The training experiments and the inference pipelines were implemented using GT4SD [5].

• Filtering and validation: When generating innovative monomer candidates, physical validation times (from newly generated p-SMILES to CO2 permeability) are prohibitive. To overcome this limitation, we explored the chemical space of the generated p-SMILES using TMAP [6], which allows candidates to be analyzed according to their local and global neighborhoods in terms of structural similarity. A few candidates were selected for their high Tg and Tdh and for a solubility value that falls within a range of interest. Validating the candidates is straightforward using molecular dynamics simulations with software packages such as LAMMPS [7] or GROMACS [8]. Such a discovery pipeline can be easily automated, and its results leveraged to establish a feedback loop to further fine-tune the model on different building blocks.

• Conclusion: Herein, we demonstrated how, by enriching PI1M data using simulations, we can design a data-driven discovery pipeline relying on conditional generative models for homopolymer membranes. Given the generality of the implemented components, the approach can be easily extended to copolymer membrane discovery.

[1]
[2]
[3]
[4] Gómez-Bombarelli, R. et al. ACS Central Science 4 (2), 268-276 (2018).
[5]
[6] Probst, D., Reymond, J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J Cheminform 12, 12 (2020).
[7]
[8]
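
The property-based selection described under "Filtering and validation" can be illustrated with a minimal sketch. It is not the implementation used in this work: the `Candidate` record, the function names, the property units, the thresholds, and the numeric values are all hypothetical placeholders, and the check for two `*` markers is only a crude approximation of p-SMILES validity.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """A generated polymer with predicted properties (hypothetical record)."""
    p_smiles: str
    tg: float          # predicted glass transition temperature (assumed K)
    tdh: float         # predicted half-decomposition temperature (assumed K)
    solubility: float  # predicted solubility value (arbitrary units here)


def has_two_attachment_points(p_smiles: str) -> bool:
    """Crude sanity check: a repeat unit should carry exactly two '*' markers."""
    return p_smiles.count("*") == 2


def filter_candidates(
    candidates: Iterable[Candidate],
    min_tg: float,
    min_tdh: float,
    solubility_range: Tuple[float, float],
) -> List[Candidate]:
    """Keep candidates with high Tg and Tdh and solubility inside the range.

    The result is sorted by decreasing Tg so the most thermally robust
    candidates come first.
    """
    low, high = solubility_range
    selected = [
        c for c in candidates
        if has_two_attachment_points(c.p_smiles)
        and c.tg >= min_tg
        and c.tdh >= min_tdh
        and low <= c.solubility <= high
    ]
    return sorted(selected, key=lambda c: c.tg, reverse=True)


if __name__ == "__main__":
    # Placeholder property values for illustration only, not predictions.
    generated = [
        Candidate("*CC*", tg=200.0, tdh=650.0, solubility=17.0),
        Candidate("*CC(*)c1ccccc1", tg=380.0, tdh=640.0, solubility=19.0),
        Candidate("*Oc1ccc(*)cc1", tg=400.0, tdh=650.0, solubility=20.0),
    ]
    for c in filter_candidates(generated, 350.0, 600.0, (15.0, 25.0)):
        print(c.p_smiles, c.tg, c.tdh, c.solubility)
```

In an automated pipeline, a filter of this kind would sit between model sampling and the more expensive similarity analysis and molecular dynamics validation, so that only candidates within the property window are passed on.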