APS March Meeting 2023

Automatic, physical data extraction from scientific publications for application to generative molecular design in computational materials discovery

One of the major barriers to applying artificial intelligence (AI) in materials design and discovery is the lack of training data for machine-learning models. Despite the recent emergence of public data repositories in materials science, data formats are not standardized and the searchability of application-specific data sets is limited. This contrasts with the vast number of structured data tables available in published papers. In this contribution, we present a method and research tool for the annotation and automatic extraction of physical and chemical data tables from document files. The necessary configuration steps are: (i) defining a corpus of papers relevant to the discovery application of interest; (ii) reviewing and selecting the extracted tables and converting the files; and (iii) transforming the materials' names into a machine-readable string format. With these steps completed, we obtain an integrated data table of materials properties that is used for training the AI models. In our research, we have used this method to collect about 500 data entries covering the following polymer properties: $CO_2$ permeability and $CO_2$/$N_2$ selectivity. Currently, the number of data entries we have extracted is limited by the number of documents in the corpus. Finally, we discuss initial results obtained with AI models trained on the extracted data tables for designing high-performance membranes for carbon dioxide capture and separation.
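The integration step described above, merging reviewed tables from several papers and replacing material names with machine-readable strings, might look roughly like the following. This is a minimal sketch and not the tool presented in the abstract: the function names, the column aliases, the name-to-string lookup, and all table values are hypothetical placeholders chosen for illustration only.

```python
import csv
import io

# Hypothetical lookup from material name to a machine-readable string.
# Keys and values are placeholders, not real polymers or real notations.
NAME_TO_STRING = {
    "polymer a": "STRING_A",
    "polymer b": "STRING_B",
}

# Hypothetical header variants mapped onto one shared schema.
COLUMN_ALIASES = {
    "polymer": "name",
    "material": "name",
    "p(co2)": "co2_permeability",
    "co2 permeability": "co2_permeability",
    "co2/n2": "co2_n2_selectivity",
    "co2/n2 selectivity": "co2_n2_selectivity",
}


def normalize_row(row):
    """Rename known columns to the shared schema and drop unknown ones."""
    out = {}
    for header, value in row.items():
        if header is None:  # surplus cells beyond the header row
            continue
        key = COLUMN_ALIASES.get(header.strip().lower())
        if key is not None:
            out[key] = (value or "").strip()
    return out


def integrate(tables):
    """Merge per-paper CSV tables into one list of uniform records."""
    merged = []
    for source, csv_text in tables.items():
        for row in csv.DictReader(io.StringIO(csv_text)):
            rec = normalize_row(row)
            structure = NAME_TO_STRING.get(rec.get("name", "").lower())
            if structure is None:
                continue  # unresolved name: skipped here, would need manual review
            try:
                merged.append({
                    "source": source,
                    "structure": structure,
                    "co2_permeability": float(rec["co2_permeability"]),
                    "co2_n2_selectivity": float(rec["co2_n2_selectivity"]),
                })
            except (KeyError, ValueError):
                continue  # missing or non-numeric property value
    return merged


# Made-up example tables standing in for tables extracted from two papers.
EXAMPLE_TABLES = {
    "paper_1": "Polymer,P(CO2),CO2/N2\nPolymer A,100,20\nPolymer X,5,3\n",
    "paper_2": "Material,CO2 permeability,CO2/N2 selectivity\nPolymer B,250.5,15\n",
}

if __name__ == "__main__":
    for entry in integrate(EXAMPLE_TABLES):
        print(entry)
```

In this sketch, rows whose material name cannot be resolved are dropped rather than guessed, since a wrong structure string would feed incorrect training data to the downstream models.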