Publication
ACS Fall 2024
Talk

Dataset of Reticular Materials' Syntheses Automatically Created from PDFs by using LLMs

Abstract

Given the industrial and scientific interest in reticular materials (MOFs, COFs, ZIFs and zeolites), datasets that collect different synthesis methods for these materials could have a disruptive impact by accelerating discoveries. To the best of our knowledge, no significant dataset bringing together synthesis methods for these four kinds of reticular materials has been reported in the literature, and no off-the-shelf natural language processing (NLP) technique can automatically extract synthesis details for these materials from documents without exhaustive manual annotation. Therefore, we developed a Knowledge Extraction Pipeline (KEP) that uses prompt engineering with LLMs to extract knowledge from unstructured data using very few annotated examples. The first step of the pipeline uses IBM DeepSearch to extract paragraphs from PDFs. In the second step, the paragraphs are classified as “relevant” or “not relevant” for a given task by an LLM prompted with examples of both classes. The LLM classifies the input paragraphs through in-context learning from these examples. The third and last step extracts the critical knowledge from the relevant paragraphs and instantiates a JSON object to store it. This step also uses an LLM, prompted with examples of relevant paragraphs and their corresponding annotations, which follow a JSON template. KEP was applied to create a dataset by extracting synthesis protocols of reticular materials described in PDF files. We began by using a string search to select only public, CC-BY-licensed documents describing such syntheses. Then, we manually extracted and classified as “relevant” 10 examples of paragraphs describing four types of synthesis (solvothermal, microwave-assisted solvothermal, sonochemical, and mechanochemical) and, as “not relevant”, 10 examples of other kinds of paragraphs. We used these 20 classified paragraphs in the prompt of KEP’s second step.
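The few-shot classification in KEP's second step could be sketched as follows. This is a minimal illustration only, not the authors' implementation: the prompt wording, the function names, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions.

```python
from typing import Callable, List, Tuple

LABELS = ("relevant", "not relevant")


def build_classification_prompt(
    examples: List[Tuple[str, str]], paragraph: str
) -> str:
    """Assemble a few-shot prompt from (paragraph, label) example pairs.

    The instruction text and layout are hypothetical, not KEP's prompt.
    """
    lines = ["Classify each paragraph as 'relevant' or 'not relevant'.", ""]
    for text, label in examples:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        lines += [f"Paragraph: {text}", f"Label: {label}", ""]
    # The paragraph to classify goes last, with the label left open.
    lines += [f"Paragraph: {paragraph}", "Label:"]
    return "\n".join(lines)


def parse_label(completion: str) -> str:
    """Map a raw completion to one of the two labels."""
    answer = completion.strip().lower()
    # Test "not relevant" first, since "relevant" is a substring of it.
    if answer.startswith("not relevant"):
        return "not relevant"
    if answer.startswith("relevant"):
        return "relevant"
    raise ValueError(f"unexpected completion: {completion!r}")


def classify(
    paragraph: str,
    examples: List[Tuple[str, str]],
    llm: Callable[[str], str],
) -> str:
    """Classify one paragraph; `llm` is a caller-supplied stand-in for a model."""
    return parse_label(llm(build_classification_prompt(examples, paragraph)))
```

In the setting described above, `examples` would hold the 20 manually classified paragraphs, and `classify` would be called once per paragraph extracted from a PDF.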
In parallel, we developed JSON templates to describe the protocols of the four types of synthesis and used them to annotate the knowledge extracted from the 10 relevant paragraphs. These paragraphs and their corresponding JSON objects were used to create the prompt for KEP’s third step. Once the pipeline was finalized, we used it to process all papers returned by our string search. The result is a dataset of JSON objects storing the protocols of reticular materials’ syntheses, which we are making public on GitHub.
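The template-guided extraction in KEP's third step could be sketched as follows. This is a minimal illustration under stated assumptions: the template field names, the prompt wording, and the function names are hypothetical and do not reproduce the authors' actual JSON templates or prompts.

```python
import copy
import json
from typing import Any, Dict, List, Tuple

# Hypothetical template for one synthesis type. The field names are
# illustrative only and are not the schema used in the published dataset.
SOLVOTHERMAL_TEMPLATE: Dict[str, Any] = {
    "synthesis_type": "solvothermal",
    "product": None,
    "reagents": [],
    "solvents": [],
    "temperature": None,
    "duration": None,
}


def build_extraction_prompt(
    template: Dict[str, Any],
    examples: List[Tuple[str, Dict[str, Any]]],
    paragraph: str,
) -> str:
    """Assemble a few-shot prompt from (paragraph, annotation) pairs."""
    parts = [
        "Extract the synthesis protocol from the paragraph as a JSON "
        "object with these keys:",
        json.dumps(template, indent=2),
        "",
    ]
    for text, annotation in examples:
        parts += [f"Paragraph: {text}", f"JSON: {json.dumps(annotation)}", ""]
    # The paragraph to process goes last, with the JSON left open.
    parts += [f"Paragraph: {paragraph}", "JSON:"]
    return "\n".join(parts)


def parse_protocol(completion: str, template: Dict[str, Any]) -> Dict[str, Any]:
    """Parse the JSON object in a completion and align it with the template.

    Keys outside the template are dropped; missing keys take a copy of the
    template's default value.
    """
    start = completion.find("{")
    end = completion.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in completion")
    data = json.loads(completion[start : end + 1])
    return {
        key: data[key] if key in data else copy.deepcopy(default)
        for key, default in template.items()
    }
```

Restricting the parsed result to the template's keys keeps every record in the dataset uniform, even when a model returns extra or missing fields.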