ACS Fall 2023

Knowledge extraction pipeline with foundation models for material discovery


Material discovery processes require knowledge bases that can aid domain experts in deriving new hypotheses and managing experimental data and workflows. Unfortunately, creating knowledge bases manually is labor-intensive, time-consuming, and error-prone. The Knowledge Extraction Pipeline (KEP) is a human-in-the-loop pipeline that semi-automatically extracts knowledge from scientific literature without the need for exhaustive manual data annotation. KEP is based on the idea that knowledge is extracted from the sentences classified as relevant in a given document. It is composed of three tools. The Sentence Selection tool obtains text from PDFs and selects relevant sentences by using a Large Language Model (LLM). After expert curation of these sentences, the Knowledge Extraction tool extracts the desired knowledge by using the “table creation from unstructured data” use case provided by the LLM. The expert again has the opportunity to curate the extracted knowledge before the Knowledge Representation tool creates an RDF graph representing the knowledge obtained from the relevant sentences. The pipeline was applied to find PFAS and their applications in PDFs. Fifteen relevant sentences mentioning PFAS and applications, plus fifteen non-relevant sentences, were provided together with their classifications as context to the LLM, which was then able to identify the relevant sentences in the documents with an accuracy of 85%. Next, the “table creation from unstructured data” LLM use case was applied: 30 relevant sentences were annotated with tabular annotations highlighting the PFAS and applications mentioned in each sentence. These sentences and their annotations were provided as context to the LLM, which was then able to produce the same kind of annotation for the other relevant sentences with an accuracy of 86%. The RDF knowledge base of PFAS and applications was created from the tabular annotations provided by the LLM.
The use case demonstrated that KEP extracts relevant knowledge without the need for extensive manual annotation.
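
The first and last stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the KEP implementation: the function names, the prompt wording, the `http://example.org/kep/` namespace, the `hasApplication` predicate, and the example sentences are all hypothetical, and the call to the LLM itself is omitted because the abstract does not specify the model or its interface.

```python
from urllib.parse import quote


def build_classification_prompt(examples, sentence):
    """Assemble a few-shot prompt from (sentence, label) pairs.

    Hypothetical prompt layout: the labeled examples play the role of the
    relevant / not relevant sentences given to the LLM as context.
    """
    lines = ["Classify each sentence as relevant or not relevant."]
    for text, label in examples:
        lines.append(f"Sentence: {text}")
        lines.append(f"Label: {label}")
    # The sentence to classify comes last, with its label left open.
    lines.append(f"Sentence: {sentence}")
    lines.append("Label:")
    return "\n".join(lines)


def rows_to_ntriples(rows, base="http://example.org/kep/"):
    """Turn curated (pfas, application) rows into N-Triples lines.

    The namespace and predicate name are placeholders chosen for this sketch.
    """
    predicate = f"<{base}hasApplication>"
    triples = []
    for pfas, application in rows:
        # Percent-encode the values so they are safe inside an IRI.
        subject = f"<{base}pfas/{quote(pfas, safe='')}>"
        obj = f"<{base}application/{quote(application, safe='')}>"
        triples.append(f"{subject} {predicate} {obj} .")
    return triples


if __name__ == "__main__":
    # Illustrative data only; not taken from the study.
    examples = [
        ("PFOA is used in firefighting foam.", "relevant"),
        ("The samples were stored overnight.", "not relevant"),
    ]
    print(build_classification_prompt(examples, "PFOS is applied in coatings."))
    print("\n".join(rows_to_ntriples([("PFOA", "firefighting foam")])))
```

In a human-in-the-loop setting such as the one described, the expert would review the LLM output between these two steps, so `rows_to_ntriples` would receive only curated rows.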