ACS Spring 2024

Lessons learned in Knowledge Extraction from Unstructured Data with LLMs for Material Discovery


A large language model (LLM) is a deep learning model that can perform a variety of natural language processing (NLP) tasks, such as text summarization and generation, sentiment analysis, and knowledge extraction. Several approaches in the Material Discovery (MD) domain use LLMs instead of traditional NLP tools to extract knowledge from unstructured data, because of the limitations of those tools. Traditional NLP tools usually require massive data annotation, a labor-intensive, time-consuming, and error-prone activity, to train machine learning models on what should be extracted. Typically, every new task requires an extra annotation effort. In addition, basic NLP tasks such as tokenization face important challenges in domains where entities are named (or identified) using special characters such as commas, hyphens, parentheses, and brackets (e.g., 1,1,2,2-tetrafluoro-2-[(1,2,2-trifluoroethenyl)oxy]-, polymer). With LLMs, prompt engineering provides examples and instructions about the task to be executed without the need to annotate hundreds of examples. We tested the use of prompts in LLMs to extract knowledge from paragraphs taken from papers and patents in four use cases in MD. Prompts were composed of paragraphs followed by their annotations, which represent the knowledge to be extracted. The lessons learned are: (i) since the size of the prompt is limited by the model and LLMs are statistical in nature, the examples must be representative, in structure, of the paragraphs found in the documents; isomorphic paragraphs should be avoided; (ii) if different kinds of information are to be extracted from different paragraphs, this variability must be represented in the selected paragraphs; (iii) the format used to describe the annotations influences the output, so use a simple or well-known format (e.g., JSON); (iv) the examples should be consistently annotated, i.e., two paragraphs with similar structure must have similar annotations, and the same entity must be annotated in the same way when it appears in two paragraphs; and (v) if there is only one possible answer, set the LLM’s temperature to 0.
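
The prompt layout described above (example paragraphs followed by their annotations, then the paragraph to be processed) can be sketched in Python as follows. This is only an illustrative sketch, not the prompts used in the study: the example paragraphs, the annotation keys (material, property, value), the instruction text, and the names build_prompt, check_consistent_keys, and GENERATION_SETTINGS are all hypothetical. The key-set check is a simplified stand-in for lesson (iv), and no specific LLM API is assumed.

```python
import json

# Hypothetical few-shot examples: (paragraph, annotation) pairs.
# The paragraphs differ in structure, as lessons (i) and (ii) suggest.
EXAMPLES = [
    (
        "The membrane was cast from a Nafion solution and dried at 80 C.",
        {"material": "Nafion", "property": "drying temperature", "value": "80 C"},
    ),
    (
        "Before testing, films of poly(vinylidene fluoride) were annealed at 140 C.",
        {
            "material": "poly(vinylidene fluoride)",
            "property": "annealing temperature",
            "value": "140 C",
        },
    ),
]

# Hypothetical instruction text placed at the top of the prompt.
INSTRUCTION = "Extract the material, property, and value as JSON."


def check_consistent_keys(examples):
    """Simplified proxy for lesson (iv): all annotations share the same keys."""
    key_sets = {frozenset(annotation) for _, annotation in examples}
    return len(key_sets) <= 1


def build_prompt(examples, target_paragraph, instruction=INSTRUCTION):
    """Assemble a few-shot prompt with JSON annotations, as in lesson (iii)."""
    parts = [instruction]
    for paragraph, annotation in examples:
        parts.append(f"Paragraph: {paragraph}")
        # sort_keys keeps the annotation layout identical across examples.
        parts.append(f"Annotation: {json.dumps(annotation, sort_keys=True)}")
    # The model is expected to complete the final, empty annotation.
    parts.append(f"Paragraph: {target_paragraph}")
    parts.append("Annotation:")
    return "\n".join(parts)


# Lesson (v): with a single correct answer, use deterministic decoding.
# How this setting is passed depends on the LLM client in use (assumption).
GENERATION_SETTINGS = {"temperature": 0}

if __name__ == "__main__":
    if not check_consistent_keys(EXAMPLES):
        raise ValueError("Example annotations use different keys.")
    print(build_prompt(EXAMPLES, "The copolymer was cured at 120 C for 2 h."))
```

Serializing every annotation with the same key order keeps the examples uniform, so the model sees a single output pattern to imitate.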