Comparative Study of Open-source LLMs for Text Classification and Knowledge Extraction in the Material Discovery Domain

Abstract

Large language models (LLMs) have been used to perform a variety of natural language processing (NLP) tasks across different domains. This work compares the performance of several open-source LLMs that have not been trained or fine-tuned for any specific task in the Material Discovery domain. The comparison focuses on text classification and knowledge extraction tasks across two use cases: (i) synthesis protocols: extracting from text the details of the synthesis of reticular materials; and (ii) PFAS: extracting from text the applications in which PFAS are used and the roles those materials play in each application.

Three different prompt engineering techniques were used in the comparison of the models: zero-shot prompting, few-shot prompting, and chain-of-thought (CoT) prompting. Zero-shot prompting means that the prompt contains no examples or demonstrations. Few-shot prompting enables in-context learning by providing in the prompt all the context the LLM needs to follow the instructions, i.e., several examples that guide the model when producing the output. CoT prompting combined with few-shot prompting enables more complex reasoning, since the prompt provides, along with the examples, the reasoning steps that lead to their answers. Besides varying the prompting technique, we also selected models with different numbers of parameters and different precision data types.
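To make the three techniques concrete, the sketch below assembles prompts for the synthesis-protocol use case. It is a minimal illustration: the task wording, the worked example, and the reasoning text are hypothetical assumptions, not the prompts actually used in the study.

```python
# Illustrative construction of the three prompt types compared in this work.
# Task description, example, and reasoning are hypothetical placeholders.

TASK = (
    "Extract the synthesis details (metal source, organic linker, solvent, "
    "temperature) of the reticular material described in the text. "
    "Answer as JSON."
)

# One worked example for few-shot prompting (hypothetical).
EXAMPLE_TEXT = (
    "MOF-5 was prepared by dissolving zinc nitrate and terephthalic acid "
    "in DEF and heating the mixture at 100 C for 24 h."
)
EXAMPLE_ANSWER = (
    '{"metal_source": "zinc nitrate", "linker": "terephthalic acid", '
    '"solvent": "DEF", "temperature": "100 C"}'
)
# Reasoning steps attached to the example for CoT prompting (hypothetical).
EXAMPLE_REASONING = (
    "The metal salt dissolved is zinc nitrate, so it is the metal source. "
    "Terephthalic acid is the organic linker. The solvent is DEF, and the "
    "reaction temperature is 100 C."
)

def zero_shot(text: str) -> str:
    # No examples or demonstrations: instructions plus the input only.
    return f"{TASK}\n\nText: {text}\nAnswer:"

def few_shot(text: str) -> str:
    # In-context learning: the prompt carries a worked example.
    return (
        f"{TASK}\n\n"
        f"Text: {EXAMPLE_TEXT}\nAnswer: {EXAMPLE_ANSWER}\n\n"
        f"Text: {text}\nAnswer:"
    )

def few_shot_cot(text: str) -> str:
    # Few-shot plus CoT: each example also shows its reasoning steps.
    return (
        f"{TASK}\n\n"
        f"Text: {EXAMPLE_TEXT}\n"
        f"Reasoning: {EXAMPLE_REASONING}\nAnswer: {EXAMPLE_ANSWER}\n\n"
        f"Text: {text}\nReasoning:"
    )

if __name__ == "__main__":
    passage = (
        "HKUST-1 was synthesized from copper nitrate and trimesic acid "
        "in a water/ethanol mixture at 120 C."
    )
    print(few_shot_cot(passage))
```

In practice, several such few-shot prompts would be built from different example sets, since (as the results below indicate) the choice of in-prompt examples strongly affects each model's performance.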

Our experimental results indicate that (i) LLMs can achieve high performance with a limited set of examples in the prompt, even without training or fine-tuning the models for the domain; (ii) it is important to test different sets of examples in the prompt, since this choice greatly influences model performance; (iii) different LLMs may require different sets of examples to achieve their highest performance; (iv) a large number of parameters does not guarantee good model performance on every task; and (v) low-precision data types do not lead to a significant variation in model performance.
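As a minimal sketch of how the precision comparison in finding (v) could be set up, the snippet below loads the same open-source model at different precision data types using the Hugging Face transformers library. The model identifier is a placeholder, and the study does not specify which loading mechanism was used.

```python
# Sketch: loading one open-source LLM at different precision data types.
# The model id is a placeholder; in practice, load one variant at a time.
import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model

# Full-precision (float32) baseline.
model_fp32 = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float32
)

# Half precision (float16): roughly half the memory of float32.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
)

# bfloat16: same memory as float16 but a wider exponent range.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16
)
```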
