Large Language Model Evaluation on Financial Benchmarks
Abstract
Although large language models (LLMs) have shown outstanding performance in natural language processing (NLP), widely adopted evaluation benchmarks are still lacking in the finance domain. Such benchmarks are crucial for promoting not only industrial applications but also open-source development of financial artificial intelligence (AI). This work presents a benchmarking framework for AI in financial applications, specifically designed to evaluate LLMs in the English and Japanese finance domains. It emphasizes the use of domain-specific datasets and covers a variety of NLP tasks relevant to the financial industry. The proposed evaluation framework spans 12 publicly available English and Japanese finance datasets. The varying performance of 13 models across these financial tasks underscores the importance of selecting a model suited to the specific requirements of each task. The code for the financial benchmark framework is available on GitHub.