Foundation Models Evaluation

Overview

General-purpose language models have changed the world of natural language processing, if not the world itself. Evaluating such versatile models may look similar to evaluating the generation models that preceded them, but in practice it presents a host of new challenges and opportunities. New LLMs are released every week, and understanding their performance is critical. As Hugging Face, the most popular platform for hosting models, datasets, and metrics, puts it: “With the plethora of large language models (LLMs) and chatbots being released week upon week, often with grandiose claims of their performance, it can be hard to filter out the genuine progress that is being made by the open-source community and which model is the current state of the art.”

With that in mind, our Foundation Model Evaluation framework (FM-eval) aims to validate and evaluate new large language models (LLMs) coming out of the IBM model factory, alongside open-source LLMs, in a systematic, reproducible, and consistent way. FM-eval supports both fine-tuning and prompting (in-context learning) evaluation modes, and provides out-of-the-box academic and business benchmarks. FM-eval evaluates models in modular stages. A “basic” evaluation runs during model training to give a quick indication of the model's status. It is followed by a more comprehensive evaluation (more datasets, more templates, more seeds), and finally by a complete evaluation covering HHH (helpfulness, honesty, and harmlessness), robustness, privacy, and more.
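The staged structure described above can be pictured as a small configuration table. The following is only an illustrative sketch, not FM-eval's actual configuration format: the class EvalTier, the helper tiers_up_to, the dataset names, and all counts are hypothetical placeholders chosen to show how each stage widens the previous one.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EvalTier:
    """One evaluation stage (hypothetical structure, not FM-eval's real schema)."""
    name: str
    datasets: Tuple[str, ...]
    num_templates: int
    num_seeds: int
    extra_suites: Tuple[str, ...] = ()

    def num_runs(self) -> int:
        # One run per (dataset, template, seed) combination.
        return len(self.datasets) * self.num_templates * self.num_seeds


# Placeholder dataset names and counts, for illustration only.
_ALL = ("dataset_a", "dataset_b", "dataset_c", "dataset_d")

TIERS = (
    # Quick health check during training: few datasets, one template, one seed.
    EvalTier("basic", _ALL[:2], num_templates=1, num_seeds=1),
    # More datasets, more templates, more seeds.
    EvalTier("comprehensive", _ALL, num_templates=3, num_seeds=3),
    # Same breadth plus additional suites such as HHH, robustness, privacy.
    EvalTier("complete", _ALL, num_templates=3, num_seeds=3,
             extra_suites=("hhh", "robustness", "privacy")),
)


def tiers_up_to(name: str) -> Tuple[EvalTier, ...]:
    """Return every tier from 'basic' up to and including the named tier."""
    names = [tier.name for tier in TIERS]
    if name not in names:
        raise ValueError(f"unknown tier: {name!r}")
    return TIERS[: names.index(name) + 1]


if __name__ == "__main__":
    for tier in tiers_up_to("complete"):
        print(tier.name, tier.num_runs(), tier.extra_suites)
```

Keeping the stages in one ordered table makes the cost of each stage explicit (the number of runs grows with datasets, templates, and seeds), which is why a cheap first stage is useful while a model is still training.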

FM-eval is also designed to be flexible, making it easy to add tasks, datasets, and metrics. To support this, we developed Unitxt, an open-source Python library that provides a consistent interface and methodology for defining datasets, including the preprocessing required to convert raw datasets into the input format LLMs expect, and the metrics used to evaluate the results.

The increasing versatility of LLMs has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks carry massive computational costs, extending to thousands of GPU hours per model. The evaluation team is therefore looking into efficient evaluation, where the goal is to intelligently reduce the computational requirements of LLM evaluation while maintaining an adequate level of reliability.
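One simple way to picture the trade-off between compute and reliability is random subsampling: score only a subset of a benchmark and report the estimate together with an uncertainty interval. The sketch below is an assumption for illustration, not the team's method. The function estimate_accuracy and its parameters are hypothetical, and the interval uses a plain normal approximation.

```python
import math
import random
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def estimate_accuracy(
    examples: Sequence[T],
    is_correct: Callable[[T], bool],
    sample_size: int,
    seed: int = 0,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Estimate accuracy from a random subset of `examples`.

    Only the sampled examples are scored, so the cost scales with
    `sample_size` rather than with the size of the full benchmark.
    Returns (estimate, lower, upper), where the bounds come from a
    normal-approximation interval (z=1.96 corresponds to roughly 95%).
    """
    if not 0 < sample_size <= len(examples):
        raise ValueError("sample_size must be between 1 and len(examples)")
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = rng.sample(list(examples), sample_size)
    hits = sum(1 for example in subset if is_correct(example))
    p = hits / sample_size
    half_width = z * math.sqrt(p * (1.0 - p) / sample_size)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    # Toy benchmark: an example counts as "correct" unless its index is a multiple of 4.
    benchmark = list(range(1000))
    estimate, low, high = estimate_accuracy(benchmark, lambda i: i % 4 != 0, sample_size=200)
    print(f"accuracy ~ {estimate:.3f} (interval {low:.3f} to {high:.3f})")
```

The interval narrows roughly with the square root of the sample size, so halving its width costs about four times as many scored examples. Smarter selection strategies aim to do better than this uniform baseline.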

Given the limitations of reference-based metrics (such as ROUGE and BLEU), the evaluation team is also developing new metrics, including ones that use another language model as the evaluator (LLM-as-Judge) and other reference-less metrics.
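To make the limitation concrete, consider a toy word-overlap score. The function unigram_f1 below is a deliberately simplified stand-in for overlap metrics in the spirit of ROUGE-1, not the ROUGE or BLEU implementation, and the example sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over whitespace-separated, lowercased word overlap (simplified sketch)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    paraphrase = "a feline rested upon a rug"   # same meaning, no shared words
    wrong = "the dog sat on the mat"            # different meaning, high overlap
    print(unigram_f1(paraphrase, reference))
    print(unigram_f1(wrong, reference))
```

The faithful paraphrase shares no words with the reference and scores zero, while the sentence that changes the subject shares five of six words and scores high. Surface overlap can therefore both under-reward correct answers and over-reward incorrect ones, which motivates judge-based and reference-less metrics.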