Publication
COLM 2024
Conference paper

Large Language Model Routing with Benchmark Datasets

Abstract

The number of open-source Large Language Models (LLMs) grows daily, as does the number of available benchmark datasets used to evaluate LLMs. While some models dominate these benchmarks, no single model achieves the best accuracy in all tasks and use cases. In light of this observation, we address the challenge of selecting the best LLM from a collection of pre-trained models, given a new task. While related work relies on evaluating each candidate model on a set of labeled examples, our new formulation does not assume any labeled data from the new task is available. Instead, we repurpose a collection of benchmark datasets, which may focus on different tasks than the one at hand, to learn a "router" model for LLM selection from inputs only; this problem reduces to a collection of binary classification tasks. Empirically, our strategy consistently improves performance over using any single model for all tasks.
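
One way to picture the reduction to binary classification: for each candidate model, a classifier predicts from the input alone whether that model would answer correctly, trained on benchmark inputs where correctness is already known. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation. It assumes inputs are already embedded as vectors, uses a k-nearest-neighbor vote as the per-model classifier, and picks one model for the new task by averaging predicted correctness over that task's unlabeled inputs. The function names, the value of k, and the array layout are hypothetical.

```python
import numpy as np


def predict_correctness(bench_x, bench_correct, task_x, k=3):
    """Estimate P(model j is correct on input i) for each new-task input.

    bench_x:       (n, d) embeddings of benchmark inputs (assumed precomputed)
    bench_correct: (n, m) 0/1 matrix; [i, j] = 1 if model j answered input i correctly
    task_x:        (q, d) embeddings of unlabeled inputs from the new task
    Returns a (q, m) array of estimated correctness probabilities.
    """
    # Squared Euclidean distance between every task input and every benchmark input.
    dists = ((task_x[:, None, :] - bench_x[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest benchmark inputs for each task input: shape (q, k).
    nearest = np.argsort(dists, axis=1)[:, :k]
    # One binary classifier per model: the fraction of neighbors that model got right.
    return bench_correct[nearest].mean(axis=1)


def route(bench_x, bench_correct, task_x, k=3):
    """Pick a single model index for the whole task (assumed aggregation: mean score)."""
    scores = predict_correctness(bench_x, bench_correct, task_x, k=k)
    return int(np.argmax(scores.mean(axis=0)))
```

No labels from the new task are used here: only `task_x`, the inputs, enters the routing decision, while all supervision comes from the benchmark correctness matrix.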