EMNLP 2023
Conference paper

Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs


Techniques like Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) propose the use of In-Context Learning (ICL) for data generation and result in strong conversational agents with little human supervision. One limitation of these approaches is that they rely on very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B–40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) our proposed method yields higher-quality instruction-tuning data than Self-Instruct, (2) it improves the performance of both vanilla and instruction-tuned LMs by significant margins, and (3) smaller instruction-tuned LMs generate more useful examples than their larger un-tuned counterparts.
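Idea (b), ensembling over multiple LM outputs, can be illustrated with a minimal sketch. The abstract does not state how the outputs are compared or selected, so everything below is an assumption made for illustration: the function names `similarity` and `select_by_consensus`, the word-level `difflib` similarity measure, and the `threshold` value are all hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1].

    Assumption: a stand-in metric, not the one used in the paper.
    """
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def select_by_consensus(outputs: List[str], threshold: float = 0.5) -> Optional[str]:
    """Pick the output that agrees most with the other LMs' outputs.

    Returns None when fewer than two outputs are given or when the best
    average agreement falls below `threshold` (a hypothetical cut-off),
    in which case the synthetic example would be discarded.
    """
    if len(outputs) < 2:
        return None
    best, best_score = None, -1.0
    for i, candidate in enumerate(outputs):
        others = [o for j, o in enumerate(outputs) if j != i]
        # Average agreement of this candidate with every other output.
        score = sum(similarity(candidate, o) for o in others) / len(others)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


# Each string stands for one LM's answer to the same synthetic instruction.
print(select_by_consensus(["4", "4", "5"]))
print(select_by_consensus(["red apples", "blue sky", "green grass"]))
```

In this sketch, an example is kept only when at least part of the ensemble agrees on an answer; examples on which the LMs disagree entirely are filtered out rather than added to the instruction-tuning set.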