FineWeb-Conv: A Method for Finding Good Conversation Data
Abstract
In principle, large language models could talk more like humans naturally do if they were trained on data containing the interaction patterns of human conversation. One challenge to training such a "conversation" model, however, is that natural conversation data are relatively difficult to find. In this paper we demonstrate a method for annotating documents at scale with a 0-5 conversation score. We used a large language model to score a sample of documents for how conversational they are, and with these annotated samples we trained Snowflake-arctic-embed with a classification head that outputs a single regression score from 0 to 5. When converted to a binary classifier using a score threshold of 4, the model achieved a precision of 94\%. Our conversation-scoring approach has significant implications for data preparation in generative AI, particularly for data annotation, filtering, and quality control.
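
For concreteness, the sketch below shows one way the scorer described above could be set up: an embedding model with a single-output regression head, thresholded to obtain a binary keep/discard decision. The exact checkpoint name, maximum sequence length, the use of Hugging Face Transformers' built-in regression head, and the >= comparison at the threshold are assumptions for illustration, not details taken from the paper.

    # Minimal sketch of a 0-5 conversation scorer on top of an embedding model.
    # Assumptions (not from the paper): checkpoint name, max_length, and the
    # use of Transformers' sequence-classification head in regression mode.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "Snowflake/snowflake-arctic-embed-m"  # assumed checkpoint

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # num_labels=1 with problem_type="regression" yields a single continuous
    # output, used here as the 0-5 conversation score.
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=1, problem_type="regression"
    )

    def conversation_score(text: str) -> float:
        """Predict a conversation score for one document."""
        inputs = tokenizer(text, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return logits.squeeze().item()

    def is_conversational(text: str, threshold: float = 4.0) -> bool:
        """Binary filter: keep documents scoring at or above the threshold."""
        return conversation_score(text) >= threshold

In this setup, fine-tuning the model on the LLM-annotated samples (e.g., with a mean-squared-error objective against the 0-5 labels) would produce the regression scorer, and the threshold of 4 converts it into the binary filter whose precision is reported above.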