Publication
AAAI 2025
Workshop

Preparing Good Data for Generative AI: Challenges and Approaches (Good-Data)

Abstract

Foundation models depend heavily on the data they are trained on. Although self-supervised learning is one of their promises, it is clear that carefully processed datasets lead to better models. And while the community frequently releases datasets and models, data preparation recipes remain relatively nascent and are not fully open. In this workshop, we invite contributions and collaborations on data preparation recipes for creating and using foundation models and generative AI applications, including (but not limited to) pre-training, alignment, fine-tuning, and in-context learning. Data preparation spans data acquisition, cleaning, processing, mixtures, quality assessment, the value of data, ablation studies, safety, and governance. This workshop emphasizes responsible use and ethical considerations in data preparation (including human annotation) to address issues of diversity, bias, transparency, and privacy.