Publication
AAAI 2025
Tutorial

AI Data Transparency: The Past, the Present, and Beyond

Abstract

Data transparency has become a focal point for work on AI innovation, reproducibility, security, safety, copyright, and other societal impacts associated with AI systems. Data documentation standards, audits, and policy and technical interventions have all emerged around this shared objective. This half-day tutorial offers participants an overview of the landscape of AI data transparency. First, we share the current state of practice in AI data transparency. We examine where transparency is most lacking and what interventions, from audits to technical methods, have been introduced to remedy this. This exploration will span studies across modalities and the wider AI supply chain, from crawling to training. Second, we will outline and discuss cutting-edge approaches to documenting AI data practices. This will include a hands-on introduction to Croissant, a machine-readable metadata vocabulary that allows AI researchers and practitioners to find the data they need and use it in an interoperable way across major ML platforms and repositories. Third, we will discuss the demand for documentation approaches that address diverse user needs, and research responding to the tension between this demand and the need for standardization. In this section, we will give a brief demo of tailoring AI data documentation to specific users and introduce the Factsheets methodology for creating user-centric documentation. In all, we aim to offer participants an understanding of the present status of data transparency techniques, practice, and demands, along with insight into emerging trends.