Managing training data from untrusted partners using self-generating policies
Abstract
When training data for machine learning is obtained from many different sources, not all of which can be trusted, it is difficult to determine which data to accept and which to reject. A policy-based approach to data curation, in which policies are generated after examining the properties of the offered data, can provide a way to accept only selected data when creating a machine learning model. In this paper, we discuss the challenges of generating policies that manage training data from different sources. An efficient policy generation scheme must determine the order in which information is received, assess the trustworthiness of each partner, quickly decide which subsets of the data can add value to a complex model, and address several other issues. After providing an overview of these challenges, we propose approaches to address them and study the properties of those approaches.