Systematic Discovery of Bias in Data
Abstract
Detecting bias in data is an integral component of trustworthy and responsible ML. For researchers and data scientists, investigating and detecting the biases present in data is an important step toward correcting them and making better-informed ML decisions. Bias manifests in the form of subsets whose outcomes deviate from global expectations. Typically, researchers begin with a set of pre-defined protected/sensitive attributes and use them as the basis upon which deviation from expectation is examined. For instance, a researcher may examine under- or over-representation of a particular gender or race and adjust ML models accordingly. While this works in many settings, it is suboptimal because it covers only a small fraction of the exponentially many subsets that can be enumerated from the data's feature values. In this paper, we argue for a different approach to bias discovery. Instead of stratifying across a pre-defined set of features, we ask a more open-ended question: which subset exhibits the highest deviation between observed and expected outcomes? To answer this question, we leverage subset scanning, which efficiently maximizes measures of divergence over exponentially many combinations of feature values. We demonstrate the capabilities and advantages of subset scanning over pre-defined stratification by analyzing scanning results on the Stanford Open Policing dataset. In so doing, we uncover anomalous subsets within the data which, to the best of our knowledge, have not been reported before, and we show that such anomalies cannot be uncovered by stratifying across a set of pre-defined features.
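The abstract frames bias discovery as a search over exponentially many combinations of feature values. As a rough, purely illustrative sketch of that search space (not the paper's method or implementation), the brute-force scan below enumerates every combination of per-feature value subsets on a hypothetical toy dataset and scores each candidate by a Bernoulli log-likelihood ratio against the global outcome rate; the function names, toy records, and score function are all assumptions introduced here for illustration.

```python
import math
from itertools import combinations, product

def nonempty_subsets(values):
    """All non-empty subsets of one feature's values."""
    vals = sorted(values)
    return [set(c) for r in range(1, len(vals) + 1) for c in combinations(vals, r)]

def llr_score(obs, n, p0):
    """Bernoulli log-likelihood ratio: subset rate q vs. global rate p0 (0 if not elevated)."""
    if n == 0:
        return 0.0
    q = obs / n
    if q <= p0:
        return 0.0
    score = obs * math.log(q / p0) if obs else 0.0
    if obs < n:
        score += (n - obs) * math.log((1 - q) / (1 - p0))
    return score

def brute_force_scan(records, features):
    """Exhaustively score every combination of per-feature value subsets."""
    p0 = sum(y for _, y in records) / len(records)  # global expected positive rate
    domains = {f: {r[f] for r, _ in records} for f in features}
    best = (0.0, None)
    # Exponentially many candidates -- exactly the space that efficient
    # subset scanning is designed to search without full enumeration.
    for choice in product(*(nonempty_subsets(domains[f]) for f in features)):
        subset = dict(zip(features, choice))
        hits = [y for r, y in records if all(r[f] in subset[f] for f in features)]
        s = llr_score(sum(hits), len(hits), p0)
        if s > best[0]:
            best = (s, subset)
    return best

# Toy, hypothetical records: (feature values, binary outcome such as "was searched").
records = [
    ({"race": "a", "region": "x"}, 1), ({"race": "a", "region": "x"}, 1),
    ({"race": "a", "region": "y"}, 0), ({"race": "b", "region": "x"}, 0),
    ({"race": "b", "region": "y"}, 1), ({"race": "b", "region": "y"}, 0),
    ({"race": "c", "region": "x"}, 0), ({"race": "c", "region": "y"}, 0),
]
score, subset = brute_force_scan(records, ["race", "region"])
print(f"highest-scoring subset: {subset}, score: {score:.3f}")
```

Even in this tiny example with two features, the scan must consider (2^3 - 1) x (2^2 - 1) = 21 candidate subsets; at realistic feature cardinalities the candidate space grows exponentially, which is why the efficient divergence maximization that subset scanning provides is what makes the open-ended question above tractable.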