Better Bias Benchmarking of Language Models via Multi-factor Analysis
Abstract
Bias benchmarks are an important tool for assessing the fairness and bias of language models (LMs), but the design methodology and metrics used in these benchmarks are typically ad hoc. We argue that methods from health informatics for the design and analysis of experiments (e.g., clinical trials) can clarify which potential biases a benchmark investigates and can provide more insightful quantification and analysis of observed biases. Specifically, we propose an approach for multi-factor analysis of LM bias benchmarks. Given a benchmark, we first identify experimental factors of three types: domain factors that characterize the subject of the LM prompt, prompt factors that characterize how the prompt is formulated, and model factors that characterize the model and the parameters used. We use coverage analysis to understand which biases the benchmark data examines with respect to these factors. We then use multi-factor analyses and metrics to understand the strengths and weaknesses of the LM on the benchmark. Prior benchmark analyses have reached conclusions by comparing one to three factors at a time, typically using tables and heatmaps, without principled metrics or tests that account for the effects of many factors. We propose examining how interactions between factors contribute to bias, and we develop bias metrics across all subgroups using subgroup analysis approaches inspired by clinical trial and machine learning fairness research. We illustrate these methods by showing how they yield additional insights on the SocialStigmaQA benchmark. We discuss opportunities to create more effective, efficient, and reusable benchmarks with deeper insights by adopting more systematic multi-factor experimental design, analysis, and metrics.
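As a rough illustration of the kind of analysis the abstract describes, the sketch below applies coverage counts, subgroup bias rates, and a logistic regression with interaction terms to synthetic benchmark results. The factor names (domain, prompt_style, model), the outcome column, the data, and the library choices (pandas, statsmodels) are assumptions for demonstration only and are not taken from the paper.

```python
# Minimal sketch of coverage analysis, subgroup metrics, and a multi-factor
# model over hypothetical benchmark results.  All names and data are
# illustrative assumptions, not the paper's actual setup.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "domain": rng.choice(["housing", "employment", "healthcare"], n),
    "prompt_style": rng.choice(["direct", "hypothetical"], n),
    "model": rng.choice(["model_A", "model_B"], n),
})
# Synthetic outcome: 1 if the response is judged biased, with rates that
# vary by factor level (purely made-up effect sizes).
p = (0.25
     + 0.10 * (df["domain"] == "employment")
     + 0.15 * (df["prompt_style"] == "direct")
     + 0.05 * (df["model"] == "model_B"))
df["biased"] = rng.binomial(1, p)

# Coverage analysis: how many prompts fall into each factor combination.
coverage = df.groupby(["domain", "prompt_style", "model"]).size()
print(coverage)

# Subgroup metrics: biased-response rate for each (domain, prompt_style) subgroup.
rates = df.groupby(["domain", "prompt_style"])["biased"].mean()
print(rates)

# Multi-factor analysis: logistic regression with an interaction term to ask
# whether domain and prompt formulation jointly affect the bias rate.
fit = smf.logit("biased ~ C(domain) * C(prompt_style) + C(model)", data=df).fit(disp=0)
print(fit.summary())
```

The interaction coefficients in the fitted model play the role of the factor-interaction effects discussed above: a significant coefficient would indicate that a prompt formulation shifts the bias rate differently across domains, which a one-factor-at-a-time heatmap can obscure.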