New research examines healthcare data and machine learning models routinely used in both research and application to address bias in healthcare AI.
Artificial intelligence keeps inching its way into more and more aspects of our lives, often to great benefit. But it can come with strings attached, such as bias. AI algorithms can both reflect and propagate bias, causing unintended harm, especially in a field such as healthcare.
Real-world data give us a way to understand how AI bias emerges, how to address it and what’s at stake. That’s what we have done in our recent study,1 focused on a clinical scenario where AI systems are built on observational data collected during routine medical care. Such data often reflect underlying societal inequalities in ways that are not always obvious, and the resulting AI bias could have devastating consequences for patients’ wellbeing.
Our team of researchers, from IBM Research and Watson Health, has diverse backgrounds in medicine, computer science, machine learning, epidemiology, statistics, informatics and health equity research. The study — “Comparison of methods to reduce bias from clinical prediction models of postpartum depression,” recently published in JAMA Network Open — takes advantage of this interdisciplinary lineup to examine healthcare data and machine learning models routinely used in research and application.1
We analyzed postpartum depression (PPD) and mental health service use among a group of women who use Medicaid, a health coverage provider to many Americans. We evaluated the data and models for the presence of algorithmic bias, aiming to introduce and assess methods that could help reduce it, and found that bias could create serious disadvantages for racial and ethnic minorities.
We believe that our approach to detect, assess and reduce bias could be applied to many clinical prediction models before deployment, to help clinical researchers and practitioners use machine learning methods more fairly and effectively.
What is AI fairness, anyway?
Over the past decades, there has been a lot of work addressing AI bias. One landmark study recently showed how an algorithm, which was built to predict which patients with complex health needs would cost the health system more, disadvantaged Black patients due to unrecognized racial bias in interpreting the data.2 The algorithm used healthcare costs incurred by a patient as a proxy label to predict medical needs and provide additional care resources. While this might seem logical, it does not account for the fact that Black patients had lower costs at the same level of need as white patients in the data and were therefore missed by the algorithm.
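The proxy-label problem can be made concrete with a toy example (hypothetical numbers, not data from that study): if two groups have the same distribution of true medical need, but one incurs systematically lower costs at the same level of need, then ranking patients by cost silently excludes the lower-cost group.

```python
# Hypothetical patients: both groups contain a high-need patient, but
# group B incurs lower costs than group A at every level of need.
patients = [
    {"group": "A", "need": 0.9, "cost": 50_000},
    {"group": "A", "need": 0.5, "cost": 35_000},
    {"group": "A", "need": 0.2, "cost": 8_000},
    {"group": "B", "need": 0.9, "cost": 30_000},
    {"group": "B", "need": 0.5, "cost": 12_000},
    {"group": "B", "need": 0.2, "cost": 5_000},
]

# Allocate extra care resources to the top 2 patients by cost (the proxy):
by_cost = sorted(patients, key=lambda p: p["cost"], reverse=True)[:2]
# Allocate by true need instead:
by_need = sorted(patients, key=lambda p: p["need"], reverse=True)[:2]

print([p["group"] for p in by_cost])          # ['A', 'A'] -- group B is missed
print(sorted(p["group"] for p in by_need))    # ['A', 'B']
```

The cost-based ranking never sees race, yet the high-need patient in group B is passed over, which is the mechanism the Obermeyer study documented at scale.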
To deal with the bias — to ‘debias’ an algorithm — researchers typically measure the level of fairness in AI predictions. Fairness is often defined with respect to the relationship between a sensitive attribute, such as a demographic characteristic like race or gender, and an outcome.3 Debiasing methods try to reduce or eliminate differences across groups or individuals defined by a sensitive attribute. IBM is leading this effort by creating AI Fairness 360, an open-source python toolkit that allows researchers to apply existing debiasing methods in their work.3
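To give a sense of how one of these methods works, here is a minimal from-scratch sketch of the reweighing idea (the AI Fairness 360 implementation has considerably more machinery; the data and names here are illustrative): each training example gets weight P(group) · P(label) / P(group, label), so that under the weights, group membership and outcome become statistically independent.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Kamiran-Calders style reweighing: weight each example by
    w(g, y) = P(g) * P(y) / P(g, y), estimated from counts, so that
    group and label are independent under the weighted distribution."""
    n = len(groups)
    g_counts = Counter(groups)
    y_counts = Counter(labels)
    gy_counts = Counter(zip(groups, labels))
    return [
        (g_counts[g] * y_counts[y]) / (n * gy_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]

# Toy data: group "a" carries the positive label three times as often.
groups = ["a"] * 8 + ["b"] * 8
labels = [1, 1, 1, 1, 1, 1, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0]
weights = reweighing_weights(groups, labels)

def weighted_rate(group):
    idx = [i for i, g in enumerate(groups) if g == group]
    return sum(weights[i] * labels[i] for i in idx) / sum(weights[i] for i in idx)

# After reweighing, both groups have the same weighted positive rate.
print(weighted_rate("a"), weighted_rate("b"))  # both 0.5, up to float rounding
```

A model trained with these instance weights no longer sees a group-outcome association in its training signal, which is exactly the pre-processing strategy the toolkit's Reweighing method implements.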
But applying these techniques is not trivial.
There is no consensus on how to measure fairness, or even on what fairness means, as evidenced by the many conflicting and mutually incompatible fairness metrics. For example, should fairness be measured by comparing the models’ predictions across groups, or their accuracy? Also, in most cases it is not clearly known how and why outcomes differ by sensitive attributes like race. As a result, a great deal of prior work has been done using simulated data or simplified examples that do not reflect the complexity of real-world scenarios in healthcare.
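The incompatibility between metrics is easy to demonstrate with a toy example (hypothetical numbers): a model that is perfectly accurate in both groups still fails a prediction-based criterion like demographic parity whenever the groups' underlying base rates differ, so "compare the predictions" and "compare the accuracy" can give opposite verdicts on the same model.

```python
def positive_rate(preds):
    """Fraction of positive predictions (demographic-parity view)."""
    return sum(preds) / len(preds)

def accuracy(preds, truths):
    """Fraction of correct predictions (accuracy-parity view)."""
    return sum(p == t for p, t in zip(preds, truths)) / len(preds)

# Group A's outcome base rate is 0.5; group B's is 0.25.
truths_a, truths_b = [1, 1, 0, 0], [1, 0, 0, 0]
# The model predicts every case correctly in both groups.
preds_a, preds_b = [1, 1, 0, 0], [1, 0, 0, 0]

print(accuracy(preds_a, truths_a), accuracy(preds_b, truths_b))  # 1.0 1.0
print(positive_rate(preds_a), positive_rate(preds_b))            # 0.5 0.25
```

By accuracy parity this model is perfectly fair; by demographic parity it favors group A by 25 percentage points, and no error-free model can close that gap while the base rates differ.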
So we decided to use a real-world scenario instead. As researchers in healthcare and AI, we wanted to demonstrate how recent advances in fairness-aware machine learning approaches can be applied to clinical use cases so that people can learn and use those methods in practice.
Debiasing with Prejudice Remover and reweighing
PPD affects one in nine women in the US who give birth, and early detection has significant implications for maternal and child health. Incidence is higher among women with low socioeconomic status, such as Medicaid enrollees.5 Despite prior evidence indicating similar PPD rates across racial and ethnic groups, under-diagnosis and under-treatment have been observed among minorities on Medicaid. Varying rates of reported PPD reflect the complex dynamics of perceived stigma, cultural differences, patient-provider relationships and clinical needs in minority populations.
We focused on predicting postpartum depression and postpartum mental health care use among pregnant women in Medicaid. We used the IBM MarketScan Research Database, which contains a rich set of patient-level features for the study.
Our approach had two components. First, we assessed whether there was evidence of bias in the training data used to create the model. After accounting for demographic and clinical differences, we observed that white females were twice as likely as Black females to be diagnosed with PPD and were also more likely to use mental health services postpartum.4
This result is in contrast to what is reported in medical literature — that the incidence of postpartum depressive symptoms is comparable or even higher among minority women — and possibly points to disparity in access, diagnosis and treatment.5
It means that unless there is a documented reason to believe that white females with similar clinical characteristics to Black females in this study population would be more susceptible to developing PPD, the observed difference in outcome is likely due to bias arising from underlying inequity. In other words, machine learning models built with this data for resource allocation will favor white women over Black women.
We then successfully reduced this bias by applying two debiasing methods, reweighing and Prejudice Remover, to our models through the AI Fairness 360 toolkit. These methods mitigate bias by reducing the effect of race on predictions, either by weighting the training data or by modifying the algorithm’s objective function.
We compared the two methods to the so-called Fairness Through Unawareness (FTU) approach, which simply removes race from the model. We quantified fairness using two different metrics to offset the limitations of any single imperfect measure. We showed that the two debiasing methods resulted in models that would allocate more resources to Black females than either the baseline or the FTU model.
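Why FTU alone can fall short is worth spelling out: if some other feature is correlated with race, a model can reconstruct the racial disparity without ever seeing the sensitive attribute. A toy sketch (the feature names and data here are hypothetical, not from the study):

```python
# Hypothetical records: race is correlated with a proxy feature (say,
# plan region), and the biased historical label follows the proxy.
records = [
    {"race": "white", "region": "x", "label": 1},
    {"race": "white", "region": "x", "label": 1},
    {"race": "white", "region": "y", "label": 0},
    {"race": "black", "region": "y", "label": 0},
    {"race": "black", "region": "y", "label": 0},
    {"race": "black", "region": "x", "label": 1},
]

# An FTU "model": race is removed, so it learns to predict from region alone.
predict = lambda r: 1 if r["region"] == "x" else 0

def selection_rate(race):
    grp = [r for r in records if r["race"] == race]
    return sum(predict(r) for r in grp) / len(grp)

# Race never enters the model, yet selection rates still differ by race.
print(selection_rate("white"), selection_rate("black"))
```

Here white patients are selected at twice the rate of Black patients (2/3 versus 1/3) even though race was dropped, which is why the reweighing and Prejudice Remover models outperformed FTU in our comparison.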
As we’ve shown, clinical prediction models trained on potentially biased data can produce unfair outcomes for patients. In conducting our research we used the types of ML models increasingly applied to healthcare use cases, so our results should get both researchers and clinicians thinking about bias and ways to mitigate it before implementing AI algorithms in care.
- Park, Y. et al. Comparison of Methods to Reduce Bias From Clinical Prediction Models of Postpartum Depression. JAMA Netw Open 4, e213909 (2021).
- Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
- Bellamy, R. K. E., Mojsilovic, A., Nagar, S. et al. AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM J Res Dev 63, 4-5 (2019).
- Gress-Smith, J. L., Luecken, L. J., Lemery-Chalfant, K. & Howe, R. Postpartum Depression Prevalence and Impact on Infant Health, Weight, and Sleep in Low-Income and Ethnic Minority Women and Infants. Matern Child Health J 16, 887–893 (2011).
- Ko, J. Y., Rockhill, K. M., Tong, V. T., Morrow, B. & Farr, S. L. Trends in Postpartum Depressive Symptoms — 27 States, 2004, 2008, and 2012. MMWR Morb. Mortal. Wkly. Rep. 66, 153–158 (2017).