Wagstaff (2012) draws attention to the pervasiveness of abstract evaluation metrics that deliberately ignore or strip away problem specifics. While such metrics let practitioners compare numbers across application domains, they offer limited insight into how algorithmic decisions affect humans and shape their perception of an algorithm's correctness. Even for problems that are mathematically identical, both the real cost of an error and its cost as perceived by users may vary substantially with the specifics of the problem domain and with the user perceiving the result. While the real cost of errors has received prior attention, the perceived cost has been largely overlooked. We advocate for the inclusion of human-centered metrics that elicit error costs from humans along two dimensions: the nature of the error and the context of the user. Focusing on hate speech detection on social media, we demonstrate that even when performance is held fixed under an abstract metric such as precision, users' perception of correctness varies greatly with the nature of the errors and with user characteristics.