A critical review of online social data: Biases, methodological pitfalls, and ethical boundaries
Abstract
Online social data such as user-generated content, expressed or implicit relations among people, and behavioral traces are at the core of many popular web applications and platforms, and drive the research agendas of researchers in both academia and industry. The promises of social data are many, including understanding “what the world thinks” about a social issue, brand, product, celebrity, or other entity, as well as enabling better decision-making in a variety of fields including public policy, healthcare, and economics. However, many academics and practitioners are increasingly warning against the naïve usage of social data. They highlight that biases and inaccuracies occur at the source of the data and are also introduced along the data processing pipeline; there are also methodological limitations and pitfalls, as well as ethical boundaries and unexpected outcomes, that are often overlooked. Overlooking them can lead to wrong or inappropriate results that can be consequential. This tutorial recognizes that the rigor with which different researchers address these issues varies widely, and aims to survey and categorize common classes of data biases and pitfalls that can occur both at the sources of social data and along the prototypical data processing pipeline.