Combining labeled datasets for sentiment analysis from different domains based on dataset similarity to predict electors' sentiment
The use of social media data to mine opinions during elections has emerged as an alternative to traditional election polls. However, relying on social media data in electoral scenarios comes with a number of challenges, such as handling sentences with domain-specific terms, hate speech, noisy and informal vocabulary, sarcasm, and irony. Also, on Twitter, for instance, context may be lost because of the character limit imposed on posts. Furthermore, prediction tasks that use machine learning require labeled datasets, and it is not trivial to annotate them reliably during the short period of a campaign. Motivated by these issues, we investigate whether it is possible to use or mix curated datasets from other domains as a starting point for opinion mining tasks during elections. To avoid introducing knowledge from other domains that could end up disturbing the task, we propose to use similarity metrics that indicate whether or not a dataset should be used. We conduct a case study using the 2018 Brazilian Presidential Elections and labeled datasets for sentiment analysis from other domains. To measure the similarity between the datasets, we use the Jaccard distance and a metric based on word embeddings. Our experimental results show that, by taking into account the (dis)similarity between different domains, it is possible to achieve results closer to those that would be achieved with classifiers trained on annotated datasets from the electoral domain.
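As a rough illustration of the first similarity metric mentioned above, the following minimal sketch computes a Jaccard distance between two collections of texts. It assumes the distance is taken over the sets of lowercased, whitespace-separated tokens of each dataset; the abstract does not state the exact tokenization or unit of comparison, so the function names `vocabulary` and `jaccard_distance` and the toy sentences are hypothetical.

```python
def vocabulary(texts):
    """Collect the set of lowercased, whitespace-separated tokens in a corpus."""
    vocab = set()
    for text in texts:
        vocab.update(text.lower().split())
    return vocab


def jaccard_distance(texts_a, texts_b):
    """Jaccard distance between two corpora: 1 - |A ∩ B| / |A ∪ B|.

    Assumption: A and B are the token vocabularies of the two datasets.
    """
    vocab_a, vocab_b = vocabulary(texts_a), vocabulary(texts_b)
    union = vocab_a | vocab_b
    if not union:
        # Two empty corpora are treated as identical.
        return 0.0
    return 1.0 - len(vocab_a & vocab_b) / len(union)


# Toy data, invented for illustration only.
electoral = ["the candidate won the debate", "bad debate"]
reviews = ["the movie was bad", "great movie"]

distance = jaccard_distance(electoral, reviews)
print(distance)  # 2 shared tokens out of 8 distinct ones -> 0.75
```

A value near 0 indicates heavily overlapping vocabularies, and a value near 1 indicates almost no shared terms. How such a score is turned into a decision to include or exclude a dataset is not specified in the abstract.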