Creating Corpora for Seq2Seq Tone Rephrasing Using Social Media Posts

Paulo Cavalin; Marisa Vasconcelos; Marcelo Carpinete Grave; Claudio Pinhanez

doi:10.1109/IJCNN48605.2020.9207298

IJCNN 2020

Conference paper

01 Jul 2020

Abstract

We present a methodology to use Twitter posts to create a parallel corpus which can be used to train Seq2Seq neural networks for a tone rephrasing task. Given that people tend to post texts expressing opinions or emotions of varied intensities regarding given real-world events, the main idea is to create a corpus containing pairs of posts with opposite tones but about the same topic. By doing so, we overcome the main limitation of current tone rephrasing methods: the lack of appropriate parallel training corpora. We explore different methods to create the datasets, including some which require some level of manual labelling. The results show that a completely automatic generation from Twitter data yields training datasets which are better than those with manual interventions, and good enough for Seq2Seq models to outperform non-Seq2Seq models trained with similar data.
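
The pairing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the topic key (a hashtag here), the tone score in [-1, 1], the `threshold` value, and the `build_pairs` function are all assumptions made for illustration, and the abstract does not specify how topics or tones are actually determined.

```python
from collections import defaultdict
from itertools import product


def build_pairs(posts, threshold=0.5):
    """Pair negative-tone posts with positive-tone posts on the same topic.

    posts: iterable of (topic, text, tone_score) tuples, where tone_score
    is a hypothetical value in [-1, 1] produced by some tone classifier.
    Returns a list of (source_text, target_text) pairs.
    """
    # Group posts by topic, keeping only those with a clearly marked tone.
    by_topic = defaultdict(lambda: {"neg": [], "pos": []})
    for topic, text, score in posts:
        if score <= -threshold:
            by_topic[topic]["neg"].append(text)
        elif score >= threshold:
            by_topic[topic]["pos"].append(text)

    # Within each topic, every negative post is paired with every positive one.
    pairs = []
    for groups in by_topic.values():
        pairs.extend(product(groups["neg"], groups["pos"]))
    return pairs


# Toy data, invented for this example only.
posts = [
    ("#launch", "This launch is a disaster.", -0.8),
    ("#launch", "Loved the launch today!", 0.9),
    ("#launch", "The launch happened at noon.", 0.0),
    ("#match", "What a terrible game.", -0.7),
]
pairs = build_pairs(posts)
```

In this toy run, only the `#launch` topic has posts at both ends of the tone scale, so it is the only topic that contributes a source/target pair; neutral posts and topics with a single tone are dropped.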
