Improved Sentence-Level Arabic Dialect Classification
Abstract
The paper presents work on improved sentence-level dialect classification of Egyptian Arabic (ARZ) vs. Modern Standard Arabic (MSA). Our approach is based on binary feature functions that can be implemented with a minimal amount of task-specific knowledge. We train a featurerich linear classifier based on a linear support-vector machine (linear SVM) approach. Our best system achieves an accuracy of 89.1%on the Arabic Online Commentary (AOC) dataset (Zaidan and Callison-Burch, 2011) using 10-fold stratified cross validation: A 1.3 % absolute accuracy improvement over the results published by (Zaidan and Callison-Burch, 2014). We also evaluate the classifier on dialect data from an additional data source. Here, we find that features which measure the informalness of a sentence actually decrease classification accuracy significantly.