Machine Translation

Simple measures of bridging lexical divergence help unsupervised neural machine translation for low-resource languages



Unsupervised Neural Machine Translation (UNMT) approaches have gained widespread popularity in recent years. Although they achieve impressive translation performance using only monolingual corpora of the languages involved, they have mostly been evaluated on high-resource European language pairs such as English–French and English–German. In this paper, we explore UNMT for six low-resource Indic language pairs: Hindi–Bengali, Hindi–Gujarati, Hindi–Marathi, Hindi–Malayalam, Hindi–Tamil, and Hindi–Telugu. We additionally experiment on four European language pairs: English–Czech, English–Estonian, English–Lithuanian, and English–Finnish. We observe that the lexical divergence within a language pair plays a major role in the success of UNMT. In this context, we explore three approaches to bring the vocabularies of the two languages closer: (i) script conversion, (ii) initialization with unsupervised bilingual embeddings, and (iii) word substitution using a bilingual dictionary. We find that script conversion with a simple rule-based system benefits language pairs that have high cognate overlap but use different scripts, and that combining script conversion with dictionary-based word substitution improves UNMT performance further. Our dictionary word substitution experiments use a ground-truth bilingual dictionary, but such dictionaries can also be induced from unsupervised bilingual embeddings. We empirically demonstrate that minimizing lexical divergence using simple heuristics leads to significant BLEU improvements for both related and distant language pairs.
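The script-conversion and dictionary-substitution heuristics lend themselves to a compact illustration. The sketch below is an assumption about the general technique, not the authors' exact system: it exploits the parallel layout of Indic script blocks in Unicode, where Devanagari (U+0900–U+097F) and Bengali (U+0980–U+09FF) characters largely correspond at a fixed codepoint offset, and then applies a toy bilingual dictionary (the dictionary contents here are placeholders).

```python
# Sketch of two lexical-divergence heuristics: rule-based script
# conversion and bilingual-dictionary word substitution.
# Caveat: a few codepoint positions in the Indic blocks are unassigned
# or differ between scripts, so a production converter needs exception
# rules on top of this offset mapping.

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
OFFSET = 0x0980 - 0x0900  # Devanagari -> Bengali block offset (0x80)

def devanagari_to_bengali(text: str) -> str:
    """Shift Devanagari codepoints into the Bengali block; pass others through."""
    return "".join(
        chr(ord(ch) + OFFSET)
        if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END
        else ch
        for ch in text
    )

def substitute_with_dictionary(tokens, bidict):
    """Replace tokens found in a bilingual dictionary; keep unknown tokens as-is."""
    return [bidict.get(tok, tok) for tok in tokens]

# Devanagari KA (U+0915) maps to Bengali KA (U+0995) under the offset rule.
print(devanagari_to_bengali("\u0915"))
```

A combined pipeline, as in the paper's best-performing setting, would first script-convert the corpus and then substitute dictionary entries before UNMT training.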