Substring-based unsupervised transliteration with phonetic and contextual knowledge

Anoop Kunchukuttan; Pushpak Bhattacharyya; Mitesh M. Khapra

doi:10.18653/v1/k16-1027

CoNLL 2016

Conference paper

11 Aug 2016

Abstract

We propose an unsupervised approach for substring-based transliteration which incorporates two new sources of knowledge in the learning process: (i) context, by learning substring mappings as opposed to single-character mappings, and (ii) phonetic features, which capture cross-lingual character similarity via prior distributions. Our approach is a two-stage, iterative bootstrapping solution that vastly outperforms the state-of-the-art unsupervised transliteration method of Ravi and Knight (2009) and outperforms a rule-based baseline by up to 50% in top-1 accuracy on multiple language pairs. We show that substring-based models are superior to character-based models, and observe that their top-10 accuracy is comparable to the top-1 accuracy of supervised systems. Our method requires only a phonemic representation of the words. This is possible for many language-script combinations that have a high grapheme-to-phoneme correspondence, e.g., scripts of Indian languages derived from the Brahmi script. Hence, Indian languages were the focus of our experiments. For other languages, a grapheme-to-phoneme converter would be required.
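
The idea of encoding cross-lingual character similarity as a prior distribution can be illustrated with a minimal sketch. The abstract does not specify the feature set or the similarity measure, so everything below is an assumption made for illustration: the six toy binary features, the Devanagari and Bengali example characters, the use of cosine similarity, and the hypothetical helper names `cosine` and `phonetic_prior`.

```python
import math

# Hypothetical toy phonetic features, in this order:
# (vowel, consonant, velar, labial, voiced, aspirated).
# Illustrative values only; not the feature set used in the paper.
SRC_FEATURES = {  # Devanagari
    "क": (0, 1, 1, 0, 0, 0),
    "ग": (0, 1, 1, 0, 1, 0),
    "प": (0, 1, 0, 1, 0, 0),
}
TGT_FEATURES = {  # Bengali
    "ক": (0, 1, 1, 0, 0, 0),
    "গ": (0, 1, 1, 0, 1, 0),
    "প": (0, 1, 0, 1, 0, 0),
}


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def phonetic_prior(src_char):
    """Normalize similarities into a prior over target characters (assumed scheme)."""
    sims = {t: cosine(SRC_FEATURES[src_char], f) for t, f in TGT_FEATURES.items()}
    total = sum(sims.values())
    if total == 0:
        # No phonetic evidence at all: fall back to a uniform prior.
        return {t: 1.0 / len(sims) for t in sims}
    return {t: s / total for t, s in sims.items()}


prior = phonetic_prior("क")
```

Under these assumptions, `prior` places the most mass on the phonetically identical target character and less on characters that share fewer features. Such a prior could then bias an unsupervised learner toward phonetically plausible character mappings before any substring-level statistics are available.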
