Unsupervised learning of multilingual Short Message Service (SMS) dialect from noisy examples
Abstract
Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, and non-standard transliteration, poses considerable problems for text mining. Such corruptions are very common in instant messenger (IM) and short message service (SMS) data and adversely affect off-the-shelf text mining methods. Most existing techniques address this problem with supervised methods, but these require labels that are expensive and time-consuming to obtain. While we do not champion unsupervised methods over supervised ones when quality of results is the sole concern, we demonstrate that unsupervised methods can provide cost-effective results without the expensive human intervention needed to generate parallel labelled corpora. An unsupervised technique based on a generative model is presented that maps non-standard words to their corresponding conventional, frequent forms. A Hidden Markov Model (HMM) over a subsequence representation of words is used, with a parameterization under which the training phase reduces to clustering over vectors rather than the customary dynamic programming over sequences. A principled transformation of the maximum-likelihood "central clustering" cost function into a "pairwise similarity" based clustering cost is proposed. This transformation makes it possible to apply "subsequence kernel" based methods, which model delete and insert edit operations well. The novelty of the approach lies in avoiding the expensive (Baum-Welch) iterations normally required for HMM training, through a careful factorization of the HMM log-likelihood, and in establishing the connection between an information-theoretic cost function and the kernel approach of machine learning. Anecdotal evidence of efficacy is provided on public and proprietary data.
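To make the "central clustering" to "pairwise similarity" transformation concrete, the following standard identity illustrates the kind of rewriting involved; it is shown here for the squared-Euclidean case as a sketch only, not as this paper's exact derivation. A centroid-based cost can be expressed purely in pairwise terms,

\[
\sum_{k}\sum_{x \in C_k} \lVert x - \mu_k \rVert^2
\;=\;
\sum_{k} \frac{1}{2\lvert C_k\rvert} \sum_{x,\,x' \in C_k} \lVert x - x' \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k\rvert}\sum_{x \in C_k} x,
\]

and substituting \(\lVert x - x'\rVert^2 = K(x,x) + K(x',x') - 2K(x,x')\) for a kernel \(K\) yields an objective that depends only on pairwise similarities, so a subsequence kernel can be plugged in without ever forming explicit centroids. The sketch below shows a gap-weighted string subsequence kernel in the style of Lodhi et al., the family of kernels the abstract refers to; gaps are discounted by a decay factor, which is what lets the kernel absorb insert and delete corruptions. The tokens, the subsequence length n, and the decay lam are illustrative assumptions, not values taken from this work.

```python
def subseq_kernel(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel K_n(s, t): sums lam^(matched span)
    over all common subsequences of length n, so insertions and deletions
    are penalized geometrically rather than forbidden outright."""
    m, p = len(s), len(t)
    # Kp[i][a][b] = K'_i(s[:a], t[:b]) from the standard recursion;
    # entries left at 0.0 correctly encode the |string| < i base cases.
    Kp = [[[0.0] * (p + 1) for _ in range(m + 1)] for _ in range(n)]
    for a in range(m + 1):
        for b in range(p + 1):
            Kp[0][a][b] = 1.0          # K'_0 = 1 for all prefixes
    for i in range(1, n):
        for a in range(i, m + 1):
            Kpp = 0.0                  # running K''_i(s[:a], t[:b]) over b
            for b in range(i, p + 1):
                if s[a - 1] == t[b - 1]:
                    Kpp = lam * (Kpp + lam * Kp[i - 1][a - 1][b - 1])
                else:
                    Kpp = lam * Kpp
                Kp[i][a][b] = lam * Kp[i][a - 1][b] + Kpp
    k = 0.0                            # accumulate K_n over all match points
    for a in range(n, m + 1):
        for b in range(n, p + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * Kp[n - 1][a - 1][b - 1]
    return k

def normalized(s, t, n=2, lam=0.5):
    """Cosine-normalized kernel, so scores are comparable across lengths."""
    denom = (subseq_kernel(s, s, n, lam) * subseq_kernel(t, t, n, lam)) ** 0.5
    return subseq_kernel(s, t, n, lam) / denom if denom else 0.0

# Illustrative tokens: the noisy SMS form "2moro" scores higher against
# "tomorrow" than against an unrelated word, despite the deletions and
# the phonetic substitution at the start.
print(normalized("2moro", "tomorrow"))  # relatively high
print(normalized("2moro", "monday"))    # relatively low
```

Under a pairwise-similarity clustering objective of the kind sketched above, this behaviour is what pulls noisy variants toward the cluster of their conventional frequent form.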