Publication
ICSLP 2006
Conference paper
Maximum entropy modeling for diacritization of Arabic text
Abstract
We propose a novel modeling framework for automatic diacritization of Arabic text. The framework is based on Markov modeling, where each grapheme is modeled as a state emitting a diacritic (or none) from the diacritic space. This space is exactly defined using 13 diacritics and a null-diacritic, and covers all the diacritics used in any Arabic text. The state emission probabilities are estimated using maximum entropy (MaxEnt) models. The diacritization process is formulated as a search problem in which the most likely diacritization realization is assigned to a given sentence. We also propose a diacritization parse tree (DPT) for Arabic that allows joint representation of diacritics, graphemes, words, word contexts, morphologically analyzed units, syntactic (parse tree), semantic (parse tree), part-of-speech tags and possibly other information sources. The features used to train the MaxEnt models are obtained from the DPT. In our evaluation we obtained a 7.8% diacritization error rate (DER) and a 17.3% word diacritization error rate (WDER) on dialectal Arabic data using the proposed framework.
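To make the modeling idea concrete, the sketch below illustrates (it is not the authors' implementation) how per-grapheme emission probabilities can come from a maximum entropy model, here a multinomial logistic regression over local grapheme-context features, with diacritics assigned greedily left to right. The toy diacritic inventory, training examples, feature set, and the greedy decision rule are illustrative assumptions; the paper uses the full 14-label space, features drawn from the diacritization parse tree, and a sentence-level search for the most likely realization.

```python
# Illustrative sketch only: each grapheme is a state emitting one label from a small
# toy diacritic inventory, with emission probabilities from a MaxEnt (multinomial
# logistic regression) classifier over local context features. Data and features
# are invented for demonstration and do not reflect the paper's DPT-based features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

DIACRITICS = ["fatha", "kasra", "damma", "sukun", "null"]  # toy subset of the label space

def features(graphemes, i):
    """Local context features for the grapheme at position i."""
    return {
        "g": graphemes[i],
        "g-1": graphemes[i - 1] if i > 0 else "<s>",
        "g+1": graphemes[i + 1] if i + 1 < len(graphemes) else "</s>",
        "pos": "initial" if i == 0 else ("final" if i == len(graphemes) - 1 else "medial"),
    }

# Tiny toy training set: (undiacritized word as graphemes, per-grapheme diacritic labels).
train = [
    (list("ktb"), ["fatha", "fatha", "fatha"]),   # e.g. kataba
    (list("ktb"), ["damma", "kasra", "fatha"]),   # e.g. kutiba
    (list("drs"), ["fatha", "fatha", "fatha"]),
]

X, y = [], []
for graphemes, labels in train:
    for i, label in enumerate(labels):
        X.append(features(graphemes, i))
        y.append(label)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)  # multinomial logistic regression = MaxEnt model
clf.fit(vec.fit_transform(X), y)

def diacritize(word):
    """Greedy per-grapheme assignment of the most probable diacritic."""
    graphemes = list(word)
    out = []
    for i in range(len(graphemes)):
        probs = clf.predict_proba(vec.transform([features(graphemes, i)]))[0]
        out.append((graphemes[i], clf.classes_[probs.argmax()]))
    return out

print(diacritize("ktb"))
```

A full system would replace the greedy per-position decision with a search (e.g. Viterbi or beam search) over whole diacritic sequences and would draw features from richer sources such as the DPT described above.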