Publication
LREC 2002
Conference paper
Machine translation evaluation: N-grams to the rescue
Abstract
Human judges weigh many subtle aspects of translation quality. But human evaluations are very expensive. Developers of Machine Translation systems need to evaluate quality constantly. Automatic methods that approximate human judgment are therefore very useful. The main difficulty in automatic evaluation is that there are many correct translations that differ in choice and order of words. There is no single gold standard to compare a translation with. The closer a machine translation is to professional human translations, the better it is. We borrow precision and recall concepts from Information Retrieval to measure closeness. The precision measure is used on variable-length n-grams. Unigram matches between machine translation and the professional reference translations account for adequacy. Longer n-gram matches account for fluency. The n-gram precisions are aggregated across sentences and averaged. A multiplicative brevity penalty prevents cheating. The resulting metric correlates highly with human judgments of translation quality. This method is tested for robustness across language families and across the spectrum of translation quality. We discuss BLEU, an automatic method to evaluate translation quality that is cheap, fast, and good.
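To make the abstract's description concrete, here is a minimal sentence-level sketch of the BLEU-style computation it outlines: clipped (modified) n-gram precision against multiple references, a geometric average over n-gram orders, and a multiplicative brevity penalty. It is an illustration under simplifying assumptions, not the paper's exact metric, which aggregates n-gram counts over a whole test corpus before averaging; the function and variable names below are chosen for this example only.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference translation."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu_like_score(candidate, references, max_n=4):
    """Geometric mean of 1..max_n-gram precisions times a brevity penalty.
    Sentence-level sketch only; the published metric pools counts over a corpus."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this toy version
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the closest reference,
    # so a system cannot "cheat" by emitting only a few safe high-precision words.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

# Example: one candidate scored against two reference translations.
cand = "there is a cat on a mat".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(round(bleu_like_score(cand, refs), 3))
```

In this toy example the unigram precision is high (adequacy: most words appear in some reference) while the longer n-gram precisions drop as word order and phrasing diverge (fluency), and the brevity penalty leaves a full-length candidate unpenalized.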