BLEU, a seminal benchmark for testing the efficacy of machine translation and other natural language generation tasks, was published 20 years ago this month.
Twenty years ago, a group of IBM Researchers came together to think of a better way of understanding the quality of translations done by computers.
In the early 2000s, the state of AI had advanced to the point where it became possible to use computers to translate large blocks of text into other languages. But even then, the only way to check that the translations were accurate was to have a person who spoke both languages read through the translations, which is not exactly quick or scalable.
DARPA, the Pentagon’s advanced research division, was interested in improving machine translation, and benchmarking the efficacy of various AI systems. At the 40th annual meeting of the Association for Computational Linguistics (ACL), in Philadelphia, in July 2002, IBM Researchers Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu presented their benchmark idea, called BLEU.
BLEU became a widely used automatic metric for measuring the performance of machine translation systems, and it has since proven fundamental to the development of these technologies in industry. The original paper, released two decades ago, now has close to 20,000 citations.
“I think BLEU is among the top-three greatest ideas in natural language processing,” said Kevin Crawford Knight, the chief scientist for natural language processing at Didi Global and longtime NLP researcher.
The researchers called the method BLEU, which stands for bilingual evaluation understudy — but it also happens to be the French translation of the word blue, IBM’s favorite color. The idea behind BLEU was to evaluate the error rate for machine translation — others had previously attempted to build something like this, but they relied primarily on single-word errors or insertions.
In translation, word order often changes from one language to another while the meaning stays the same. This led the group to explore n-grams, short sequences of n consecutive words, and how often they occur in a sentence. Words have to make sense locally, but the phrases in a sentence can often be moved around and still make sense. “Translation is translating an idea,” Roukos said. “Most possible expressions of an idea are good translations.”
By relying on the frequency of n-grams rather than individual words, BLEU could allow for more flexibility in translations and measure how similar a machine translation was to one produced by a human. BLEU scores translations on a range from 0 to 1, looking at what percentage of the n-grams in the machine translation also appear in the reference translation. The more n-grams match, the higher the score.
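The n-gram matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: it assumes whitespace tokenization and a single reference, and the helper names `ngrams` and `clipped_precision` are hypothetical. Each n-gram in the machine translation is credited at most as many times as it occurs in the reference, which is the clipping rule from the 2002 paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-word sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    A candidate n-gram is credited at most as often as it appears in the
    reference, so repeating a matching word cannot inflate the score.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Illustrative sentences (assumed for this example): same words, phrases reordered.
reference = "the cat is on the mat".split()
candidate = "on the mat is the cat".split()

unigram_score = clipped_precision(candidate, reference, 1)
bigram_score = clipped_precision(candidate, reference, 2)
print(unigram_score, bigram_score)
```

In this example every individual word matches, while only three of the five two-word phrases do, so the reordered sentence still earns partial credit at the phrase level instead of being counted as entirely wrong.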
According to Roukos, BLEU worked so well because it evaluated an entire set of documents at once, rather than judging sentence by sentence. It also dramatically lowered the time and cost of checking machine translations: the same expensive reference material could be reused to rapidly assess a succession of candidate improvements to a translation system for a given language. With BLEU, it became fast and cheap to try out an experimental idea and see if that idea improved translation quality.
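To make the corpus-level idea concrete, here is a rough sketch of how one fixed set of references can score two versions of a system. The assumptions are whitespace tokenization, one reference per sentence, and 1- to 4-grams with equal weights. The brevity penalty and geometric mean follow the formulation in the 2002 paper and are not described in this article. The function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-word sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level score: pool n-gram counts over all sentences, then combine."""
    matched = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            c = ngram_counts(cand, n)
            r = ngram_counts(ref, n)
            # Clip each candidate n-gram at its count in the reference.
            matched[n - 1] += sum(min(k, r[g]) for g, k in c.items())
            total[n - 1] += sum(c.values())
    if cand_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the pooled n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Penalize output that is shorter than the reference overall.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)


# The references are produced once and reused for every system variant.
references = [
    "the cat is on the mat".split(),
    "there is a dog in the garden".split(),
]
system_a = [
    "the cat is on the mat".split(),
    "there is a dog in the garden".split(),
]
system_b = [
    "the cat sat on the mat".split(),
    "there is a dog in a garden".split(),
]

score_a = corpus_bleu(system_a, references)
score_b = corpus_bleu(system_b, references)
print(score_a, score_b)
```

Because the counts are pooled before the precisions are computed, a sentence with no matching four-word phrase does not zero out the whole score, which is one reason scoring a full test set is more stable than scoring sentence by sentence.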
But BLEU was not an overnight success. “I first heard about BLEU when the IBM folks introduced it at a DARPA PI meeting,” Knight said. “I specifically remember there were two Q&A microphones in the room, because after the talk, we all got up and formed two giant lines: One by one, we dutifully denounced BLEU.”
“Whenever the field comes together to denounce an idea, it’s usually a pretty good idea,” Knight added. “Most of us did a 180-degree turn because BLEU’s impact on our research was immediate and spectacular.”
“I remember when we heard about the crazy ideas about the BLEU score after it was presented at a DARPA meeting, and we all agreed that it was madness,” Philipp Koehn, a prominent machine translation researcher and professor at Johns Hopkins University, said. “The correlation plot with human judgment did a lot of convincing, and the pure need to have a better way of evaluating than looking at, say, 100 sentences or, even worse, cherry-picking examples.”
Although 20 years have passed since it was first unveiled, BLEU’s simplicity has allowed it to persist as the de facto measurement for the field. The algorithm has also found many applications beyond translation: it’s routinely used to evaluate NLP algorithms that generate language in tasks such as abstractive summarization. “It’s now just part of the field,” Roukos said. There are even other metrics inspired by BLEU, such as ROUGE and BLANC.
“Since then, BLEU has proven to be persistent, its shortcomings are known and we understand what it does, which is a clear advantage on more modern model-based metrics,” Koehn added. “I am still on Team BLEU.”