Putting more knowledge at the fingertips of non-English speakers
IBM has launched a new open-source toolkit, PrimeQA, to spur progress in multilingual question-answering systems and make it easier for anyone to quickly find information on the web.
Modern AI has made trivia masters of us all. Type a question into the search bar of your computer, and an answer comes back almost instantly.
A large deep learning model for language, known as a transformer, makes this feat of artificial intelligence possible. Transformers process a string of words in a sentence, figure out how they relate, and produce a short answer summarized from information extracted from trusted articles on the web.
Transformers have given speakers of English, Mandarin, and a handful of European languages easy access to the knowledge that humans have passed down through the ages in books, research papers, and other writing. But speakers of less-dominant languages have largely missed out.
IBM Research wants to change that with the release this week of its open-source software library, PrimeQA, designed to make it easier for anyone to quickly find information on the web. PrimeQA includes tools for retrieving and reading documents, and for reading and answering questions embedded in spreadsheets, photos, and video. Like LEGO blocks, these tools can be snapped together by researchers for their own purposes, or used to replicate the work of others without reinventing the wheel.
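To make the building-block idea concrete, here is a minimal sketch of a retrieve-then-read pipeline in plain Python. It does not use PrimeQA's actual API: the function names (`overlap_retriever`, `sentence_reader`, `answer`) and the sample documents are hypothetical, and simple word overlap stands in for the transformer models a real system would use. The point is only that a retriever and a reader can be written as interchangeable pieces.

```python
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_retriever(question: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Toy retriever: rank documents by the number of tokens shared with the question."""
    q_tokens = Counter(tokenize(question))

    def score(doc: str) -> int:
        # Counter intersection keeps the minimum count of each shared token.
        return sum((q_tokens & Counter(tokenize(doc))).values())

    return sorted(documents, key=score, reverse=True)[:top_k]


def sentence_reader(question: str, passage: str) -> str:
    """Toy reader: return the sentence of the passage that best overlaps the question."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    q_tokens = set(tokenize(question))
    return max(sentences, key=lambda s: len(q_tokens & set(tokenize(s))))


def answer(
    question: str,
    documents: List[str],
    retriever: Callable[..., List[str]] = overlap_retriever,
    reader: Callable[[str, str], str] = sentence_reader,
) -> str:
    """Compose any retriever with any reader, like snapping two blocks together."""
    best_passage = retriever(question, documents)[0]
    return reader(question, best_passage)


# Hypothetical two-document collection, for illustration only.
DOCS = [
    "PrimeQA is an open-source toolkit. IBM Research released the PrimeQA toolkit.",
    "Telugu is spoken in southern India.",
]

if __name__ == "__main__":
    print(answer("Who released PrimeQA?", DOCS))
```

Because `answer` takes the retriever and reader as arguments, either piece can be swapped for a stronger model without touching the other.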
The larger goal is to make time-saving language tools available to speakers of non-Western languages, including those that are sliding toward extinction in our modern hyper-connected world. Many of these endangered languages can’t be found on the web and aren’t supported by consumer-facing AI applications like Siri or Alexa. Currently, no central repository that integrates question-answering algorithms exists; most of the top models are privately held, and the few that are open to public inspection are difficult to reuse or expand upon.
“We have no time to waste,” said Avi Sil, an IBM researcher who specializes in natural language processing. “If we pool our knowledge and create one repository for all question-answering problems, we can replicate and build on each other’s work that much faster.”
PrimeQA is based on a pair of IBM question-answering models, Dr. Decr [1] and Go Ahead Ask Me Anything (GAAMA), which can field and respond to questions in seven languages often overlooked in AI development, including Hindi and Korean, and handle basic questions in 90 other languages.
The development of both models was led by Sil, a computer scientist fluent in Hindi and Bengali who knows firsthand the frustration of not being able to quickly find information on the web. Though both languages are spoken by millions of people in South Asia, neither has been incorporated into most language models because of a shortage of text-based training examples. Thousands of other languages are in a similar predicament.
PrimeQA, Sil hopes, will change that. “We want to make sure everyone has easy access to information,” he said. This week, Sil will unveil his multilingual question-answering system, Dr. Decr, at NAACL, a top conference for natural language processing. He will also discuss PrimeQA at a workshop aimed at expanding access to information in non-English languages.
The ability to analyze, translate, and generate text with large language models has improved dramatically in recent years. Christopher Potts, an NLP researcher at Stanford University, attributes much of the progress to the free sharing of code and data. But in the subfield of question-answering, much of the code is still behind closed doors. Without it, researchers are unable to replicate and build on each other’s work, leading to duplicated efforts and stalled progress.
“We used to spend a lot of time and energy reimplementing other groups' proposals, whereas now the code is usually there for us already,” said Potts. “PrimeQA takes us one step further by expanding access to question answering models. PrimeQA will benefit researchers as well as people looking for practical solutions.”
PrimeQA makes the code behind Dr. Decr, one of the top-performing multilingual language models on the University of Washington’s XOR TyDi leaderboard, available to anyone with a computer. Unlike most of its competitor models, Dr. Decr responds to questions in the language in which they were asked. The model doesn’t need to translate the question into English first. Eliminating that extra step saves computation time, said Sil. It also allows students and other researchers to bypass language-translation services that cost money.
The model has been fine-tuned on seven typologically different languages: Arabic, Bengali, Finnish, Japanese, Korean, Russian, and Telugu. For Sil, it has meant getting fast access to facts related to his favorite Bengali movies, information that’s not available on Wikipedia pages in English.
But there’s more than movie trivia at stake. Companies are looking to reach new customers through expanded AI language tools. IBM recently added a universal language feature to Watson Assistant, allowing clients to train a chatbot in languages other than English and the 12 others that Watson currently supports.
Feed Watson Assistant five to 10 examples of a task in a new language, and it can now learn to identify the speaker’s intent well enough to deliver the relevant information, like where to get vaccinated or how to open a bank account. In a separate NAACL paper, IBM researchers showed that the new feature performed as well as Watson’s top competitors when both accuracy and training time were considered together.
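As a rough illustration of learning an intent from a handful of examples, the sketch below trains a tiny classifier on five utterances per intent. It is not Watson Assistant's method, which the article does not describe: a nearest-centroid match over character trigrams is swapped in as a simple stand-in, and the intent names and example utterances are made up. Character trigrams are used because they need no language-specific tokenizer, and this kind of model trains in well under a second on a CPU.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(examples: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Sum the n-gram counts per intent; cosine ignores scale, so a sum acts as a centroid."""
    centroids: Dict[str, Counter] = defaultdict(Counter)
    for text, intent in examples:
        centroids[intent].update(char_ngrams(text))
    return dict(centroids)


def predict(centroids: Dict[str, Counter], text: str) -> str:
    """Return the intent whose centroid is most similar to the new utterance."""
    query = char_ngrams(text)
    return max(centroids, key=lambda intent: cosine(query, centroids[intent]))


# Hypothetical training set: five examples per intent.
EXAMPLES = [
    ("where can I get vaccinated", "vaccine"),
    ("book a vaccine appointment", "vaccine"),
    ("vaccination sites near me", "vaccine"),
    ("is the vaccine free", "vaccine"),
    ("when is my vaccine shot", "vaccine"),
    ("how do I open a bank account", "bank"),
    ("open a savings account", "bank"),
    ("new checking account", "bank"),
    ("I want to open an account", "bank"),
    ("start a bank account online", "bank"),
]

CENTROIDS = train(EXAMPLES)

if __name__ == "__main__":
    print(predict(CENTROIDS, "vaccine near my home"))
    print(predict(CENTROIDS, "open account"))
```

The examples here are in English only so the behavior is easy to check by eye; nothing in the code depends on the language of the utterances.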
“To make the development of chatbot interactive, we need near-instant training times,” said Saloni Potdar, the study’s senior author and a researcher at IBM Watson Assistant AI. “This is not possible on CPUs with deep learning algorithms which require hours to train.”
This is just one part of IBM’s research into practical language applications. This week, we’re releasing two other open-source tools: Label Sleuth, which allows users with no machine-learning knowledge to build a customized text classifier; and a text-based method for attacking stock prediction models via Twitter, to help investors find and fix the security holes in their own forecasting models.
References
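1. Li, Y. et al. Learning Cross-Lingual IR from an English Retriever. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4428–4436 (Association for Computational Linguistics, 2022). doi:10.18653/v1/2022.naacl-main.329.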