Can AI help to promote endangered Indigenous languages?

IBM Research and the University of São Paulo are working with Indigenous people in Brazil to develop AI-powered writing tools to strengthen and promote languages at severe risk of decline.

From about 1600 to 1800, Nheengatu was the lingua franca of the Amazon, one of 800 Indigenous languages spoken during the Portuguese colonization of Brazil. Linguistically, Nheengatu evolved as a way for Indigenous communities to communicate with the Portuguese — and each other.

Today, Nheengatu, which translates to “good language” in English, is down to about 20,000 speakers and is on UNESCO’s list of “severely endangered” languages, meaning it’s no longer being passed down to most children. And it’s not the only language with a diminishing number of speakers. Nearly half of the world’s 7,000 languages are expected to go extinct by 2100. To try to stave off their decline, the United Nations has declared 2022-2032 the International Decade of Indigenous Languages.

The internet and AI have contributed to the decline by reinforcing the use of English and other dominant languages, which today power search and other timesaving tools. But they also offer the potential for a renaissance. With enough linguistic data, large language models (LLMs) could be used to document the world’s threatened languages. And by bringing these languages into the digital realm, LLMs could also encourage more people to learn and use them in everyday life.

Working with students at Brazil’s University of Campinas, IBM researchers recently built a prototype for an AI-powered writing assistant in Nheengatu. It’s a language that many students grew up hearing their parents and grandparents speak, but that most never learned to read or write.

The prototype can translate between Nheengatu, English, and Portuguese, as well as suggest and provide definitions for words and complete sentences. The team is also working on a spellchecker. The prototype can do all these things despite seeing just 7,000 examples of sentences translated between Nheengatu and English. (The underlying translation model, Facebook’s WMT19, by contrast, was pre-trained on half a billion sentence pairs.)

“Our training data was a thousand times smaller, and what’s amazing is still it works,” said IBM’s Claudio Pinhanez, an expert in natural language processing who leads the project. “It learns the mapping of a new language to English with just a tiny amount of data.”

“Steering” LLMs toward Indigenous languages

Pinhanez set out a year ago with colleagues at IBM Research — Brazil, and Indigenous language experts at the University of São Paulo, to see if AI could help to promote Brazil’s endangered languages. Through a linguistics student at the State University of Campinas two hours away, they connected with other Nheengatu speakers and together launched the project under the auspices of C4AI, an AI-research lab run by IBM and University of São Paulo.

Nheengatu has its roots in Old Tupi, the language spoken by the Tupinambá people who dominated Brazil’s east coast when the Portuguese arrived in the 1500s. At first, Portuguese settlers adopted Nheengatu, but as they consolidated power, Nheengatu was outlawed. When Brazil became independent in 1822, the language was again banned.

Thousands of endangered languages share a similar story. Colonizing powers often impose their languages in government and schools, displacing the native language. Industrialization and urbanization typically deal a second blow. As countries grow wealthier, more people move to cities for jobs, adopting the dominant language.

The internet has only accelerated this trend. Most content online is in a few dozen languages, providing training material for automated spellcheckers, sentence completers, and more recently, large language models that can skillfully mimic human-created content. These timesaving tools have made dominant languages even more indispensable, rendering those without an internet presence virtually invisible. Anyone trying to document or renew interest in the thousands of languages at risk of disappearing must contend with digital data that’s limited or non-existent.

“For 98% of the languages in the world, we’re never going to have the data we have for a hundred languages,” said Pinhanez. “What we’re exploring in this project is how we can take a huge translation model and steer it to do things in languages with limited amounts of data.”

Nheengatu owes its existence to colonial oppression, but it’s among the most visible of Brazil’s 200 remaining Indigenous languages. Researchers extracted sentence examples from a Nheengatu dictionary, Nheengatu pages on Wikipedia, and even a pamphlet on Covid-19 written during the pandemic.

Like Arabic, Nheengatu places important information such as verb tense at the front of words. In Nheengatu, for example, the English phrase “he does” would be represented as “he esdo.” “This makes building word completion systems much harder because they have to ‘ignore’ the prefixes to look up the word in a dictionary,” said Pinhanez.
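
To make the prefix problem concrete, here is a minimal sketch in Python of a word completer that sets aside a known prefix before searching its dictionary. It is an illustration only, not the project's implementation: the prefix list, the tiny lexicon, and the function names are all hypothetical, and the English-like entries simply extend the article's “esdo” illustration rather than reflect real Nheengatu morphology.

```python
# Toy data: hypothetical placeholders, NOT real Nheengatu morphology.
PREFIXES = ("es", "re")
LEXICON = ["do", "dance", "dream", "run", "read"]


def split_prefix(typed, prefixes=PREFIXES):
    """Return (prefix, remainder) readings of the typed text.

    The empty-prefix reading is always included, because the user may be
    typing a bare stem that merely happens to start like a prefix.
    """
    candidates = [("", typed)]
    for prefix in prefixes:
        # Only split when something is left over to match against stems.
        if typed.startswith(prefix) and len(typed) > len(prefix):
            candidates.append((prefix, typed[len(prefix):]))
    return candidates


def complete(typed, lexicon=LEXICON, prefixes=PREFIXES):
    """Suggest full words by matching the remainder against dictionary stems."""
    suggestions = []
    for prefix, stem_start in split_prefix(typed, prefixes):
        for stem in lexicon:
            if stem.startswith(stem_start):
                word = prefix + stem  # re-attach the prefix that was set aside
                if word not in suggestions:
                    suggestions.append(word)
    return suggestions


print(complete("esd"))  # looks up "d" in the lexicon, not "esd"
print(complete("re"))   # too short to split, so treated as a bare stem
```

A plain prefix match on the full typed string would find nothing for “esd” in this toy lexicon. The extra step is trying each reading of the input, with and without a prefix, and a real system would also need to rank the competing readings.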

The shortage of Nheengatu-English sentence pairs posed another challenge. When LLMs have too few examples to learn from, they have trouble extrapolating knowledge learned during training to new situations. To reduce mistakes, the researchers have focused on improving the quality of their existing data and on training the model slowly.

The Nheengatu translator grew out of exploratory work that Pinhanez, with his IBM colleagues and linguists from the University of São Paulo, did with a community that speaks Guarani Mbya, another Tupi-derived language. That project was later put on hold amid a community debate about whether children should learn Guarani Mbya through traditional means or modern digital channels.

Keenly aware of the legacy of colonialism, the researchers have made a point of having their Indigenous collaborators lead and set the terms of the project. They discussed their methodology and ethical constraints in a recent paper at the International Joint Conference on Artificial Intelligence.

What’s next

IBM and the University of São Paulo plan to open source the AI tools they have helped develop so that other Indigenous communities can adapt them to their own languages, possibly with the support of software developers. “Our goal is to have a transfer process of this technology to Indigenous organizations or startups,” said Pinhanez. Open sourcing the data, which includes public sources of information, is more complicated. “This is an ongoing discussion, but we have pledged from the beginning not to use any of the data in other AI projects without the permission of the Indigenous communities,” he said. “Any project, including IBM projects, would need explicit consent.”

The question comes down to who owns the rights to a language. If the language is English, so many people speak it that, clearly, no one “owns” it. At the other extreme, if a language has just one speaker, that person could claim ownership. “But what happens in the middle?” said Pinhanez. “Our answer is to work with the community of speakers and their leaders, to understand what they consider reasonable and helpful for the survival of their language and culture.”

In the meantime, the Nheengatu translator will continue to evolve. Languages change, and AI tools must adapt with them. As the tools are used, and translated texts are fed back into the model (with the community’s permission), its performance can be expected to improve.

Researchers are looking at ways to add safeguards so that translations of sensitive third-party or copyrighted information won’t be incorporated. They are also exploring whether synthetic data, created with the help of Indigenous language experts, could be used to improve the model.

The project could go in several directions from here. One that researchers are currently pursuing is to work with middle and high school students in the Amazon, where Nheengatu is now being taught, a seven-hour flight from São Paulo. “Young people are the future,” said Pinhanez. “A language dies when young people stop using it.”