Research
5 minute read

Teaching AI models to improve themselves

How self-specialization and deductive closure training can improve the accuracy of language models without weighing them down.

IBM Research scientists are working on different approaches to teach AI models how to improve themselves


Large language models are good at a lot of things. They’re good at building databases, labeling and analyzing data, generating code, chatting with customers, and many other functions related to generating text. What they’re not good at is checking their own work. They produce factually incorrect information and contradict themselves. But scientists at IBM Research are working on ways to address this shortcoming.

At the annual conference of the Association for Computational Linguistics (ACL), held August 11 to 16 in Bangkok, Thailand, several teams of IBM scientists and their external collaborators presented their research into ways language models can be pushed to improve themselves.

LLMs sometimes struggle when their work involves specific facts — like historical dates or legal statutes. They can also fall short in areas of human expertise where precision is important. These problems can be mitigated through strategies such as retrieval-augmented generation (RAG), re-training models on better data, fine-tuning them with subject-specific materials, or adding context through prompts. But these methods can require massive compute power and human labor, putting them at odds with the cost- and work-saving goals of AI.

In one approach to the problem of language model inaccuracy, a group of researchers used a method they call deductive closure training, where a model generates text and then uses its own training data to evaluate the accuracy and consistency of the generated text. The other strategy involves a method called self-specialization, where a generalist pre-trained large language model can be efficiently turned into a subject matter specialist with a small number of labeled seeds, vastly outperforming the base model in specific fields — like finance or biomedicine.

IBM Research scientist Leshem Choshen collaborated with researchers at MIT, Boston University, and Monash University in Indonesia to develop deductive closure training. Language models appear knowledgeable, but all they produce are predictions of words and phrases — an appearance of knowledge that doesn’t reflect a coherent grasp on the world. They don’t possess knowledge in the way that a person does. They’re also expensive and labor-intensive to retrain, edit, and bring up to date.

One past strategy for improving accuracy has been to locate specific issues or inaccuracies and correct them. With this approach, the model won’t make that particular mistake again, but the central problem remains: language models are prone to inaccuracies, and asking the same question in a different way will often return a different answer. So Choshen and his team set out to train a language model to be consistent. And given that they already had this powerful tool at their disposal, he asked, “Why not let it be both student and teacher?”

Deductive closure training uses the LLM itself to screen its own content for inaccuracies and contradictions. It can function as either supervised model updating or unsupervised improvement. The way it works is that the model starts with a seed of information and creates a cloud of related statements, irrespective of whether they’re true: starting with “the sky is green,” a model may say “Broccoli is the color of the sky,” “The sky is blue,” and so on. Generating implications like this is a classic natural language processing task, and it establishes how you expect pairs of generated statements to relate to one another.
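To make that generation step concrete, here is a minimal sketch in Python. It assumes a generic llm(prompt) text-completion callable as a stand-in for whatever model is being trained; none of the names or prompt wording below come from the paper, and the real prompting is surely more careful.

def generate_statement_cloud(seed, llm, n=5):
    # Ask the model for statements related to the seed: implications,
    # paraphrases, and contradictions, regardless of whether the seed is true.
    prompt = (
        f'Seed statement: "{seed}"\n'
        f"List {n} short statements that are implied by, equivalent to, "
        "or contradicted by the seed statement, one per line:\n"
    )
    response = llm(prompt)  # hypothetical text-completion call
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]

Starting from the seed “the sky is green,” a call like generate_statement_cloud("The sky is green", llm) would be expected to return a mix of true and false statements, which the next step sorts out.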

But this left the team with a mishmash of fact and fiction. The next step was for the model to assess the probability that each statement is true, and to find the assignment of truth values over the graph of statements that it judges most consistent. The rest of the statements are marked false; in the above example, the broccoli statement is marked untrue. The last step is fine-tuning the model on the truth values. In that example, the model is being trained in an unsupervised fashion, because no human ever rated the initial statement as true or false. Deductive closure training can also be used as a supervised update, though, by supplying seed statements that are known to be true. In experiments, Choshen and his team found they could use the supervised version of deductive closure training to increase text generation accuracy by up to 26%.
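The selection step can be sketched in the same spirit. The brute-force search below is purely illustrative, and p_true is a hypothetical helper that queries the model’s confidence that a statement is true; the team’s actual method and implementation details are in their paper.

import math
from itertools import product

def most_consistent_labels(statements, implications, p_true):
    # statements:   list of generated statement strings
    # implications: (i, j) pairs meaning statements[i] implies statements[j]
    # p_true:       hypothetical callable giving the model's probability that
    #               a statement is true (assumed strictly between 0 and 1)
    probs = [p_true(s) for s in statements]
    best_score, best_labels = float("-inf"), None
    # Try every truth assignment; the cloud grown from one seed is small.
    for labels in product([True, False], repeat=len(statements)):
        # Logical consistency: a true statement cannot imply a false one.
        if any(labels[i] and not labels[j] for i, j in implications):
            continue
        # Score the assignment by the model's own confidence in each label.
        score = sum(math.log(p if label else 1.0 - p)
                    for label, p in zip(labels, probs))
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels  # fine-tune on the statements with these truth values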

That improvement may be modest, but in the bigger picture, this work reveals tough truths about LLMs, Choshen says. One big takeaway is that they’re not consistent — because they weren’t meant to be. “Models were never factual machines,” Choshen says. “They’re text generation machines, and they give you text that’s probable to see in the world. It doesn’t have to be true.” In this case, he and his colleagues found that they could force a model to think about factuality by pounding its answers against each other, sort of a sophisticated form of inference, he says.

Other researchers have attempted to work around information issues by performing multiple inference attempts and aggregating the answers. Some have tried techniques more like what Choshen’s team is doing, fine-tuning models on self-generated text, but without the step where they check it for implications or logic.

The other team, a collaboration between MIT and the MIT-IBM Watson AI Lab, wanted to address language models’ shortcomings from a different angle, by improving their ability to handle expert-level subject matter. This team, co-led by IBM Research scientist Leonid Karlinsky, took an approach called self-specialization, which uses a combination of in-context learning and synthetic data to turn an LLM into a specialist on a given topic.

With self-specialization, the model ingests seed material about a topic area to carve out an expert model that outperforms the generalist one it came from. This material comes in the form of simple instructions and inputs, written by people, for each area. For example, they could give the model a genetics dataset and ask the model to generate a report on the gene variants and mutations it contains. With a small number of these seeds planted, the model begins generating new instructions and responses, calling on the latent expertise in its training data and using RAG to pull facts from external databases when necessary to ensure accuracy.
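In rough outline, that loop might look like the following sketch, where llm(prompt) and retrieve(query) are hypothetical stand-ins for the model and the RAG lookup; the prompt templates here are invented for illustration, not taken from the paper.

def self_specialize(seed_tasks, llm, retrieve, n_rounds=100):
    # seed_tasks: a few human-written (instruction, input) pairs for the domain
    synthetic = []
    for _ in range(n_rounds):
        # 1. Show a few seeds in-context and ask the model for a new task.
        examples = "\n\n".join(f"Instruction: {ins}\nInput: {inp}"
                               for ins, inp in seed_tasks[:3])
        new_task = llm(examples + "\n\nWrite one new task in the same domain:\n")
        # 2. Ground the answer in retrieved facts to keep it accurate.
        facts = retrieve(new_task)
        response = llm(f"Facts:\n{facts}\n\nTask:\n{new_task}\n\nResponse:")
        synthetic.append((new_task, response))
    return synthetic  # fine-tune the generalist on this data to specialize it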

In this process, the model is building on its prior training and enhancing expertise in a specific area. “With very few examples, we can produce data that’s more tailored, that’s closer to what we need,” Karlinsky says.

They tested their self-specialized models against their base model, Databricks’ MPT-30B. In experiments, self-specialization showed significant improvements in the F1 score, a measure of the balance between precision and recall, on a majority of the biomedicine and finance datasets they tested the models on. As a baseline comparison, they also looked at Alpaca and Dromedary, two models tuned from Meta’s LLaMA-65B base model. Those models showed only modest improvements over the base model in handling biomedical content after receiving generalist alignment. The team’s self-specialized models also performed on par with or better than two other models that were pre-trained on medical content, and the team showed that even those models could be further improved by self-specialization.
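For reference, F1 is the harmonic mean of precision and recall, so a model only scores well when it is both right about what it flags and thorough about what it finds:

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)  # share of flagged items that were correct
    recall = tp / (tp + fn)     # share of relevant items that were found
    return 2 * precision * recall / (precision + recall)

# Example: 80 true positives, 20 false positives, and 40 false negatives
# give precision 0.80, recall ~0.67, and F1 ~0.73.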

These self-specialized models are lightweight, acting like satellites around the main model, says Karlinsky. They can be called upon when a classifier or an API call requests their expertise, he says, but they aren’t using compute power when they’re not needed. The models also showed improvements in F1 score across knowledge areas beyond the ones they’d just self-specialized on.
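That satellite arrangement amounts to a simple routing pattern, sketched below for illustration only: classify(prompt) is a hypothetical topic classifier, and specialists maps topic names to the specialized models. Nothing here is taken from the team’s implementation.

def route(prompt, base_model, specialists, classify):
    topic = classify(prompt)  # e.g. "biomedicine", "finance", or None
    # Call a specialist only when one matches; otherwise fall back to the
    # generalist, so idle specialists consume no compute.
    model = specialists.get(topic, base_model)
    return model(prompt)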

Following up on this work, Karlinsky and his colleagues have found a way to merge multiple separately specialized models back together. In experiments, this approach produced a model that improves upon all the specialized models — and of course the base model — in all specialization areas.
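The article doesn’t describe the merging technique itself, but one common way to combine same-architecture checkpoints is plain parameter averaging, sketched here purely to illustrate the idea rather than as the team’s method.

import torch

def average_merge(state_dicts):
    # Average corresponding weights across several specialist checkpoints
    # that share a single architecture (a uniform "model soup").
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged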