A team of IBMers built a “matching pairs” classifier that compares responses from fine-tuned models against those of candidate base models. They devised a method to select prompts that would elicit clues about the models’ underlying training data. If a pair of responses was close enough, it likely meant the models were related.
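A minimal sketch of that comparison, assuming TF-IDF features, cosine similarity, and a fixed threshold as stand-ins for the team’s learned pair classifier (the probe responses below are invented for illustration):

```python
# Sketch: compare a tuned model's responses with candidate base models'
# responses to the same probe prompts. TF-IDF + cosine similarity is an
# assumed stand-in for the actual matching-pairs classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pair_similarity(tuned_responses, base_responses):
    """Mean cosine similarity over matched (tuned, base) response pairs."""
    vec = TfidfVectorizer().fit(tuned_responses + base_responses)
    scores = [
        cosine_similarity(vec.transform([t]), vec.transform([b]))[0, 0]
        for t, b in zip(tuned_responses, base_responses)
    ]
    return sum(scores) / len(scores)


def likely_related(tuned_responses, base_responses, threshold=0.6):
    """Fixed threshold as a stand-in for a learned pair classifier."""
    return pair_similarity(tuned_responses, base_responses) >= threshold


# Responses to the same probe prompts, one list per model (illustrative).
tuned = ["The capital of France is Paris.", "Water boils at 100 degrees Celsius."]
base_a = ["Paris is the capital of France.", "Water boils at 100 degrees at sea level."]
base_b = ["I cannot answer that.", "Please consult a textbook."]

print(likely_related(tuned, base_a))  # higher similarity -> plausibly related
print(likely_related(tuned, base_b))  # lower similarity -> plausibly unrelated
```

In practice, the signal comes from responses to carefully chosen probe prompts, and a trained classifier takes the place of the fixed threshold.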
“Genetics tells you how people are related,” said Rawat, at IBM. “It’s the same thing with LLMs, except their characteristics are encoded in their architecture, and the data and algorithms used to train them.”
Rawat and his colleagues recently presented their work at this year’s Association for Computational Linguistics (ACL) conference. (Check out their demo here.) “The advantage of automating the attribution task with ML is that you can find the origins of one particular model in a sea of models,” he said.
Other tools are being built to provide insight into an LLM’s behavior, allowing users to trace its output to the prompts and data points that produced it. One recent algorithm uses contrastive explanations to show how a slightly reworded prompt can change the model’s prediction. For example, adding the word “jobs” to the news headline, “Many technologies may be a waste of time and money,” can cause the model to categorize the story as “business” instead of “science and technology.” Another new algorithm can pick out the training data that most contributed to the model’s response.
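As a rough illustration of the contrastive idea (not the published algorithm), the sketch below trains a toy headline classifier and reports how its class probabilities shift when a prompt is minimally edited; the training headlines, labels, and scikit-learn pipeline are all assumptions for demonstration:

```python
# Sketch of a contrastive explanation: minimally edit the input and
# compare class probabilities before and after. The toy classifier and
# training headlines are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Company adds thousands of jobs as profits rise",       # business
    "Jobs report beats expectations, markets rally",         # business
    "Researchers unveil faster chip technology",              # sci/tech
    "New technologies promise breakthroughs in computing",    # sci/tech
]
train_labels = ["business", "business", "sci/tech", "sci/tech"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)


def contrast(original, edited):
    """Show how a small edit shifts the predicted class probabilities."""
    for text in (original, edited):
        probs = {c: round(float(p), 2)
                 for c, p in zip(clf.classes_, clf.predict_proba([text])[0])}
        print(f"{clf.predict([text])[0]:>9}  {probs}  <- {text!r}")


contrast(
    "Many technologies may be a waste of time and money",
    "Many technologies may be a waste of time and money, jobs",
)
```

The point of the contrast is not the toy model’s verdict but the comparison itself: the minimally edited prompt exposes which words are driving the prediction.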
IBM has long advocated for explainable, trustworthy AI. In 2018, IBM was the first in the industry to launch a free library of bias mitigation algorithms, the AI Fairness 360 (AIF360) toolkit, and incorporate bias mitigation and explainability into its own products. These features are embedded in watsonx and will be strengthened with the November release of watsonx.governance, a toolkit for driving responsible, transparent, and explainable AI workflows.
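For a flavor of what those bias mitigation algorithms do, here is a minimal sketch with the open-source AIF360 package: it measures disparate impact on a small synthetic hiring table and applies the toolkit’s Reweighing pre-processor. The toy data and group definitions are assumptions for illustration.

```python
# Minimal AIF360 sketch: measure disparate impact on synthetic hiring data,
# then apply the Reweighing pre-processing mitigator.
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# Toy data: 'sex' is the protected attribute (1 = privileged group),
# 'hired' is the favorable outcome.
df = pd.DataFrame({
    "sex":   [1, 1, 1, 1, 0, 0, 0, 0],
    "score": [7, 5, 8, 6, 7, 5, 8, 6],
    "hired": [1, 1, 1, 0, 1, 0, 0, 0],
})
dataset = BinaryLabelDataset(
    df=df, label_names=["hired"], protected_attribute_names=["sex"],
    favorable_label=1, unfavorable_label=0,
)
priv, unpriv = [{"sex": 1}], [{"sex": 0}]

metric = BinaryLabelDatasetMetric(dataset, unprivileged_groups=unpriv,
                                  privileged_groups=priv)
print("Disparate impact before:", metric.disparate_impact())

# Reweighing adjusts instance weights so favorable outcomes are balanced
# across groups; the metric accounts for those weights.
reweighed = Reweighing(unprivileged_groups=unpriv,
                       privileged_groups=priv).fit_transform(dataset)
metric_rw = BinaryLabelDatasetMetric(reweighed, unprivileged_groups=unpriv,
                                     privileged_groups=priv)
print("Disparate impact after: ", metric_rw.disparate_impact())
```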
IBM will also continue to work on a broad set of transparency tools available to everyone. “Source attribution is a key to making foundation models trustworthy,” said IBM researcher Kush Varshney. “If you know the source of what you’re reading, you can evaluate its accuracy and whether it’s been plagiarized or has been improperly leaked.”