IBM researchers are developing AI-text detection and attribution tools to make generative AI more transparent and trustworthy.
Large language models are the ultimate polyglots, code-switching easily from the language of lawyers and marketers to that of rap artists and poets. But LLMs are not so great at detecting content they themselves wrote or tracing a tuned model to its source. As generative AI continues to reshape day-to-day communication, researchers are working on new tools to make generative AI more explainable.
Today, anyone with a computer can pull a foundation model off the internet and adapt it for another use. Thanks to an AI architecture known as a transformer, foundation models can take a mountain of raw data — text, code, or images scraped from the Web — and infer its underlying structure. That makes foundation models easy to tailor to a wide variety of tasks with a small amount of extra data and fine-tuning.
Enterprises could see enormous productivity gains with foundation models that can automate time-consuming tasks; it’s what inspired the recent launch of watsonx, IBM’s new AI and data platform. But foundation models, especially those capable of generating new content, also pose new risks.
These models can leak sensitive information, scale the spread of misinformation, and make it easier to plagiarize or steal others’ work. They have turned intellectual property on its head by blurring the line between creators and consumers. If a work of art has been transformed with AI, who owns it?
Foundation models, and their tuned offspring, are also proliferating rapidly. Today, more than 12,000 text-generating models alone are available on Hugging Face, the open-source AI platform. “If you know the source of these models, you can gauge the reputation of their creators and the training data they used,” said IBM researcher Ambrish Rawat. “Transparency can help to build trust and ensure accountability.”
IBM is currently adapting its trustworthy AI toolkit for the foundation model era and developing tools to make generative AI more transparent. Here's some of the latest work coming out of IBM Research.
ChatGPT has amassed more than 100 million users since its release in November, making it one of the fastest-growing applications ever. “You just open your browser and use it,” said IBM researcher Pin-Yu Chen, an expert on AI safety. “AI-generated content can now spread much faster.”
The spread of generative AI has raised concerns about not only a coming deluge of disinformation but also rampant plagiarism. In June, a law firm was fined $5,000 after one of its lawyers used ChatGPT to write a court brief filled with hallucinated cases. Many schools and academic conferences have banned AI-generated content, including top AI-research venues like the International Conference on Machine Learning (ICML).
Years before generative AI became a household phrase, IBM Research and Harvard helped develop one of the first AI-text detectors, GLTR. Since then, a cottage industry has emerged: some AI-text detectors look for a tell-tale watermark embedded in artificially generated text. Others analyze the statistical relationships among words to differentiate human from machine-written text.
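The statistical approach mentioned above can be sketched in miniature. The snippet below is a toy illustration only: it scores text by how predictable its words are under a word-probability table (the table values here are invented stand-ins; a real detector like GLTR reads probabilities from an actual language model). Machine-generated text tends to favor high-probability words, so an unusually high average log-probability is a weak signal of AI authorship.

```python
import math

# Hypothetical unigram probabilities standing in for a language model's
# word distribution. A real detector (e.g., GLTR) queries an LLM instead.
WORD_PROB = {"the": 0.06, "of": 0.04, "model": 0.01, "is": 0.03,
             "a": 0.05, "good": 0.01}

def avg_log_prob(text, floor=1e-6):
    """Average per-word log-probability; higher means more 'predictable' text."""
    words = text.lower().split()
    return sum(math.log(WORD_PROB.get(w, floor)) for w in words) / len(words)

def looks_machine_written(text, threshold=-5.0):
    """Machine text tends to be statistically unsurprising under the model."""
    return avg_log_prob(text) > threshold
```

The threshold here is arbitrary; in practice it would be calibrated on labeled human and machine text.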
Not surprisingly, evasion techniques are also evolving. A popular one disguises AI-generated text by rewording it, often using another LLM. It’s this vulnerability that Chen and his colleagues exploit with their new tool, RADAR. It identifies text that’s been paraphrased to fool AI-text detectors. (Check out their demo here).
Chen has spent most of his career devising new ways to attack computer vision models to find and fix hidden security flaws. Here, he pits a pair of language models against each other to accomplish a similar end: one model paraphrases a snippet of AI-generated text; the other decides whether an AI generated it. The game continues until both models can generate and identify highly nuanced AI-paraphrased text.
“Each time the detector fails, it learns what samples made it fail,” said Chen. “The paraphraser iteratively trains the detector to be robust to the original and paraphrased AI-generated texts.” In experiments, RADAR outperformed leading AI-text detectors at both tasks, Chen reports in a paper that hasn’t yet been peer-reviewed.
Chen and his colleagues are next aiming to address prompt-injection attacks, in which a bad actor uses a carefully worded prompt to coax a generative model into leaking proprietary data or spewing toxic comments.
To prevent leaks of client data, IBM recently joined others like Apple and JP Morgan in restricting employee access to third-party models like ChatGPT. If safety controls are improved, said Chen, more employees could be able to experiment with and use generative AI in their work.
Once a piece of text has been confirmed as AI-generated, the next challenge is finding the model that produced it. But identifying AI models in the wild, with no equivalent of a license plate or serial number, is harder than it sounds. As a first step, a group led by Hugging Face and the AI-security startup Robust Intelligence launched an attribution challenge last year to spur new research in an emerging field called AI attribution.
“The more you know about the base model and its training data, the more you know about the security, ethical, and operational risks of the downstream model,” said Hyrum Anderson, a Robust Intelligence researcher who helped organize the contest. “It’s important not just to identify sources of misinformation; it’s also critical for managing AI supply chain risk.”
During the challenge, researchers were asked to trace two dozen tuned models to their source foundation models. Through natural language and AI-created prompts, researchers prodded the tuned models into giving up clues that could point them to the correct parent, including GPT-2, BLOOM, and XLNet.
A team of IBMers built a “matching pairs” classifier to compare responses from the tuned models with those of candidate base models. They devised a method to select prompts that would elicit clues about the models’ underlying training data. If a pair of responses were close enough, it likely meant the models were related.
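The matching idea can be illustrated with a minimal sketch. Here the model responses are hard-coded strings and similarity is plain bag-of-words cosine similarity; these are stand-ins for the team's actual prompts, models, and classifier, which the article does not detail. The attribution step simply picks the base model whose response most resembles the tuned model's.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Word-count vector for a response."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute(tuned_response, base_responses):
    """Guess the parent: the base model with the most similar response."""
    scores = {name: cosine(bag_of_words(tuned_response), bag_of_words(resp))
              for name, resp in base_responses.items()}
    return max(scores, key=scores.get), scores

# Hypothetical responses to the same probe prompt.
base_responses = {
    "model-A": "the quick brown fox jumps over the lazy dog",
    "model-B": "stock prices rallied after the earnings report",
}
tuned_response = "a quick brown fox leaped over a lazy dog"
best, scores = attribute(tuned_response, base_responses)
```

A real system would use embedding-based similarity and many probe prompts, but the decision rule is the same: close responses suggest shared ancestry.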
“Genetics tells you how people are related,” said Rawat, at IBM. “It’s the same thing with LLMs, except their characteristics are encoded in their architecture, and the data and algorithms used to train them.”
Rawat and his colleagues recently presented their work at this year’s Association for Computational Linguistics (ACL) conference. (Check out their demo here). “The advantage of automating the attribution task with ML is that you can find the origins of one particular model in a sea of models,” he said.
Other tools are being built to provide insight into an LLM’s behavior, allowing users to trace its output to the prompts and data points that produced it. One recent algorithm uses contrastive explanations to show how a slightly reworded prompt can change the model’s prediction. For example, adding the word “jobs” to the news headline, “Many technologies may be a waste of time and money,” can cause the model to categorize the story as “business” instead of “science and technology.” Another new algorithm can pick out the training data that most contributed to the model’s response.
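A contrastive explanation of this kind can be sketched with a toy classifier. The keyword weights below are invented for illustration (the real algorithm explains a trained model rather than a lookup table), but they reproduce the article's example: adding "jobs" to the headline tips the score from one category to the other, and the contrastive report surfaces exactly that flip.

```python
# Hypothetical keyword weights standing in for a trained news classifier.
WEIGHTS = {
    "business": {"jobs": 2.5, "money": 1.0, "market": 1.5},
    "sci-tech": {"technologies": 2.0, "time": 0.5, "waste": 0.5},
}

def classify(headline):
    """Score a headline per category and return the top label."""
    words = headline.lower().replace(",", "").split()
    scores = {label: sum(kw.get(w, 0.0) for w in words)
              for label, kw in WEIGHTS.items()}
    return max(scores, key=scores.get)

def contrastive_edit(headline, edited):
    """Report whether a small wording change flips the prediction."""
    before, after = classify(headline), classify(edited)
    return before, after, before != after

original = "Many technologies may be a waste of time and money"
edited = "Many technologies may be a waste of time and money and jobs"
before, after, flipped = contrastive_edit(original, edited)
```

The contrastive framing points the user at the minimal edit responsible for the change, rather than at the whole input.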
IBM has long advocated for explainable, trustworthy AI. In 2018, IBM was the first in the industry to launch a free library of bias mitigation algorithms, the AI Fairness 360 (AIF360) toolkit, and incorporate bias mitigation and explainability into its own products. These features are embedded in watsonx and will be strengthened with the November release of watsonx.governance, a toolkit for driving responsible, transparent, and explainable AI workflows.
IBM will also continue to work on a broad set of transparency tools available to everyone. “Source attribution is a key to making foundation models trustworthy,” said IBM researcher Kush Varshney. “If you know the source of what you’re reading, you can evaluate its accuracy and whether it’s been plagiarized or has been improperly leaked.”