IBM Research unveils a benchmark aimed at training foundation models to perform specialized tasks.
Iconic benchmarks like ImageNet and GLUE have helped drive progress in AI by motivating people to develop better, more powerful algorithms for naming animals in photos and sorting good movie reviews from bad.
But what happens when your car breaks down? What you really need is an expert on auto engines, not dog breeds or rom-coms. Few of today’s deep-learning models are equipped with that kind of specialized, but highly useful, knowledge.
Into this technical expertise void comes FETA, a first-of-its-kind measure of how well an AI can read technical documentation crammed with words and images — things like machinery manuals, sales catalogs, and how-to guides. IBM researchers at NeurIPS this year will release FETA (short for Foundation models for Expert Task Applications), which includes an open-source tool for automatically converting illustrated technical material into data for training and evaluating expert AIs.
“There’s been a lot of progress in modeling what I’d call ‘the internet,’” said the study’s senior author, Leonid Karlinsky, a computer vision expert at the MIT-IBM Watson AI Lab. “We have these huge models trained on photos of household objects, dog breeds, and what people did on their vacation. But what I really think could be useful is AI that has the capacity to help us with boring, everyday tasks.”
As AI models get bigger, they've become astoundingly good at imitating humans. With a simple prompt, foundation models can fabricate Economist-style magazine covers or captions for never-before-seen images. Their seemingly magical powers of mimicry come from the billions of unlabeled articles, photos, and videos on the web that they have learned from, which together form a foundation of general-purpose knowledge; hence their name, foundation models.
With the right training data, these jack-of-all-trades AIs can be adapted to many practical tasks. But finding, labeling, and converting that specialized data to a machine-readable form can be tricky, especially if the data includes graphics or technical drawings that only loosely resemble their real-life counterparts. FETA tackles this problem by automatically generating labels for abstract and ordinary images alike.
“There’s a lot of data out there that’s not natural images, or comments on social media,” said study co-author Assaf Arbelle, a vision researcher at IBM Research. “Manuals, diagrams, pamphlets, microscopy: Our goal is to transform this data into training examples for more business and science-oriented machine-learning tasks.”
FETA starts by extracting images and text from PDF documents and converting them to a machine-readable format using IBM’s Deep Search toolkit. FETA next finds the right words to match each image. Because technical documents often have more words than images, this can be a maddening process. If FETA isn’t sure about a caption, it selects up to five candidates from the same page. It then uses a technique called multiple-instance learning to weigh the options, compare them to similar instances, and pick the best match, a process that becomes faster and easier as the model iteratively improves.
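The candidate-selection step described above can be illustrated with a toy sketch. The article does not spell out FETA's scoring function, so this example assumes that images and captions have already been embedded as plain vectors and that the best caption is simply the candidate with the highest cosine similarity, a max-over-the-bag rule commonly used in multiple-instance learning. The function names and the vectors are hypothetical and are not part of FETA's released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_caption(image_vec, candidate_vecs, max_candidates=5):
    """Pick the best-matching caption from a 'bag' of candidates.

    Assumption: the bag holds at most `max_candidates` captions taken from
    the same page as the image, and the bag's score is the maximum
    similarity over its members.
    Returns (index_of_best_candidate, its_similarity).
    """
    bag = candidate_vecs[:max_candidates]
    if not bag:
        raise ValueError("at least one candidate caption is required")
    scores = [cosine(image_vec, c) for c in bag]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy example: one image embedding and three candidate caption embeddings.
image = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
index, score = best_caption(image, candidates)
print(index, round(score, 3))
```

In a training loop, the selected pair would be fed back as a training example, so that later rounds score candidates with a better model. That feedback loop is omitted here.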
The researchers used FETA to craft three AI-labeled datasets (a public dataset of manuals for discontinued cars, a copyright-protected dataset of Honda manuals, and a public dataset of four decades’ worth of IKEA catalogs) to show that FETA can automatically label a hodgepodge of natural images, too.
Finally, the group ran experiments to test how well a language-vision foundation model like OpenAI’s CLIP would perform before and after training on a stack of AI-labeled car manuals. The results were promising. In the best-case scenario, an out-of-the-box CLIP model could accurately predict the car part that went with the caption, and vice versa, about half of the time. After training on the automatically labeled car manuals, it got the answer right about 70% of the time.
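To make the two-way evaluation concrete, here is a minimal sketch of how image-to-caption and caption-to-image accuracy could be computed from a similarity matrix. It assumes top-1 retrieval and that image i belongs with caption i. The article does not specify the exact metric, and the function name and the numbers below are invented for illustration, not taken from the study.

```python
def top1_retrieval_accuracy(sim):
    """Top-1 retrieval accuracy in both directions.

    `sim[i][j]` is the similarity between image i and caption j.
    Assumption: the matrix is square and the correct pairs lie on the diagonal.
    Returns (image_to_text_accuracy, text_to_image_accuracy).
    """
    n = len(sim)
    if n == 0 or any(len(row) != n for row in sim):
        raise ValueError("expected a non-empty square similarity matrix")

    # For each image (row), check that its highest-scoring caption is the right one.
    image_to_text = sum(
        1 for i in range(n) if max(range(n), key=lambda j: sim[i][j]) == i
    )
    # For each caption (column), check that its highest-scoring image is the right one.
    text_to_image = sum(
        1 for j in range(n) if max(range(n), key=lambda i: sim[i][j]) == j
    )
    return image_to_text / n, text_to_image / n


# Toy similarity matrix for three image-caption pairs (made-up scores).
similarities = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.3],
    [0.7, 0.1, 0.6],  # image 2 is wrongly matched to caption 0
]
i2t, t2i = top1_retrieval_accuracy(similarities)
print(f"image-to-text: {i2t:.2f}, text-to-image: {t2i:.2f}")
```

With a real model, the matrix would come from comparing image and text embeddings, and the same function could be run before and after fine-tuning to compare the two scores.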
Still, a driver with some automotive knowledge — someone who took shop class in high school or grew up listening to Car Talk on NPR — could probably do better. One takeaway is that digesting a car manual is harder for both humans and machines than classifying golden retrievers or kitchen tables.
“We’re dealing with diagrams and text about relationships between car parts described in words and images,” said David Cox, co-director of the MIT-IBM lab. “This study is a sign that the field is ready to tackle these harder problems.”
To that end, IBM has made FETA available to anyone to use, like the dataset-benchmarks ImageNet and GLUE before it. “All of AI started with an impossible task that over time became trivial,” said study co-author Amit Alfassy, a PhD student at Technion. “Expert AI is now an open problem. Over time, we hope these tasks will also become trivial and that our tool can make life easier for small business owners who can’t train these huge models on their own.”