IBM Granite now has eyes
IBM’s new vision-language model for enterprise AI can extract knowledge locked away in tables, charts, and other graphics, bringing enterprises closer to automating a range of document understanding tasks.
A picture can be worth a thousand words. Flip through any annual report, and the colorful charts, tables, and graphics within help readers key in on the important points.
Data visualizations make complex information more accessible, and even memorable, distilling a sea of words and numbers into a tight, compelling story. And while AI models excel at summarizing pages of text, they often miss the big picture when it comes to tidy visualizations.
The ability to grasp the important takeaways in a chart or table involves knowing how to interpret closely entwined linguistic and graphical information. Even multimodal language models trained on both text and images can struggle to make sense of the graphical data that we humans find so compelling.
To close this gap, IBM Research set out to build an open-source vision-language model (VLM) that could analyze not only natural images but also the charts, tables, and other data visualizations that are a mainstay of enterprise reports. The first version of Granite Vision, released under an Apache 2.0 license, is now available on Hugging Face.
Granite Vision is fast and inexpensive to run. It's also competitive with other small, open-source VLMs at extracting information from the tables, charts, and diagrams featured in popular document understanding benchmarks.
Granite Vision is built on IBM's state-of-the-art 2 billion-parameter Granite language model, which includes a larger context window of 128,000 tokens, improved function calling, and greater accuracy on retrieval-augmented generation (RAG) tasks. Granite Vision was fine-tuned on about 13.7 million pages of enterprise documents and 4.2 million natural images. As with prior Granite releases, IBM rigorously vetted its training data to filter out personal, proprietary, and toxic information.
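For anyone who wants to try it, the sketch below shows one way to load the model and ask it a question about a chart using Hugging Face's transformers library. The repository ID, the chat-template preprocessing, and the example image are assumptions; the model card on Hugging Face has the authoritative usage.

```python
# Minimal sketch: querying Granite Vision about a chart image via Hugging Face transformers.
# The repo ID and the exact preprocessing shown here are assumptions; check the model card.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "ibm-granite/granite-vision-3.1-2b-preview"  # assumed repository ID

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)

image = Image.open("quarterly_report_chart.png")  # hypothetical chart image
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which region had the highest revenue in Q3, and by how much?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# Generate an answer grounded in the chart.
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```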
The model’s visual capabilities come from an encoder that turns input images into numerical visual embeddings, and a projector that translates those embeddings into text embeddings that a language model can read. During training, these representations are aligned with text embeddings corresponding to questions about the image, so that when the model is queried about a never-before-seen image, it knows how to extract the right information and generate a logical response.
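Conceptually, the projector is a small network that maps the encoder's visual embeddings into the language model's embedding space, so image patches can be consumed like text tokens. The sketch below illustrates that idea with a simple two-layer MLP projector, a common choice in LLaVA-style VLMs; it is not the actual Granite Vision implementation, and all dimensions are made up.

```python
# Schematic sketch of the encoder -> projector -> language model pipeline.
# Not the actual Granite Vision code; dimensions and module choices are illustrative.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps visual embeddings into the language model's embedding space."""
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, visual_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
        return self.proj(visual_embeddings)

# A batch of patch embeddings from the image encoder (hypothetical shapes).
patch_embeddings = torch.randn(1, 576, 1024)
projector = VisionProjector()
visual_tokens = projector(patch_embeddings)  # now readable as "soft tokens" by the LLM

# During training, these visual tokens are concatenated with the embeddings of a
# question about the image, and the language model learns to produce the answer.
print(visual_tokens.shape)  # torch.Size([1, 576, 2048])
```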
In addition to raw images, Granite Vision was aligned on nearly a hundred million question-answer pairs tied to the content of the images it trained on: 80.3 million pairs for document images and 16.3 million for natural images.
For this first Granite Vision release, researchers concentrated on document understanding, a skill that involves breaking down the layout and visual elements on a page to be able to make high-level inferences about its content.
They used Docling, IBM’s open-source document conversion toolkit, to create a structured dataset from 85 million raw PDF pages scraped from software programs and the web — things like receipts, business forms, and car accident photos.
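Docling's basic conversion API gives a feel for that first step. The snippet below is a minimal sketch using its documented DocumentConverter entry point; the file path is hypothetical, and the real pipeline involved far more processing than a single call.

```python
# Minimal sketch: converting a PDF into structured text with Docling.
# The input path is hypothetical; the real pipeline processed millions of pages.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("examples/invoice_2023_10.pdf")  # hypothetical file

# Export the parsed layout (headings, tables, figures) as Markdown,
# ready to be paired with page images for question-answer generation.
markdown = result.document.export_to_markdown()
print(markdown[:500])
```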
From a randomly selected subset of this data, they used a Mixtral LLM to generate 26 million synthetic question-answer pairs. To make the questions more challenging, they enriched the underlying documents with verbalized descriptions of graphical elements and augmented tables with additional calculations.
By making the questions more difficult, they hoped that Granite Vision would develop a deeper understanding of the material, which focused heavily on charts, tables, and diagrams visualizing business processes. They also included invoices, resumes, and other forms with pre-defined fields that machines can have trouble interpreting.
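To make that concrete, here is a toy sketch of the kind of prompt that could turn a verbalized, augmented table into harder question-answer pairs. The prompt wording, the example table, and the specific Mixtral checkpoint are illustrative assumptions rather than the team's actual pipeline.

```python
# Toy sketch of synthetic Q&A generation from a verbalized, augmented table.
# Prompt wording and model choice are illustrative, not the team's actual pipeline.
from transformers import pipeline

# A table "verbalized" into text, augmented with a derived calculation (the Q3 total).
verbalized_table = (
    "Table: Quarterly revenue by region (USD millions). "
    "North America: Q1 120, Q2 135, Q3 150. "
    "Europe: Q1 80, Q2 95, Q3 90. "
    "Derived: total Q3 revenue across regions is 240."
)

prompt = (
    "You are generating training data for a document-understanding model.\n"
    f"Source content: {verbalized_table}\n"
    "Write three question-answer pairs that require reasoning over the table, "
    "including at least one that needs arithmetic across rows or columns.\n"
    "Format each pair as 'Q: ...' and 'A: ...'."
)

# Any instruction-tuned LLM can play this role; Mixtral is used here as in the article.
generator = pipeline("text-generation", model="mistralai/Mixtral-8x7B-Instruct-v0.1")
print(generator(prompt, max_new_tokens=300)[0]["generated_text"])
```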
The targeted instruction data may explain why Granite Vision outperformed other VLMs, including some double its size or larger, on the popular ChartQA benchmark as well as IBM's new LiveXiv benchmark, which changes monthly to minimize the chances that a model has already seen the test material during training.
Having an AI in the workplace to break down visual documents can save time. It can also bring enterprises closer to automating visual reasoning tasks that are either highly repetitive or require levels of precision that only machines can achieve.
In future Granite Vision releases, researchers plan to expand the model’s capacity to analyze natural images so it can take on other types of enterprise tasks. This could include things like identifying product defects, extracting car accident information from photographs, or processing hundreds of invoices at once.
They also plan to extend the model to multi-page documents. Training LLMs and VLMs on multi-page data is challenging because model context windows are often too small. To deal with the spillover, models have to process the data at a lower resolution, which can harm performance. Generating questions that refer to several pages of information at once is also more technically challenging.
To screen incoming user prompts for dangerous or inappropriate information, the team plans to add a flexible safety module to future Granite Vision releases. The module would essentially teach the model to recognize unsafe text and imagery, via sparse attention vectors, without changing the model's weights and risking a drop in overall performance.
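As a rough illustration of the underlying idea, attention activations from a frozen model can serve as features for a small, separately trained safety classifier, so the VLM's own weights never change. The sketch below is a simplified stand-in under that assumption; the feature extraction, head selection, and classifier shown are not the team's actual design.

```python
# Simplified sketch of weight-free safety screening with attention-based features.
# The feature choice, sparse head selection, and classifier are illustrative assumptions,
# not the actual planned Granite Vision safety module.
import numpy as np
from sklearn.linear_model import LogisticRegression

# In practice, each row would hold pooled attention activations extracted from the
# frozen VLM for one labeled prompt; random values stand in for them here.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 256))   # 200 prompts, 256 attention-head features
y_train = rng.integers(0, 2, size=200)  # 0 = safe, 1 = unsafe

# Keep only a sparse subset of heads whose activations best separate the two classes.
head_scores = np.abs(X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0))
sparse_heads = np.argsort(head_scores)[-16:]

# A lightweight classifier on top of the frozen model's activations: no VLM weights change.
clf = LogisticRegression(max_iter=1000).fit(X_train[:, sparse_heads], y_train)

def is_unsafe(features: np.ndarray) -> bool:
    """Screen one incoming prompt given its pooled attention features."""
    return bool(clf.predict(features[sparse_heads].reshape(1, -1))[0])
```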
Stay tuned for future updates to Granite Vision.