A new tool to unlock data from enterprise documents for generative AI
IBM’s new open-source toolkit, Docling, allows developers to more easily convert PDFs, manuals, and slide decks into specialized data for customizing enterprise AI models and grounding them on trusted information.
Today’s foundation models have been trained on just about every scrap of publicly available information on the internet — text, images, and video. But much of the data that’s valuable to enterprises is stashed away in PDFs, annual reports, slide decks, and other complex business documents.
Docling, IBM’s new open-source toolkit, is designed to more easily unearth that information for generative AI applications. The toolkit streamlines the process of turning unstructured documents into JSON and Markdown files that are easy for large language models (LLMs) and other foundation models to digest. Once machine-readable, this data can be used to train and customize AI models and AI agents for enterprise tasks and to ground them on the latest, most accurate facts through retrieval-augmented generation (RAG).
Docling offers a command-line interface and a Python API, and it is small enough to run on a standard laptop. It takes just five lines of code to set up, and it integrates seamlessly with open-source LLM frameworks like LlamaIndex and LangChain for RAG and question-answering applications. Its permissive, open-source MIT license allows developers to collaborate and expand the project to meet their needs.
Since it was open-sourced in July, Docling has received more than 8,000 stars on GitHub and praise in social forums across the internet. “The output quality is the best of all the open-source solutions,” one developer recently wrote on Reddit.
Traditionally, developers have relied on optical character recognition (OCR) to digitize documents, a technology that can be error-prone and slow because of its heavy computational demands. Docling sidesteps OCR when it can, in favor of computer vision models trained to recognize and categorize the visual elements on a page.
“Avoiding OCR reduces errors, and it also speeds up the time-to-solution by 30 times,” says Peter Staar, an IBM researcher who helped build Docling.
Docling is built on top of two models designed by IBM researchers. One is a vision model that uses object-detection techniques to dissect the layout of a page in documents as varied as machine operating manuals and annual reports. It identifies and classifies blocks of text, images, tables, captions, and other elements. Trained on nearly 81,000 manually labeled pages from patents, manuals, and 10-K filings, IBM’s model came within five percentage points of human performance in correctly identifying footnotes, titles, and other page elements, researchers reported when it was released.
The second model, TableFormer, is designed to transform image-based tables into machine-readable formats with rows and columns of cells. Tables are a rich source of information, but because many of them lie buried in paper reports, they’ve historically been difficult for machines to parse. TableFormer was developed for IBM’s earlier DeepSearch project to excavate this data. In internal tests, TableFormer outperformed leading table-recognition tools.
The research team behind IBM and Red Hat’s InstructLab project used Docling to extract reams of information from targeted PDFs to train InstructLab’s underlying AI models.
It was also used to process 2.1 million PDFs from the Common Crawl, transforming raw internet data into useful AI training data. In the near future, the team plans to use Docling to process 1.8 billion PDFs and integrate the extracted data into a forthcoming IBM Granite multimodal model.
Additionally, Docling is part of Watson Document Understanding, which contributes to a number of IBM software products, including watsonx.ai.
Researchers plan to build out Docling’s capabilities so that it can handle more complex data types, including math equations, charts, and business forms.
Their overall aim is to unlock the full potential of enterprise data for AI applications, from analyzing legal documents to grounding LLM responses on corporate policy documents to extracting insights from technical manuals.
Later this month, Red Hat is expected to integrate Docling into its RHEL AI operating system as it did with InstructLab earlier this year, allowing companies to fine-tune AI models on their own data.
“The biggest roadblock we encounter in working with clients is preparing their proprietary data for InstructLab to use,” said Akash Srivastava, a researcher at IBM and principal AI product advisor at Red Hat who co-developed InstructLab's underlying technology. “There isn’t an open-source tool of Docling’s caliber out there. A lot of innovative research went into solving this problem of how to make knowledge in textbooks and PDFs accessible for LLMs and RAG.”