Deep Document Understanding: IBM’s AI extracts data from complex documents

Novel deep learning architectures could help organizations, enterprises, and data scientists to easily extract data from vast collections of documents.

Artificial intelligence can extract data from documents — but often, not well enough.

As soon as a document contains uncommon fonts, poorly aligned text, or complex visuals like tables and charts, automatic data extraction stumbles. For example, during the ongoing pandemic, the vast number of COVID-19 papers distributed around the world has required deep document understanding, and it hasn’t always been easy to extract the data.

We want to help. Using novel deep learning architectures, we have developed AI models that could help organizations, enterprises, and data scientists to easily extract data from vast collections of documents. Our technology allows users to quickly customize high-quality extraction models. It transforms the documents, making it possible to use the text they contain for other downstream processes such as building a knowledge graph out of the extracted content. We are also working with the broader research community to provide high-quality venues for document intelligence research, such as the upcoming Document Intelligence Workshop co-located with KDD 2021, which has researchers from IBM Research as co-organizers and featured speakers.

TableLab and more

We describe our Deep Document Understanding (DDU) approach to extracting information from complex documents containing tables in a recent paper, “TableLab: An Interactive Table Extraction System with Adaptive Deep Learning,” unveiled at IUI 2021 during the demonstration session on April 15 at 4:00 P.M. US CDT.¹

In the paper, we describe a system that takes a few labelled examples from the user’s document collection as input. It detects tables with similar structures by clustering embeddings from the extraction model, then selects a few representative tables that have already been extracted with a pre-trained base deep learning model.
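To make the clustering step concrete, here is a minimal sketch, not the TableLab implementation. It assumes each table is already represented by a small embedding vector, uses plain k-means as a stand-in for whatever clustering the system applies, and picks the table closest to each cluster centre as the representative. The function names and the toy embeddings are hypothetical.

```python
import math


def farthest_point_init(points, k):
    """Pick k spread-out starting centroids (deterministic)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from every centroid chosen so far.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, iters=20):
    """Plain k-means; returns (centroids, cluster index per point)."""
    centroids = farthest_point_init(points, k)
    assignments = []
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


def select_representatives(points, k):
    """Return, for each cluster, the index of the point nearest its centroid."""
    centroids, assignments = kmeans(points, k)
    reps = []
    for c in range(k):
        members = [i for i, a in enumerate(assignments) if a == c]
        if members:
            reps.append(min(members, key=lambda i: math.dist(points[i], centroids[c])))
    return sorted(reps)


# Toy 2-D "table embeddings": two groups of structurally similar tables.
embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],   # e.g. invoice-style tables
    [5.0, 5.1], [5.1, 5.0], [5.05, 5.05],   # e.g. financial-report tables
]
print(select_representatives(embeddings, 2))  # [2, 5]
```

Showing the user only one table per cluster keeps the review effort small while still covering each distinct table structure in the collection.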

A scanned document displaying data in a poorly structured table. — Original document text

The same data in a structured table. — Output from table extraction

With the help of an easy-to-use interface, users provide feedback on these selections without necessarily having to identify every single error. TableLab then applies the feedback to fine-tune the pre-trained model and returns the fine-tuned model’s results to the user, who can repeat this process iteratively until they obtain a customized model with satisfactory performance.

In our initial study on common enterprise document types, such as invoices, contracts, and financial reports, we found that even a single fine-tuning round improved table boundary recognition accuracy to over 90 percent (F1) for all document types, and improved table cell structure identification accuracy by between 17 and 30 percent (F1), depending on document type.
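For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below shows one common way to compute it for table boundaries. It assumes that a predicted box counts as correct when its intersection over union (IoU) with an unmatched ground-truth box is at least 0.5. That matching rule, the box format, and the sample boxes are illustrative assumptions, not the evaluation protocol from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def f1_score(predicted, gold, threshold=0.5):
    """F1 over boxes; each gold box can be matched by at most one prediction."""
    unmatched = list(gold)
    true_positives = 0
    for box in predicted:
        # Greedily match the prediction to the best remaining gold box.
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= threshold:
            true_positives += 1
            unmatched.remove(best)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two real tables on the page; the model predicts three, one of them spurious.
gold_boxes = [(0, 0, 100, 50), (0, 60, 100, 120)]
predicted_boxes = [(0, 0, 100, 48), (0, 200, 100, 250), (5, 60, 100, 120)]
print(round(f1_score(predicted_boxes, gold_boxes), 3))  # 0.8
```

Here precision is 2/3 (one spurious table) and recall is 1 (both real tables found), which gives an F1 of 0.8.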

TableLab user interface (UI) for collecting user feedback to improve GTE table extraction

TableLab is just one of several technologies we are developing at IBM Research to improve deep document understanding. There’s more.

Another team has applied deep learning methods to tackle problems traditionally solved with classical computer vision approaches. As deep learning models require large amounts of data for training, the team creates synthetic data that maximizes the accuracy of the models, enabling the AI to analyze challenging low-quality documents. The initial results are very promising. A new model for Optical Character Recognition (OCR) trained on this synthetic data greatly boosts accuracy, both in terms of localizing text in low-quality documents and in terms of text recognition.

Our researchers have also created high-quality deep-learning models (Note 1) to extract the overall layout of the documents in an unsupervised manner.² First, a cluster detection model predicts the locations of common layout components such as headings, paragraphs, tables, and figures. A fine-grained model based on sequence encoders then predicts detailed labels for each text cell, for example identifying list levels, captions, metadata (authors, affiliations), and more.

To showcase how the combination of these techniques does the trick, we have created a video demo on the COVID-19 collection of documents (as well as other documents). The AI allows users to ask questions about the disease, which are answered based on the ingested content from various components of the documents — be it the text itself or tables, charts, and so on.

This research could help in a variety of other tasks, from getting the stats of your favorite football team to finding facts about a COVID vaccine. It could help companies extract content from an ingested collection of legal documents, index it and let business users search the data based on their needs. The users could then also ask natural language questions about the data, such as “What are our commitments to XYZ in 2022?”

Sometimes, it’s critical to be able to decipher the tiniest print on a noisy, blurry image. We hope that our research is helping create an AI that can do just that.


Notes

Note 1: This technology received the IAAI Innovative Application Award at AAAI 2021.

References

1. Wang, N. X. R., Burdick, D. & Li, Y. TableLab: An Interactive Table Extraction System with Adaptive Deep Learning. 26th International Conference on Intelligent User Interfaces 87–89 (2021).
2. Livathinos, N. et al. Robust PDF Document Conversion Using Recurrent Neural Networks. arXiv:2102.09395 [cs] (2021).
