Foundation Models for Documents


Business documents are central to many enterprise processes and lie at the heart of digital transformation. Such documents include contracts, loan applications, invoices, purchase orders, financial statements, and many more. The information in these documents is presented in natural language and is often unstructured. Understanding these documents is challenging because of complex layouts and content such as tables, charts, and infographics. It is often even more challenging when the input is a poor-quality or noisy scan, or when the OCR output is inadequate or inaccurate.

The ability to read these business documents, either programmatically or by OCR, and to interpret and extract their content so that it can be used in downstream automated business processes is referred to as Document Understanding (DU). We address DU as a multidisciplinary challenge spanning computer vision, natural language understanding, information representation, and model optimization, and in doing so advance the state of the art in document understanding.

In 2021, we participated in the DI@KDD2021 workshop and shared publications about this research (see list of publications below). In 2022 and 2023, our work on IOCR, GTE (Global Table Extraction), and KVP (Key-Value Pair) extraction was released through Watson Document Understanding (WDU) in the IBM Automation ADP product and in IBM Discovery. WDU is the target for everything we build in the DU domain.

In 2023, we began focusing on two additional tasks: building a foundation model for documents (FMD), and building Blind-KVP technology that can extract keys and values from any document, including documents the model has never seen before. The first FMD results are already available and have allowed us to improve existing KVP extraction models for invoices and utility bills, as well as document classification models.

We’re now focusing on a multimodal foundation model (MMFM), extending the work on FMD even further.

Left: An example of a Blind-KVP output. Right: Results of IOCR on a low quality scanned image.
Image capture of a document QA app using WDU and watsonx.