Document Understanding

Documents have always been, and continue to be, a significant data source for any business or corporation. For physical documents, the ability to scan and digitize them is crucial to extract their information and represent it in a way that allows for further analysis.

Digitizing documents relies on Optical Character Recognition (OCR). OCR is composed of a Detection stage, which localizes the various words in the document, and a Recognition stage, which identifies the characters that make up each detected word.
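
To make the two stages concrete, here is a minimal sketch of such a pipeline. The `detector` and `recognizer` objects, along with their `detect_words` and `recognize` methods, are hypothetical placeholders for trained models, not our actual interfaces:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WordBox:
    """Axis-aligned bounding box of a detected word, in pixel coordinates."""
    x: int
    y: int
    width: int
    height: int

@dataclass
class OcrResult:
    box: WordBox
    text: str
    confidence: float

def run_ocr(image, detector, recognizer) -> List[OcrResult]:
    """Two-stage OCR: localize word boxes, then decode the characters in each.

    `image` is an H x W (x C) array; `detector.detect_words` and
    `recognizer.recognize` are placeholder interfaces for trained models.
    """
    results = []
    for box in detector.detect_words(image):           # Stage 1: detection
        crop = image[box.y:box.y + box.height, box.x:box.x + box.width]
        text, confidence = recognizer.recognize(crop)  # Stage 2: recognition
        results.append(OcrResult(box=box, text=text, confidence=confidence))
    return results
```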

Challenges in OCR arise when documents are captured under non-ideal conditions: incorrect scanner settings, insufficient resolution, poor lighting (common with mobile capture), loss of focus, misaligned pages, and artifacts from badly printed documents. This is one of our focus areas.

Our second focus area addresses documents with irregular fonts or a variety of colors, font sizes, and backgrounds. These documents often require a more powerful technology such as Scene Text Recognition (STR), which aims to perform OCR on natural scene images taken in the wild (extracting text from street signs, name tags, logos, and billboards). Though designed for text in natural scenes, STR can provide a huge benefit for challenging documents.

Figure 1: An example of a low-quality document and the results of our OCR model, overlaid on top of the image


Data Synthesis for OCR and STR

Imagine that you are going to build a computer vision system for reading text in documents or extracting structural and visual elements. You will need a lot of data, and it has to be labeled and cleaned of human error. At some point you might realize that you need a different granularity of classes to train a better model, but acquiring newly labeled data is costly. You will likely have to make compromises or use a narrower set of training regimes, which may hurt accuracy. Now imagine that you could quickly synthesize all the data you need; how would that change the way you approach the problem?

Synthetic data is at the core of our work in document understanding and of our high-accuracy technology. Training our models requires significant amounts of data that is hard to acquire and annotate, so we are creating new methods to synthesize it, and we apply optimization techniques that exploit the fact that synthetic data can be altered at will to increase the accuracy of our architectures.
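
As a rough illustration of what such optimization can look like, the sketch below searches synthesis parameters (fonts, noise levels, layouts, and so on) directly against validation accuracy. All of the callables (`synthesize`, `train`, `evaluate`) are hypothetical placeholders; the techniques we actually use are more involved:

```python
def tune_synthesis(synthesize, train, evaluate, param_grid):
    """Data-centric tuning loop: because synthetic data is cheap to
    regenerate, synthesis parameters can be searched directly against
    validation accuracy. All callables are hypothetical placeholders.
    """
    best_params, best_accuracy = None, float("-inf")
    for params in param_grid:
        dataset = synthesize(params)   # regenerate the training set under new settings
        model = train(dataset)         # train a fresh model on it
        accuracy = evaluate(model)     # measure on a fixed, real validation set
        if accuracy > best_accuracy:
            best_params, best_accuracy = params, accuracy
    return best_params, best_accuracy
```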

We are synthesizing data for object segmentation, text recognition, NLP-based grammatical-correction models, entity grouping, semantic classification, and entity linkage.

Another advantage of synthetic data generation is the ability to control the granularity and format of the labels, which enables us to design architectures that can recognize punctuation, layout, handwritten characters, and form elements.
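
To illustrate why rendering data yourself makes label granularity a free choice, here is a minimal sketch that renders a word with Pillow and emits per-character bounding boxes alongside the word-level label. The `font_path` argument stands in for any TrueType font on disk; our production pipeline is, of course, far richer:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

from PIL import Image, ImageDraw, ImageFont

@dataclass
class CharLabel:
    char: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

@dataclass
class SyntheticSample:
    image: Image.Image
    text: str
    char_labels: List[CharLabel] = field(default_factory=list)

def synthesize_word(text: str, font_path: str, font_size: int = 32,
                    padding: int = 8) -> SyntheticSample:
    """Render a word on a white canvas and emit per-character boxes.

    Because the text is rendered rather than collected, labels of any
    granularity (word, character, punctuation) come for free and are
    always correct.
    """
    font = ImageFont.truetype(font_path, font_size)

    # Measure the full string to size the canvas.
    probe = ImageDraw.Draw(Image.new("RGB", (1, 1)))
    left, top, right, bottom = probe.textbbox((0, 0), text, font=font)

    image = Image.new("RGB", (right + 2 * padding, bottom + 2 * padding), "white")
    draw = ImageDraw.Draw(image)
    draw.text((padding, padding), text, font=font, fill="black")

    # Derive an approximate box per character by measuring successive prefixes.
    labels = []
    for i, ch in enumerate(text):
        x0 = padding + draw.textlength(text[:i], font=font)
        x1 = padding + draw.textlength(text[:i + 1], font=font)
        labels.append(CharLabel(ch, (int(x0), padding, int(x1), padding + bottom)))
    return SyntheticSample(image=image, text=text, char_labels=labels)
```

From here, degradations such as blur, noise, or perspective warps can be applied to the image while the labels are transformed accordingly.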

The accuracy of our OCR and STR models surpasses that of established vendors, in large part due to training on our synthesized data.

Figure 2: An example of a synthetic image generated by one of our models


Figure 3: Results of our STR model

Higher-Level Document Understanding

In addition, we are working on extracting information from documents into a structured form that is amenable to NLP processing. Our models analyze the layout and reading order of complex documents, understand visual elements and represent them in a multimodal manner, and interpret plots, charts, and diagrams.
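
One way to picture that structured form is a simple schema in which every layout element carries its type, location, and position in the reading order. The sketch below is a hypothetical illustration, not our actual representation:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Block:
    """One layout element on a page; `reading_index` encodes reading order."""
    block_type: str                 # e.g. "paragraph", "table", "figure", "chart"
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page coordinates
    reading_index: int
    text: str = ""                  # empty for purely visual elements

@dataclass
class Page:
    number: int
    blocks: List[Block] = field(default_factory=list)

    def in_reading_order(self) -> List[Block]:
        """Return blocks sorted so downstream NLP sees text as a human would read it."""
        return sorted(self.blocks, key=lambda b: b.reading_index)
```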