Scientific documents such as papers, reports and patents, as well as other professional documents such as financial or medical reports, often include numerous diagrams. These diagrams illustrate graphically the data sets that explain, describe or emphasize the textual content of those documents.
Such data sets can be generated by experiments, measurements, observations or other means, and are uniquely depicted in these technical diagrams for the reader to extract the message they convey in a fast and efficient way.
With the emergence of Internet searches and archival storage, together with the speed at which new scientific documents are constantly being created, it is of great value to have tools that can not only scan numerous documents and extract their main scientific information automatically, but also present this information in a concise and meaningful way.
However, for a document to be completely and thoroughly analyzed, its diagrams also need to be processed in order to extract the key information presented by the depicted data sets.
The problem is that such graphs are typically stored as bitmap images, the data sets are often very noisy, and the graphic symbols used to depict the data, such as lines, markers and labels, often overlap, intersect or otherwise obscure each other.
Cognitive analysis for extracting knowledge from graphics in documents
At IBM Research in Zurich, we are developing computational techniques based on image processing and machine learning to automatically extract the data sets and, in turn, the information they represent. Of the various diagram types in our taxonomy, we are currently focusing on line and scatter plots.
Conventional information retrieval focuses on text. Our challenge is to extract non-textual information from a variety of sources.
Cognitive businesses base their decision-making process on insights extracted from the vast amount of available data.
Computers are blind to graphics in documents.
A tool that can automatically scan through a surfeit of documents, extract the main scientific information, and present it in a concise and meaningful way is of great value.
Comparison between data sets across different documents enables:
- Competitive study analysis
- Establishment of trends
Increases confidence in extracted information.
Automatic graph generation
Valuable addition to document image analysis, document understanding and information retrieval in digital documents.
Semantic understanding of technical diagrams.
Diagram taxonomy: 10 types of diagrams plus flowcharts
The volume of data acquired from public sources is huge
Graphic symbols overlap and overwrite each other, and the data is noisy
Lack of ground truth
No labels, no exact values
To analyze various sources such as graphs, diagrams and images, we first extract the basic objects, such as line segments and characters, together with their topological relations, and use them to generate structural primitives such as lines, boxes, strings or grids.
Next, we map these basic objects into diagram semantics such as axes, labels or legends to create a kind of “graph grammar”.
Finally, we can extract valuable data from previously inaccessible sources.
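The three steps above can be illustrated with a deliberately simplified sketch. It is not the method used by the research group: it assumes the line segments and marker positions have already been detected in the bitmap, picks the axes with a naive "longest horizontal and vertical segment" rule, and assumes linear axes with two known tick values each. All names (Segment, find_axes, pixel_to_data, extract_points) are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A basic object: a straight line segment in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def length(self) -> float:
        return ((self.x2 - self.x1) ** 2 + (self.y2 - self.y1) ** 2) ** 0.5

    def is_horizontal(self, tol: float = 1.0) -> bool:
        return abs(self.y2 - self.y1) <= tol

    def is_vertical(self, tol: float = 1.0) -> bool:
        return abs(self.x2 - self.x1) <= tol


def find_axes(segments):
    """Map primitives to semantics (assumed heuristic): the longest
    horizontal segment is the x-axis, the longest vertical one the y-axis."""
    horizontal = [s for s in segments if s.is_horizontal()]
    vertical = [s for s in segments if s.is_vertical()]
    if not horizontal or not vertical:
        raise ValueError("need at least one horizontal and one vertical segment")
    x_axis = max(horizontal, key=lambda s: s.length)
    y_axis = max(vertical, key=lambda s: s.length)
    return x_axis, y_axis


def pixel_to_data(pixel, tick_a, tick_b):
    """Linearly interpolate a pixel position between two ticks,
    each given as (pixel_position, data_value)."""
    (p0, v0), (p1, v1) = tick_a, tick_b
    if p0 == p1:
        raise ValueError("ticks must lie at different pixel positions")
    return v0 + (pixel - p0) * (v1 - v0) / (p1 - p0)


def extract_points(markers, x_ticks, y_ticks):
    """Convert marker centers (pixel x, pixel y) into data coordinates."""
    return [
        (pixel_to_data(px, *x_ticks), pixel_to_data(py, *y_ticks))
        for px, py in markers
    ]


# Made-up input: two axes, one tick mark and one diagonal data segment.
segments = [
    Segment(50, 400, 450, 400),   # long horizontal line
    Segment(50, 400, 50, 50),     # long vertical line
    Segment(150, 400, 150, 405),  # short tick mark
    Segment(100, 300, 200, 200),  # part of a plotted curve
]
x_axis, y_axis = find_axes(segments)

# Assumed calibration: tick labels read at the ends of each axis.
# Image y grows downward, so the y ticks run from pixel 400 up to pixel 50.
x_ticks = ((x_axis.x1, 0.0), (x_axis.x2, 10.0))
y_ticks = ((y_axis.y1, 0.0), (y_axis.y2, 7.0))

print(extract_points([(250, 200)], x_ticks, y_ticks))  # [(5.0, 4.0)]
```

In a real diagram the tick values would have to come from recognized label strings, and overlapping or noisy symbols would make both the segment detection and the axis heuristic far less reliable than in this toy input.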
Diagrams under study
The research group in Zurich studied line and scatter plots, molecular pathways, bubble plots, flow charts, scanned tables, and scanned forms.