IBM Research’s open-source toolkit for Deep Search
For the first time, we’re open-sourcing part of the IBM Deep Search Experience in a new toolkit, with the goal of spurring on the rate of scientific discovery.
Every organization is built on documents, from legal briefs, financial statements, and technical specifications to research papers and slide decks. These documents are packed with valuable information, but their contents are often hard to search because they’re in an unstructured format that can’t easily be transferred to a database. In fact, IDC estimates that 80% of all global data will be unstructured by 2025.
IBM Research’s Deep Search already allows scientists and businesses to search mountains of unstructured data. Now we’re making Deep Search even more versatile and accessible with the release of Deep Search for Scientific Discovery (DS4SD), an open-source toolkit for scientific research and businesses.
Following the launch of the Generative Toolkit for Scientific Discovery (GT4SD) in March, the availability of DS4SD marks the next leap toward our ultimate goal of building an Open Science Hub for Accelerated Discovery.
To help achieve this goal, we’re now publicly releasing a key component of the Deep Search Experience: our automatic document conversion service. It lets users upload documents interactively and inspect the quality of each document’s conversion. DS4SD has a simple drag-and-drop interface, making it easy for non-experts to use. We’re also releasing deepsearch-toolkit, a Python package that lets users programmatically upload and convert documents in bulk. Users can point to a folder and direct the toolkit to upload the documents, convert them, and ultimately analyze the contents of the text, tables, and figures.
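The "point to a folder" workflow can be sketched in a few lines of Python. This is a minimal illustration, not the deepsearch-toolkit API: the helper names `find_documents`, `batched`, and `convert_folder` are hypothetical, and the `convert_batch` argument stands in for whatever upload-and-convert call the real package provides.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator


def find_documents(folder: Path, suffixes: Iterable[str] = (".pdf",)) -> list[Path]:
    """Recursively collect files under `folder` whose suffix matches (case-insensitive)."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in folder.rglob("*") if p.is_file() and p.suffix.lower() in wanted
    )


def batched(paths: list[Path], size: int) -> Iterator[list[Path]]:
    """Yield consecutive slices of `paths` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def convert_folder(
    folder: Path,
    convert_batch: Callable[[list[Path]], list[dict]],
    batch_size: int = 10,
) -> list[dict]:
    """Send every document in `folder` through `convert_batch`, batch by batch.

    `convert_batch` is a placeholder (an assumption of this sketch) for the
    call that uploads a batch and returns one converted record per document.
    """
    results: list[dict] = []
    for batch in batched(find_documents(folder), batch_size):
        results.extend(convert_batch(batch))
    return results
```

Passing the converter in as a callable keeps the folder handling independent of any particular service, so the same loop can be exercised with a stub before it is wired to a real conversion backend.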
The new toolkit interacts and integrates with existing services, and is available to data scientists and engineers through our Python package. Because the toolkit is open source, we welcome contributions from the developer community.
There is a lot of value in unstructured data for scientific research. Consider IBM’s Project Photoresist, for example: We used Deep Search in 2020 to find and synthesize a novel photoacid generator molecule for semiconductor manufacturing. These generators pose environmental risks, and we wanted to discover a better option. Deep Search can ingest data up to 1,000 times faster and screen the data up to 100 times faster than a manual alternative, which allowed us to identify three candidate photoacid generators by the end of 2020 [1]. With our end-to-end, AI-powered workflow, we scaled and handled the problem with a speed that human scientists simply cannot match, dramatically accelerating the discovery process [2].
Deep Search uses AI to collect, convert, curate, and ultimately search huge document collections for information that is too specific for common search tools to handle. It collects data from public, private, structured, and unstructured sources and leverages state-of-the-art AI methods [3, 4, 5, 6] to convert PDF documents into an easily decipherable JSON format with a uniform schema that is ideal for today’s data scientists. It then applies dedicated natural language processing and computer vision machine-learning algorithms to these documents and ultimately creates searchable knowledge graphs.
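To illustrate why a uniform schema matters for downstream analysis, here is a small sketch. The sample document and its keys (`name`, `texts`, `tables`, `rows`) are hypothetical placeholders chosen for this example; they are not the actual Deep Search output schema.

```python
import json

# Hypothetical converted document. The keys are illustrative assumptions,
# not the real Deep Search JSON schema.
SAMPLE = json.loads("""
{
  "name": "example.pdf",
  "texts": [
    {"type": "title", "text": "A study of photoacid generators"},
    {"type": "paragraph", "text": "We screened candidate molecules."}
  ],
  "tables": [
    {"caption": "Candidates", "rows": [["id", "score"], ["A", "0.9"]]}
  ]
}
""")


def summarize(doc: dict) -> dict:
    """Count paragraphs, tables, and table rows in one converted document."""
    texts = doc.get("texts", [])
    tables = doc.get("tables", [])
    return {
        "name": doc.get("name", ""),
        "paragraphs": sum(1 for t in texts if t.get("type") == "paragraph"),
        "tables": len(tables),
        "table_rows": sum(len(t.get("rows", [])) for t in tables),
    }
```

Because every converted document would share the same layout, a function like `summarize` can run unchanged over an entire collection, whatever the original PDFs looked like.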
The resulting datasets can be used to help businesses build models and identify key trends that inform their decisions. For example, they could compare an acquisition target’s financial performance over the past five years with its executive turnover during that time. There are exciting applications for Deep Search in healthcare, climate science, and materials research, and anywhere else that large document collections have to be searched, and Deep Search makes it easier to get started.
Deep Search previously required users to provide their own data or documents to be searched. We’ve now added over 364 million public documents, such as patents and research papers. Commercial users of Deep Search can quickly get started searching this data, adding their own data incrementally as well.
The public release of our automatic document conversion service is only the first step for DS4SD. New capabilities, such as AI models and high-quality data sources, will be made available in the future.
References
1. Photoacid generator. GI Meijer, V Weber, PWJ Staar. US Patent App. 17/101,148 (US20220163886A1).
2. Accelerating materials discovery using artificial intelligence, high performance computing and robotics. EO Pyzer-Knapp, JW Pitera, PWJ Staar, S Takeda, T Laino, DP Sanders, … npj Computational Materials 8 (1), 1-9.
3. DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis. B Pfitzmann, C Auer, M Dolfi, AS Nassar, PWJ Staar. arXiv preprint arXiv:2206.01062.
4. TableFormer: Table Structure Understanding with Transformers. A Nassar, N Livathinos, M Lysak, P Staar. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …
5. Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness. C Auer, M Dolfi, A Carvalho, CB Ramis, PWJ Staar. arXiv preprint arXiv:2206.00785.
6. Robust PDF document conversion using recurrent neural networks. N Livathinos, C Berrospi, M Lysak, V Kuropiatnyk, A Nassar, A Carvalho, ... Proceedings of the AAAI Conference on Artificial Intelligence 35 (17), 15137 …