Deep Search

Overview

IBM Deep Search uses AI to collect, convert, curate, and ultimately search large document collections like public documents, such as patents and research papers. It makes information accessible that is too specific for common search tools to handle. It collects data from public, private, structured, and unstructured sources and leverages state-of-the-art AI methods to convert PDF documents into easily decipherable JSON format with a uniform schema that is ideal for today’s data scientists. It then applies dedicated natural language processing and computer vision machine-learning algorithms on these documents and ultimately creates searchable knowledge graphs.

IBM Deep Search has already allowed scientists and businesses to search mountains of unstructured data for a while. In 2022, our team made deep search even more versatile and accessible with the release of IBM Deep Search for Scientific Discovery (DS4SD), an open-source toolkit for scientific research and businesses.

You can try our demo here or find out more about IBM's research in accelerated discovery.

Publications

Identifying global biases in hydro-hazard research by mining the scientific literature
- - Lina Stein
  - Karthik Mukkavilli
  - et al.
- 2024
- EGU 2024
ESG Accountability Made Easy: DocQA at Your Service
- - Lokesh Mishra
  - Cesar Berrospi Ramis
  - et al.
- 2024
- AAAI 2024
MolGrapher: Graph-based Visual Recognition of Chemical Structures
- - Lucas Morin
  - Martin Danelljan
  - et al.
- 2023
- ICCV 2023
Optimized Table Tokenization for Table Structure Recognition
- - Maxim Lysak
  - Ahmed Nassar
  - et al.
- 2023
- ICDAR 2023
PatCID: Large-scale chemical-structure database from images in patent documents
- - Ingmar Meijer
  - Valery Weber
  - et al.
- 2023
- ACS Fall 2023
Extracting Text Representations for Terms and Phrases in Technical Domains
- - Francesco Fusco
  - Diego Antognini
- 2023
- ACL 2023
pNLP-Mixer: an Efficient all-MLP Architecture for Language
- - Francesco Fusco
  - Peter Staar
  - et al.
- 2023
- ACL 2023
Unsupervised Term Extraction for Highly Technical Domains
- - Francesco Fusco
  - Peter Staar
  - et al.
- 2022
- EMNLP 2022

Resources

Blog Post

Docling: The missing document processing companion for generative AI

Red Hat13 Nov 2024

Blog Post

A new tool to unlock data from enterprise documents for generative AI

IBM Research Blog12 Nov 2024

Blog Post

AI is making extracting key information from reports easier than ever

IBM Research Blog22 Feb 2024

Blog Post

IBM Research’s open-source toolkit for Deep Search

IBM Research Blog11 Jul 2022

Presentation

Deep Search for Scientific Discovery

IBM Research07 Jul 2022

Blog Post

Deep Document Understanding: IBM’s AI extracts data from complex documents

IBM Research Blog15 Apr 2021

Contributors

PS

Peter Staar

Peter Staar

CA

Christoph Auer

Christoph Auer

CB

Cesar Berrospi Ramis

Cesar Berrospi Ramis

MD

Michele Dolfi

Michele Dolfi

KD

Kasper Dinkla

Kasper Dinkla

YK

Yusik Kim

Yusik Kim

VK

Viktor Kuropiatnyk

Viktor Kuropiatnyk

NL

Nikos Livathinos

Nikos Livathinos

ML

Maksym Lysak

Maksym Lysak

IM

Ingmar Meijer

Ingmar Meijer

LM

Lucas Morin

Lucas Morin

AN

Ahmed Nassar

Ahmed Nassar

RT

Rafael Teixeira de Lima

Rafael Teixeira de Lima

PV

Panos Vagenas

Panos Vagenas