Deep Search

Overview

IBM Deep Search uses AI to collect, convert, curate, and ultimately search large document collections, including public documents such as patents and research papers. It makes accessible information that is too specific for common search tools to handle. It collects data from public, private, structured, and unstructured sources and uses state-of-the-art AI methods to convert PDF documents into JSON with a uniform schema that is convenient for data scientists to work with. It then applies dedicated natural language processing and computer vision machine-learning algorithms to these documents and ultimately creates searchable knowledge graphs.
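The flow described above (converted documents in a uniform schema, term extraction, a searchable graph) can be illustrated with a minimal sketch. This is not the Deep Search API or its actual schema: the field names (`id`, `title`, `texts`), the stopword list, and the helper functions `extract_terms`, `build_graph`, and `search` are all hypothetical, and simple word splitting stands in for the real NLP models.

```python
import json
from collections import defaultdict

# Hypothetical converted documents. The field names ("id", "title", "texts")
# are illustrative assumptions, not the actual Deep Search schema.
CONVERTED_DOCS = [
    {"id": "doc-1", "title": "Table structure recognition",
     "texts": ["Transformers predict table structure from page images."]},
    {"id": "doc-2", "title": "Layout segmentation",
     "texts": ["Page images are segmented into layout regions."]},
]

# Assumed stopword list, kept tiny for the example.
STOPWORDS = {"are", "from", "into", "the", "a", "of"}


def extract_terms(doc):
    """Toy stand-in for NLP term extraction: lowercase words minus stopwords."""
    words = " ".join([doc["title"], *doc["texts"]]).lower().split()
    cleaned = (w.strip(".,;:") for w in words)
    return {w for w in cleaned if w and w not in STOPWORDS}


def build_graph(docs):
    """Link each term to the ids of the documents that mention it."""
    graph = defaultdict(set)
    for doc in docs:
        for term in extract_terms(doc):
            graph[term].add(doc["id"])
    return graph


def search(graph, query):
    """Return sorted ids of documents linked to every term in the query."""
    hits = [graph.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*hits)) if hits else []


graph = build_graph(CONVERTED_DOCS)
print(json.dumps(search(graph, "page images")))      # ["doc-1", "doc-2"]
print(json.dumps(search(graph, "table structure")))  # ["doc-1"]
```

Because every document shares the same shape, the extraction and graph-building steps need no per-source special cases, which is the practical benefit of a uniform schema.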

For some time, IBM Deep Search has let scientists and businesses search mountains of unstructured data. In 2022, our team made Deep Search even more versatile and accessible with the release of IBM Deep Search for Scientific Discovery (DS4SD), an open-source toolkit for scientific research and businesses.

You can try our demo or find out more about IBM's research in accelerated discovery.

Publications

BusiNet - a Light and Fast Text Detection Network for Business Documents
- Oshri Naparstek, Ophir Azulai, et al.
- 2022
- KDD 2022
- Workshop paper

DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation
- Birgit Pfitzmann, Christoph Auer, et al.
- 2022
- KDD 2022
- Conference paper

Unsupervised Term Extraction for Highly Technical Domains
- Francesco Fusco, Peter Staar, et al.
- 2022
- EMNLP 2022
- Conference paper

Unsupervised Domain Generalization by Learning a Bridge Across Domains
- Sivan Harary, Eli Schwartz, et al.
- 2022
- CVPR 2022
- Conference paper

TableFormer: Table Structure Understanding with Transformers
- Ahmed Nassar, Nikolaos Livathinos, et al.
- 2022
- CVPR 2022
- Conference paper

Racial Representation Analysis in Dermatology Academic Materials
- Girmaw Abebe Tadesse, Celia Cintas, et al.
- 2021
- AMIA Annual Symposium 2021
- Short paper

Robust PDF Document Conversion Using Recurrent Neural Networks
- Nikolaos Livathinos, Cesar Berrospi, et al.
- 2021
- IAAI 2021
- Poster

DCA++: A software framework to solve correlated electron problems with modern quantum cluster methods
- Urs R. Hähner, Gonzalo Alvarez, et al.
- 2020
- Computer Physics Communications
- Paper

Resources

Blog Post

Docling: The missing document processing companion for generative AI

Red Hat, Nov 13, 2024

Blog Post

A new tool to unlock data from enterprise documents for generative AI

IBM Research Blog, Nov 12, 2024

Blog Post

AI is making extracting key information from reports easier than ever

IBM Research Blog, Feb 22, 2024

Blog Post

IBM Research’s open-source toolkit for Deep Search

IBM Research Blog, Jul 11, 2022

Presentation

Deep Search for Scientific Discovery

IBM Research, Jul 7, 2022

Blog Post

Deep Document Understanding: IBM’s AI extracts data from complex documents

IBM Research Blog, Apr 15, 2021

Contributors

Peter Staar

Principal RSM; Master Inventor; Manager of `AI for Knowledge` group.

Christoph Auer

Senior Research Scientist

Cesar Berrospi Ramis

Senior Research Scientist

Michele Dolfi

Senior Technical Staff Member (STSM)

Kasper Dinkla

Yusik Kim

Research Scientist

Viktor Kuropiatnyk

Software Engineer

Nikos Livathinos

Senior Software Engineer & Certified Architect

Maksym Lysak

Software Engineering & R&D of AI Systems

Ingmar Meijer

Senior Technical Staff Member

Ahmed Nassar

Research Scientist

Rafael Teixeira de Lima

Research Scientist

Panos Vagenas

Advisory Engineer, Data & AI