To help researchers access structured and unstructured data quickly, IBM Research has developed a cloud-based AI research service that has ingested a corpus of thousands of papers from the COVID-19 Open Research Dataset (CORD-19) and licensed databases from DrugBank and GenBank. The tool uses advanced AI to let users run specific queries against these collections and extract critical COVID-19 knowledge, including embedded text, tables and figures.




To unlock the knowledge in published structured and unstructured data on COVID-19, IBM researchers are making available two key technologies: the Corpus Conversion Service and the Corpus Processing Service. Both are already in extensive use in the materials science, automotive and energy industries.

The Corpus Conversion Service can ingest 100,000 PDF pages per day, including scanned documents, on a single server. It then trains and applies advanced machine learning models that extract the content of these documents with high accuracy and at large scale. We have applied this technology to thousands of PDFs on the coronavirus and COVID-19 and combined the results with curated databases from DrugBank and GenBank.
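To put the quoted throughput in perspective, a quick back-of-the-envelope calculation converts 100,000 pages per day into a per-second rate. The average paper length used below is a hypothetical value chosen purely for illustration; it is not a figure from the service.

```python
# Back-of-the-envelope throughput for the quoted 100,000 pages per day.
PAGES_PER_DAY = 100_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

pages_per_second = PAGES_PER_DAY / SECONDS_PER_DAY

# Hypothetical average paper length, assumed for illustration only.
ASSUMED_PAGES_PER_PAPER = 10
papers_per_day = PAGES_PER_DAY // ASSUMED_PAGES_PER_PAPER

print(f"{pages_per_second:.2f} pages per second")
print(f"{papers_per_day} papers per day at {ASSUMED_PAGES_PER_PAPER} pages each")
```

Under that assumption, a single server sustains a little over one page per second around the clock.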

The Corpus Processing Service integrates data from databases and publications into a knowledge graph, so that these can be queried to retrieve known facts and to generate novel insights.

Examples of the types of queries:

  1. Which drugs have been used so far, and what are the outcomes?
  2. Which new risk factors have been reported?

Corpus Processing Service features

The Corpus Conversion Service allows us to convert the latest PDF papers (e.g. from bioRxiv) into JSON documents. These can be ingested into a knowledge graph as unstructured data, allowing users to explore the latest published research.
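As a rough illustration of this step, the sketch below loads one converted JSON document and adds it to a small in-memory graph. The JSON field names (`doc_id`, `title`, `paragraphs`), the `ingest` helper, and the keyword lookup standing in for entity extraction are all assumptions made for this example; they do not reflect the actual output schema or models of the Corpus Conversion Service.

```python
import json
from collections import defaultdict

# Hypothetical converted document. The field names are assumptions,
# not the real Corpus Conversion Service schema.
raw = json.dumps({
    "doc_id": "paper-001",
    "title": "Example preprint",
    "paragraphs": [
        "Remdesivir was evaluated in hospitalized patients.",
        "Hypertension was reported as a risk factor.",
    ],
})

# Toy vocabulary standing in for a real entity-recognition model.
KNOWN_TERMS = {"remdesivir": "drug", "hypertension": "risk_factor"}


def ingest(document_json, nodes, edges):
    """Add one document, its paragraphs, and mentioned terms to the graph."""
    doc = json.loads(document_json)
    doc_node = ("document", doc["doc_id"])
    nodes.add(doc_node)
    for index, text in enumerate(doc["paragraphs"]):
        para_node = ("paragraph", f"{doc['doc_id']}#{index}")
        nodes.add(para_node)
        edges[doc_node].add(para_node)  # document -> paragraph
        lowered = text.lower()
        for term, kind in KNOWN_TERMS.items():
            if term in lowered:
                term_node = (kind, term)
                nodes.add(term_node)
                edges[para_node].add(term_node)  # paragraph -> mentioned term


nodes = set()
edges = defaultdict(set)
ingest(raw, nodes, edges)
edge_count = sum(len(targets) for targets in edges.values())
print(len(nodes), "nodes,", edge_count, "edges")
```

Keeping paragraphs as their own nodes means a later query can return the exact passage that mentions a term rather than only the enclosing document.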

The knowledge graph incorporates data from various sources, both unstructured (e.g. CORD-19 documents and converted PDF files) and structured (e.g. DrugBank, GenBank and clinical trials). The current knowledge graph contains approximately 4 million nodes and 50 million edges.

The knowledge graph will be updated and extended regularly to incorporate newly reported data.

For advanced users, we offer deep search capabilities. These let users build complex query workflows on the knowledge graph to obtain specific answers from the literature. The example above shows a search for evidence on the incubation time.
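A query workflow of this kind can be pictured as a short chain of steps: look up a concept, follow its edges to the paragraphs that mention it, then extract the values of interest. The sketch below does this on a toy graph. The data structures, the `find_evidence` helper, and the paragraph texts are synthetic placeholders invented for the example; they are neither findings from the literature nor the service's actual query interface.

```python
import re

# Toy graph: concept -> paragraph ids, and paragraph id -> text.
# All sentences are synthetic placeholders, not real findings.
concept_to_paragraphs = {
    "incubation period": ["p1", "p3"],
    "risk factor": ["p2"],
}
paragraph_text = {
    "p1": "The incubation period was estimated at 5 days in this cohort.",
    "p2": "Age was reported as a risk factor.",
    "p3": "An incubation period of up to 14 days was observed.",
}

DAYS_PATTERN = re.compile(r"(\d+)\s+days")


def find_evidence(concept):
    """Return (paragraph id, day counts, text) for paragraphs linked to a concept.

    Step 1: look up the concept node.
    Step 2: follow its edges to paragraph nodes.
    Step 3: extract day counts as candidate evidence.
    """
    results = []
    for para_id in concept_to_paragraphs.get(concept, []):
        text = paragraph_text[para_id]
        days = [int(value) for value in DAYS_PATTERN.findall(text)]
        if days:
            results.append((para_id, days, text))
    return results


evidence = find_evidence("incubation period")
for para_id, days, text in evidence:
    print(para_id, days, text)
```

Returning the source paragraph alongside each extracted value keeps every answer traceable to the text it came from.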

