Use deep search to explore the COVID-19 corpus
To help researchers access structured and unstructured data quickly, IBM Research has developed a cloud-based AI research service that has ingested a corpus of thousands of papers from the COVID-19 Open Research Dataset (CORD-19) and licensed databases from DrugBank, ClinicalTrials.gov and GenBank. The tool lets users run specific queries against these collections and extract critical COVID-19 knowledge, including content embedded in text, tables and figures.
Corpus Conversion Service and Corpus Processing Service
To unlock the knowledge in published unstructured and structured data on COVID-19, IBM researchers are making available two key technologies: the Corpus Conversion Service and the Corpus Processing Service. Both are already in extensive use in the materials science, automotive and energy industries.
The Corpus Conversion Service can ingest 100,000 PDF pages per day (including scanned documents) on a single server, and then train and apply advanced machine learning models that extract the content of these documents with high accuracy and at large scale. We have applied this technology to thousands of PDFs on the coronavirus and COVID-19 and combined the results with curated databases from DrugBank, ClinicalTrials.gov and GenBank.
The Corpus Processing Service integrates data from databases and publications into a knowledge graph, so that these can be queried to retrieve known facts and to generate novel insights.
Examples of the types of queries:
- Which drugs have been used so far, and what are the outcomes?
- Identify newly reported risk factors
Corpus Processing Service features
The Corpus Conversion Service allows us to convert the latest PDF papers (e.g. from bioRxiv) into JSON documents. These can be ingested into a knowledge graph as unstructured data, allowing users to explore the latest published research.
The knowledge graph incorporates data from various sources, both unstructured (e.g. CORD-19 documents and converted PDF files) and structured (e.g. DrugBank, GenBank and clinical trials). The current knowledge graph contains approximately 4 million nodes and 50 million edges.
The knowledge graph will be updated and extended regularly to incorporate newly reported data.
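As a rough illustration of how converted documents and structured records can meet in one graph, the sketch below builds a tiny in-memory knowledge graph. It is not the Corpus Processing Service's implementation: the JSON schema, the `KnowledgeGraph` class, the `build_graph` helper, the record ids, and the substring-based linking are all assumptions made for this example.

```python
import json
from collections import defaultdict

# Hypothetical JSON as a PDF conversion step might emit it (schema is assumed).
converted_paper = json.loads("""
{
  "id": "paper-001",
  "title": "Example preprint",
  "paragraphs": [
    "Remdesivir was administered to hospitalized patients.",
    "Age was reported as a risk factor."
  ]
}
""")

# Hypothetical structured records standing in for a curated drug database.
drug_records = [
    {"id": "drug-A", "name": "remdesivir"},
    {"id": "drug-B", "name": "lopinavir"},
]


class KnowledgeGraph:
    """A tiny in-memory graph: typed nodes plus undirected, labeled edges."""

    def __init__(self):
        self.nodes = {}                # node id -> attribute dict
        self.edges = defaultdict(set)  # node id -> {(label, neighbor id)}

    def add_node(self, node_id, kind, **attrs):
        self.nodes[node_id] = {"kind": kind, **attrs}

    def add_edge(self, source, label, target):
        # Store both directions so the edge can be followed from either end.
        self.edges[source].add((label, target))
        self.edges[target].add((label, source))

    def neighbors(self, node_id, label=None):
        # With label=None every edge matches; otherwise only edges with that label.
        return sorted(t for (l, t) in self.edges[node_id] if label in (None, l))


def build_graph(paper, drugs):
    graph = KnowledgeGraph()
    graph.add_node(paper["id"], "document", title=paper["title"])
    for drug in drugs:
        graph.add_node(drug["id"], "drug", name=drug["name"])
    for index, text in enumerate(paper["paragraphs"]):
        paragraph_id = f"{paper['id']}/p{index}"
        graph.add_node(paragraph_id, "paragraph", text=text)
        graph.add_edge(paper["id"], "contains", paragraph_id)
        # Naive linking by case-insensitive substring match (illustration only).
        for drug in drugs:
            if drug["name"] in text.lower():
                graph.add_edge(paragraph_id, "mentions", drug["id"])
    return graph


graph = build_graph(converted_paper, drug_records)
print(graph.neighbors("drug-A", "mentions"))  # paragraphs that mention drug-A
```

A production system would replace the substring match with trained entity extraction and would persist the graph in a dedicated store; the point here is only that both kinds of source end up as nodes and edges in the same structure.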
For advanced users, we offer deep search capabilities. These let users build complex query workflows on the knowledge graph to obtain specific answers from the literature, for example by searching for evidence on the incubation time.
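A query workflow of this kind can be thought of as a chain of simple graph operations. The following sketch runs a two-step workflow (search, then traverse) over a toy graph to collect passages about the incubation time. The node ids, texts, edge labels, and the functions `search`, `traverse_back`, and `incubation_evidence` are invented for illustration and do not reflect the service's actual query interface.

```python
import re

# Toy graph; all ids, titles, and texts are invented placeholders.
NODES = {
    "doc-1": {"kind": "document", "title": "Hypothetical study A"},
    "doc-2": {"kind": "document", "title": "Hypothetical study B"},
    "p-1": {"kind": "paragraph", "text": "The incubation period was estimated at N days."},
    "p-2": {"kind": "paragraph", "text": "Patients were treated in intensive care."},
    "p-3": {"kind": "paragraph", "text": "We observed an incubation time of M days."},
}
EDGES = [
    ("doc-1", "contains", "p-1"),
    ("doc-1", "contains", "p-2"),
    ("doc-2", "contains", "p-3"),
]


def search(kind, pattern):
    """Step 1: select nodes of one kind whose text matches a pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        node_id
        for node_id, attrs in NODES.items()
        if attrs["kind"] == kind and regex.search(attrs.get("text", ""))
    ]


def traverse_back(node_ids, label):
    """Step 2: follow labeled edges from each target back to its source."""
    wanted = set(node_ids)
    return {
        target: source
        for source, edge_label, target in EDGES
        if edge_label == label and target in wanted
    }


def incubation_evidence():
    """Workflow: find paragraphs on incubation, then attach their documents."""
    hits = search("paragraph", r"incubation (time|period)")
    parents = traverse_back(hits, "contains")
    return [(NODES[parents[h]]["title"], NODES[h]["text"]) for h in hits]


for title, text in incubation_evidence():
    print(f"{title}: {text}")
```

Splitting the workflow into small, composable steps makes it easy to swap the keyword search for a different selector or to add further traversal and filtering stages.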
The research resources below highlight the scientific advances that drive many of the Corpus Processing Service's capabilities.
An Information Extraction and Knowledge Graph Platform for Accelerating Biochemical Discoveries
Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale
Corpus Conversion Service Makes PDF Content Discoverable