Use deep search to explore the COVID-19 corpus
To help researchers access structured and unstructured data quickly, IBM Research has developed a cloud-based AI research service that has ingested a corpus of thousands of papers from the COVID-19 Open Research Dataset (CORD-19) and licensed databases from DrugBank, ClinicalTrials.gov and GenBank. The tool lets users run specific queries against these collections and extract critical COVID-19 knowledge, including embedded text, tables and figures.
Last updated: 5-January-2021
Corpus Conversion Service and Corpus Processing Service
To unlock the knowledge in published unstructured and structured data on COVID-19, IBM researchers are making available two key technologies: the Corpus Conversion Service and the Corpus Processing Service. Both are already in extensive use in the materials science, automotive and energy industries.
The Corpus Conversion Service can ingest 100,000 PDF pages per day (including scanned documents) on a single server, then train and apply advanced machine learning models that extract the content from these documents with high accuracy at a scale never achieved before. We have applied this technology to thousands of PDFs on the coronavirus and COVID-19 and combined it with curated databases from DrugBank, ClinicalTrials.gov and GenBank.
The Corpus Processing Service integrates data from databases and publications into a knowledge graph, so that these can be queried to retrieve known facts and to generate novel insights.
Examples of the types of queries:
- Which drugs have been used so far, and what were the outcomes?
- Which new risk factors have been reported?
Corpus Processing Service features
The Corpus Conversion Service allows us to convert the latest PDF papers (e.g. from bioRxiv) into JSON documents. These can be ingested into a knowledge graph as unstructured data, allowing users to explore the latest published research.
The knowledge graph incorporates data from various sources, both unstructured (e.g. CORD-19 documents and converted PDF files) and structured (e.g. DrugBank, GenBank and clinical trials). The current knowledge graph contains approximately 4 million nodes and 50 million edges.
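To make the idea of combining unstructured and structured sources in one graph more concrete, here is a minimal sketch in Python. It is not the Corpus Processing Service itself: the JSON layout, the `KnowledgeGraph` class, the relation names and the simple substring matching are all assumptions made for illustration, since the actual schema and linking models are not described here.

```python
import json
from collections import defaultdict

# Hypothetical converted paper; the real JSON schema is not described in the text.
paper_json = json.dumps({
    "id": "paper-001",
    "title": "Example preprint",
    "paragraphs": [
        "Remdesivir was given to hospitalised patients.",
        "No drug is mentioned in this paragraph.",
    ],
})

# Hypothetical structured records standing in for a curated drug database.
drug_records = [{"id": "drug-001", "name": "remdesivir"}]


class KnowledgeGraph:
    """Toy in-memory graph: attribute dicts for nodes, labelled edges."""

    def __init__(self):
        self.nodes = {}                # node id -> attributes
        self.edges = defaultdict(set)  # node id -> {(relation, target id)}

    def add_node(self, node_id, **attrs):
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_edge(self, source, relation, target):
        # Store both directions so the graph can be walked either way.
        self.edges[source].add((relation, target))
        self.edges[target].add((f"inverse_{relation}", source))


def build_graph(paper_json, drug_records):
    graph = KnowledgeGraph()
    # Structured source: one node per database record.
    for drug in drug_records:
        graph.add_node(drug["id"], kind="drug", name=drug["name"])
    # Unstructured source: one node per paper and per paragraph.
    paper = json.loads(paper_json)
    graph.add_node(paper["id"], kind="paper", title=paper["title"])
    for index, text in enumerate(paper["paragraphs"]):
        paragraph_id = f'{paper["id"]}/p{index}'
        graph.add_node(paragraph_id, kind="paragraph", text=text)
        graph.add_edge(paper["id"], "has_paragraph", paragraph_id)
        # Naive linking by substring match; a real system would use trained models.
        for drug in drug_records:
            if drug["name"] in text.lower():
                graph.add_edge(paragraph_id, "mentions", drug["id"])
    return graph


graph = build_graph(paper_json, drug_records)
print(len(graph.nodes), "nodes")
```

The point of the sketch is only that records from a database and paragraphs from a paper end up as nodes in the same structure, connected by edges that a later query can follow.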
The knowledge graph will be updated and extended regularly to incorporate newly reported data.
For advanced users, we offer deep search capabilities. These let users build complex query workflows on the knowledge graph to obtain specific answers from the literature, for example by searching for evidence on the incubation time.
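A query workflow of this kind can be thought of as a chain of steps, each narrowing or transforming the previous result. The following Python sketch illustrates that idea on the incubation-time example. Everything in it is an assumption for illustration: the function names, the regular expression and the toy paragraphs are invented, the numbers are placeholders rather than findings, and the real service's query interface is not shown here.

```python
import re

# Invented toy paragraphs; the numbers are placeholders, not findings.
paragraphs = [
    {"paper": "toy-paper-A", "text": "Toy sentence: incubation period of 3 days."},
    {"paper": "toy-paper-A", "text": "Toy sentence: patients stayed 10 days in hospital."},
    {"paper": "toy-paper-B", "text": "Toy sentence: the incubation time was 4 to 6 days."},
]


def select_by_keyword(items, keyword):
    """Step 1: keep only paragraphs that contain the keyword."""
    return [item for item in items if keyword in item["text"].lower()]


def extract_day_values(items):
    """Step 2: pull 'N days' or 'N to M days' expressions out of each paragraph."""
    pattern = re.compile(r"(\d+)(?:\s*to\s*(\d+))?\s*days")
    evidence = []
    for item in items:
        for match in pattern.finditer(item["text"]):
            values = [int(group) for group in match.groups() if group]
            evidence.append(
                {"paper": item["paper"], "days": values, "text": item["text"]}
            )
    return evidence


def run_workflow(items, steps):
    """Apply each step to the output of the previous one."""
    result = items
    for step in steps:
        result = step(result)
    return result


evidence = run_workflow(
    paragraphs,
    [lambda items: select_by_keyword(items, "incubation"), extract_day_values],
)
for row in evidence:
    print(row["paper"], row["days"])
```

Keeping each step as a plain function makes it easy to reorder or swap steps, which is the property a workflow builder needs; each result also carries its source text so the evidence can be checked against the paper.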
Access will be granted to scientists and academics. To request access, the Deep Search service collects your name, email address, affiliation and intended uses. The personal information collected will be used solely to assess whether access will be granted to you and to provide approved individuals with access to our site, content and services. The information collected will not be used for any other purpose. The information will be retained for 12 months. If you were granted access and no longer wish to have it, you can withdraw at any time by submitting a withdrawal request.
DrugBank data is available under a CC-BY-NC 4.0 licence. The datasets can be used freely in a non-commercial application or project. If you are interested in using DrugBank data in a commercial product or application, please see the DrugBank release page.
The research resources below highlight the scientific advances that drive many of the Corpus Processing Service's capabilities.
An Information Extraction and Knowledge Graph Platform for Accelerating Biochemical Discoveries
Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale
Corpus Conversion Service Makes PDF Content Discoverable