IBM J. Res. Dev.

Hybrid natural language processing for high-performance patent and literature mining in IBM Watson for Drug Discovery


IBM Watson for Drug Discovery (WDD) is a cognitive computing software platform for early-stage pharmaceutical research. WDD extracts and cross-references life sciences information from very large-scale structured and unstructured data, identifying connections and correlations in an unbiased manner and enabling more informed decision-making through explainable analytics and scientific visualizations. This paper describes in detail the high-throughput natural language processing system implemented in WDD. This system enables a new WDD release every three weeks, incorporating the latest publications into a continually growing corpus of over 30 million scientific and intellectual property documents, each of which is reprocessed using the latest annotators and structured reference data to extract a set of domain-relevant entity and relationship concepts. The hybrid approach to natural language processing in WDD combines model-based and rule-based techniques, used in concert, for high-performance named entity recognition; a similar ensemble approach is applied to named entity resolution, and the pipeline culminates in semantic relationship extraction. Statistics on full-scale annotation results and example use cases are also provided.
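
The abstract does not specify how the model-based and rule-based recognizers are combined, so the following is only a minimal sketch of one common way to run the two in concert, not the WDD implementation. Every name in it is hypothetical: `rule_based_ner` is a plain dictionary lookup, `toy_model_ner` is a regex stand-in for a trained statistical tagger, and `merge_spans` assumes a simple policy (keep the longest span, and prefer the rule-based span on ties).

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # entity type, e.g. "DRUG" or "GENE"
    source: str  # "rule" or "model"


def rule_based_ner(text: str, lexicon: Dict[str, str]) -> List[Span]:
    """Dictionary lookup: match each lexicon term as a whole word, ignoring case."""
    spans = []
    for term, label in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append(Span(m.start(), m.end(), label, "rule"))
    return spans


def toy_model_ner(text: str) -> List[Span]:
    """Stand-in for a trained tagger (assumption): flag words ending in 'inib' as DRUG."""
    return [
        Span(m.start(), m.end(), "DRUG", "model")
        for m in re.finditer(r"\b\w+inib\b", text, flags=re.IGNORECASE)
    ]


def merge_spans(candidates: List[Span]) -> List[Span]:
    """Greedily keep non-overlapping spans: longest first, rule-based wins ties."""
    ranked = sorted(
        candidates,
        key=lambda s: (-(s.end - s.start), s.source != "rule", s.start),
    )
    chosen: List[Span] = []
    for s in ranked:
        # Accept a span only if it does not overlap anything already chosen.
        if all(s.end <= c.start or s.start >= c.end for c in chosen):
            chosen.append(s)
    return sorted(chosen, key=lambda s: s.start)


if __name__ == "__main__":
    text = "Imatinib inhibits BCR-ABL; dasatinib is also active."
    lexicon = {"imatinib": "DRUG", "BCR-ABL": "GENE"}  # illustrative entries only
    for span in merge_spans(rule_based_ner(text, lexicon) + toy_model_ner(text)):
        print(text[span.start:span.end], span.label, span.source)
```

In this toy example the lexicon covers "Imatinib" and "BCR-ABL", while the stand-in model contributes "dasatinib", which the lexicon lacks. That is the basic motivation for a hybrid: curated reference data gives precise matches for known terms, and a model can generalize to mentions the dictionary has not yet seen.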